
Thesis Defense: Large-Scale Graph Computation on Just a PC

Aapo Kyrölä (akyrola@cs.cmu.edu)

Thesis Committee:
Carlos Guestrin (University of Washington & CMU)
Guy Blelloch (CMU)
Alex Smola (CMU)
Dave Andersen (CMU)
Jure Leskovec (Stanford)

2

Research Fields

(figure) This thesis sits at the intersection of several research fields: machine learning and data mining (graph analysis and mining) provide the motivation and applications, while the research contributions lie in external-memory algorithms research, systems research, and databases.

3

Why Graphs?

Large-Scale Graph Computation on Just a PC

4

Big Data with Structure: Big Graph

(examples) social graph, follow-graph, consumer-products graph, user-movie ratings graph, DNA interaction graph, WWW link graph, communication networks (but "only 3 hops")

5

Why on a single machine?

Can't we just use the Cloud?

Large-Scale Graph Computation on Just a PC

6

Why use a cluster?

Two reasons:
1. One computer cannot handle my graph problem in a reasonable time.
2. I need to solve the problem very fast.

7

Why use a cluster?

Two reasons:
1. One computer cannot handle my graph problem in a reasonable time.
2. I need to solve the problem very fast.

Our work expands the space of feasible problems on one machine (PC): our experiments use the same graphs as, or bigger graphs than, previous papers on distributed graph computation (+ we can do the Twitter graph on a laptop).

Our work raises the bar on the required performance for a "complicated" system.

8

Benefits of single machine systems

Assuming it can handle your big problems…

1. Programmer productivity
– Global state, debuggers…

2. Inexpensive to install, administer; less power.

3. Scalability
– Use a cluster of single-machine systems to solve many tasks in parallel. Idea: trade latency for throughput.

9

Computing on Big Graphs

Large-Scale Graph Computation on Just a PC

10

Big Graphs = Big Data

Data size: 140 billion connections ≈ 1 TB. Not a problem!

Computation: hard to scale.

Twitter network visualization by Akshay Java 2009

11

Research Goal

Compute on graphs with billions of edges in a reasonable time on a single PC

– Reasonable = close to numbers previously reported for distributed systems in the literature.

Experiment PC: Mac Mini (2012)

12

Terminology

• (Analytical) Graph Computation
– Whole graph is processed, typically for several iterations; vertex-centric computation.
– Examples: Belief Propagation, Pagerank, Community detection, Triangle Counting, Matrix Factorization, Machine Learning…

• Graph Queries (database)
– Selective graph queries (compare to SQL queries).
– Traversals: shortest-path, friends-of-friends…

13

(figure: overview of the thesis work)
GraphChi (Parallel Sliding Windows): batch computation, incremental computation, evolving graphs; DrunkardMob: parallel random walk simulation; GraphChi^2.
GraphChi-DB (Partitioned Adjacency Lists): online graph updates, edge and vertex properties.
Graph Computation: Triangle Counting, Item-Item Similarity, PageRank, SALSA, HITS, Graph Contraction, Minimum Spanning Forest, k-Core, Weakly/Strongly Connected Components, Community Detection, Label Propagation, Multi-BFS, Matrix Factorization, Loopy Belief Propagation, Co-EM.
Graph Queries: Shortest Path, Friends-of-Friends, Induced Subgraphs, Neighborhood query, Link prediction, Graph sampling, Graph traversal.

Thesis statement

The Parallel Sliding Windows algorithm and the Partitioned Adjacency Lists data structure enable computation on very large graphs in external memory, on just a personal computer.

14

DISK-BASED GRAPH COMPUTATION

15

(figure: the overview diagram again, here with the GraphChi / graph-computation half in focus: Parallel Sliding Windows, batch computation, evolving graphs, and the graph-computation applications listed on the previous overview)

16

Computational Model

• Graph G = (V, E)
– directed edges: e = (source, destination)
– each edge and vertex associated with a value (user-defined type)
– vertex and edge values can be modified
• (structure modification also supported)

(figure: vertices A and B connected by edge e, with data attached to every vertex and edge)

Terms: e is an out-edge of A, and an in-edge of B.

17


Vertex-centric Programming

• "Think like a vertex"
• Popularized by the Pregel and GraphLab projects
– Historically: systolic computation and the Connection Machine

MyFunc(vertex):  # modify neighborhood

function Pagerank(vertex):
    insum = sum(edge.value for edge in vertex.inedges)
    vertex.value = 0.15 + 0.85 * insum
    foreach edge in vertex.outedges:
        edge.value = vertex.value / vertex.num_outedges

18

Computational Setting

Constraints:
A. Not enough memory to store the whole graph in memory, nor all the vertex values.
B. Enough memory to store one vertex and its edges, with associated values.

19

The Main Challenge of Disk-based Graph Computation

Random Access

~100K reads / sec (commodity); ~1M reads / sec (high-end arrays)

<< 5-10 M random edges / sec needed to achieve "reasonable performance"

20

Random Access Problem

(figure: a file of edge-values holding A's in-edges and out-edges and B's in-edges and out-edges; processing it sequentially forces random reads and random writes for the other direction)

Moral: you can either access in-edges or out-edges sequentially, but not both!

21

Our Solution

Parallel Sliding Windows (PSW)

22

Parallel Sliding Windows Phases

• PSW processes the graph one sub-graph at a time.
• In one iteration, the whole graph is processed.
– And typically the next iteration is started.

1. Load
2. Compute
3. Write
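An editor's sketch of this iteration structure in Python (the graph_on_disk object and its methods are placeholders for illustration, not the GraphChi API; each load and write in the real system is one large sequential disk transfer):

    def run_psw(graph_on_disk, num_iterations, update):
        for it in range(num_iterations):
            for interval in graph_on_disk.intervals():      # one sub-graph at a time
                subgraph = graph_on_disk.load(interval)      # 1. Load
                for vertex in subgraph.vertices():           # 2. Compute (user-defined update)
                    update(vertex)
                graph_on_disk.write(interval, subgraph)      # 3. Write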

23

PSW: Shards and Intervals

• Vertices are numbered from 1 to n.
– P intervals
– sub-graph = interval of vertices

(figure: interval(1), interval(2), …, interval(P) over vertex ids 1 … 100 … 700 … 10000, one shard per interval; each edge (source, destination) is placed by «partition-by» destination; phases: 1. Load, 2. Compute, 3. Write)

In shards, edges are sorted by source.
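A small hedged sketch of that partitioning rule in Python (the function and interval list are illustrative; real GraphChi preprocessing works on edge files and picks interval boundaries that balance shard sizes):

    def create_shards(edges, intervals):
        """edges: list of (src, dst); intervals: list of (lo, hi) vertex-id ranges."""
        shards = [[] for _ in intervals]
        for src, dst in edges:
            for p, (lo, hi) in enumerate(intervals):
                if lo <= dst <= hi:          # an edge goes to the shard of its destination
                    shards[p].append((src, dst))
                    break
        for shard in shards:
            shard.sort()                     # inside a shard, edges are sorted by source id
        return shards

    intervals = [(1, 100), (101, 700), (701, 1000), (1001, 10000)]
    print(create_shards([(5, 3), (50, 200), (3, 999), (800, 4), (2, 150)], intervals))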

24

Example Layout

(figure) Shard 1 … Shard 4, one per vertex interval: vertices 1-100, 101-700, 701-1000, 1001-10000. Shard 1 holds the in-edges for vertices 1-100, sorted by source id.

Shards are small enough to fit in memory; shard sizes are balanced.

Shard = in-edges for an interval of vertices, sorted by source id.

1 Load

2 Compute

3 Write

25

PSW: Loading a Sub-graph

(figure) Shards 1-4 for vertex intervals 1-100, 101-700, 701-1000, 1001-10000; shard 1 holds the in-edges for vertices 1-100, sorted by source id.

Load the sub-graph for vertices 1-100: load all of their in-edges (shard 1) into memory.

What about out-edges? They are arranged in sequence in the other shards.

1 Load

2 Compute

3 Write

26

PSW: Loading a Sub-graph

(figure) Now load the sub-graph for vertices 101-700: all of their in-edges come from shard 2, loaded into memory, while their out-edge blocks are read from a sliding window over each of the other shards and kept in memory.

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P² random accesses on one full pass.

Works well on both SSD and magnetic hard disks!

28

"GAUSS-SEIDEL" / ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work: Julian Shun

29

Synchronous vs Gauss-Seidel

• Bulk-Synchronous Parallel (Jacobi iterations)
– Updates see neighbors' values from the previous iteration. [Most systems are synchronous]

• Asynchronous (Gauss-Seidel iterations)
– Updates see the most recent values.
– GraphLab is asynchronous.
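The difference shows up on the chain example used on the following slides; this is a small illustrative Python sketch (not from the thesis) of one synchronous step versus one left-to-right Gauss-Seidel sweep of minimum-label propagation:

    def jacobi_step(labels):
        """Every vertex reads its neighbors' values from the previous iteration."""
        n = len(labels)
        return [min([labels[i]] +
                    ([labels[i - 1]] if i > 0 else []) +
                    ([labels[i + 1]] if i < n - 1 else []))
                for i in range(n)]

    def gauss_seidel_sweep(labels):
        """Updates are applied in place, so later vertices already see fresh values."""
        n = len(labels)
        for i in range(n):
            if i > 0:
                labels[i] = min(labels[i], labels[i - 1])
            if i < n - 1:
                labels[i] = min(labels[i], labels[i + 1])
        return labels

    print(jacobi_step([1, 2, 5, 3, 4]))         # [1, 1, 2, 3, 3]: one hop per iteration
    print(gauss_seidel_sweep([1, 2, 5, 3, 4]))  # [1, 1, 1, 1, 1]: the minimum sweeps through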

30

PSW runs Gauss-Seidel

(figure) Same layout as before: the sub-graph for vertices 101-700 is loaded from shard 2, with out-edge blocks in memory from the sliding windows over the other shards. Fresh values are already available in the "window" written by the previous phase.

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires on the order of graph-diameter many iterations to propagate the minimum label.

Each vertex chooses the minimum label of its neighbors.

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel: expected number of iterations with a random schedule on a chain graph
= (N − 1) / (e − 1) ≈ 60% of the synchronous case.

Each vertex chooses the minimum label of its neighbors.

33

Open theoretical question

Label Propagation (side length = 100)

Iterations, Synchronous vs. PSW Gauss-Seidel (average, random schedule):
199 vs. ~57; 100 vs. ~29; 298 vs. ~52.

Natural graphs (web, social): graph diameter − 1 vs. ~0.6 x diameter.

Joint work: Julian Shun

Chapter 6

34

PSW & External Memory Algorithms Research

• PSW is a new technique for implementing many fundamental graph algorithms.
– Especially simple (compared to previous work) for directed graph problems: PSW handles both in- and out-edges.

• We propose a new graph contraction algorithm based on PSW.
– Minimum Spanning Forest & Connected Components
• … utilizing the Gauss-Seidel "acceleration".

Joint work: Julian Shun

SEA 2014; Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluation:
• HD vs. SSD
• Striping data across multiple hard drives
• Comparison to an in-memory version
• Bottleneck analysis
• Effect of the number of shards
• Block size and performance

37

GraphChi

• C++ implementation: 8,000 lines of code
– Java implementation also available

• Several optimizations to PSW (see paper)

Source code and examples: http://github.com/graphchi

38

Experiment Setting

• Mac Mini (Apple Inc.)
– 8 GB RAM
– 256 GB SSD, 1 TB hard drive
– Intel Core i5, 2.5 GHz

• Experiment graphs:

Graph          Vertices   Edges   P (shards)   Preprocessing
live-journal   4.8M       69M     3            0.5 min
netflix        0.5M       99M     20           1 min
twitter-2010   42M        1.5B    20           2 min
uk-2007-05     106M       3.7B    40           31 min
uk-union       133M       5.4B    50           33 min
yahoo-web      1.4B       6.6B    50           37 min

39

Comparison to Existing Systems

Notes: comparison results do not include time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all but GraphLab compute synchronously.

See the paper for more comparisons.

(charts: runtime in minutes)
– PageRank on Twitter-2010 (1.5B edges): Spark, 50 machines vs. GraphChi, Mac Mini
– WebGraph Belief Propagation (U. Kang et al.) on Yahoo-web (6.7B edges): Pegasus / Hadoop, 100 machines vs. GraphChi, Mac Mini
– Matrix Factorization (Alternating Least Squares) on Netflix (0.99B edges): GraphLab v1, 8 cores vs. GraphChi, Mac Mini
– Triangle Counting on twitter-2010 (1.5B edges): Hadoop, 1636 machines vs. GraphChi, Mac Mini

On a Mac Mini, GraphChi can solve as big problems as existing large-scale systems, with comparable performance.

40

PowerGraph Comparison

• PowerGraph / GraphLab 2 outperforms previous systems by a wide margin on natural graphs.
• With 64x more machines, 512x more CPUs:
– Pagerank: 40x faster than GraphChi
– Triangle counting: 30x faster than GraphChi

(OSDI'12)

GraphChi has good performance per CPU.

(figure: GraphLab 2 vs. GraphChi)

41

In-memory Comparison

• Comparison to state-of-the-art: 5 iterations of Pagerank on Twitter (1.5B edges).
• Total runtime comparison to 1-shard GraphChi, with initial load + output write taken into account.

GraphChi (Mac Mini, SSD): 790 secs
Ligra (J. Shun, Blelloch), 40-core Intel E7-8870: 15 secs
Ligra (J. Shun, Blelloch), 8-core Xeon 5550: 230 s + preprocessing 144 s
PSW, in-memory version, 700 shards (see Appendix), 8-core Xeon 5550: 100 s + preprocessing 210 s

However, sometimes a better algorithm is available for in-memory than for external-memory / distributed computation.

42

Scalability Input Size [SSD]

• Throughput: number of edges processed / second.

Conclusion: the throughput remains roughly constant when the graph size is increased.

GraphChi with a hard drive is ~2x slower than with an SSD (if the computational cost is low).

(chart: PageRank throughput on Mac Mini, SSD; performance in edges/sec, up to ~2.5e7, vs. graph size, up to ~7e9 edges, for domain, twitter-2010, uk-2007-05, uk-union, and yahoo-web)

Paper: scalability of other applications.

43

GRAPHCHI-DB (new work)

44

(figure: the overview diagram again, now turning to GraphChi-DB (Partitioned Adjacency Lists): online graph updates, edge and vertex properties, and the graph-query applications such as shortest path, friends-of-friends, induced subgraphs, neighborhood queries, link prediction, graph sampling and traversal)

45

Research Questions

• What if there is a lot of metadata associated with edges and vertices?
• How to do graph queries efficiently, while retaining computational capabilities?
• How to add edges to the graph efficiently?

Can we design a graph database based on GraphChi?

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problems:
• Poor performance with data >> memory
• No/weak support for analytical computation

2) Relational / key-value databases as graph storage

Problems:
• Large indices
• In-edge / out-edge dilemma
• No/weak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review: Edges in Shards

(figure) Shard 1, Shard 2, Shard 3, …, Shard P over vertex-id ranges 0-100, 101-700, …, N; each shard is sorted by source_id.

Edge = (src, dst). Partition by dst.

49

Shard Structure (Basic)

Source   Destination
1        8
1        193
1        76420
3        12
3        872
7        193
7        212
7        89139
…        …

50

Shard Structure (Basic)

Pointer-array:
Source   File offset
1        0
3        3
7        5
…        …

Edge-array (destinations):
8, 193, 76420, 12, 872, 193, 212, 89139, …

Compressed Sparse Row (CSR)

Problem 1: How to find the in-edges of a vertex quickly?
Note: we know the shard, but inside it the in-edges of a vertex are in "random" order (the shard is sorted by source, not destination).
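A minimal sketch of the CSR-style out-edge lookup inside one shard (Python; the arrays are toy stand-ins for the pointer-array and edge-array, not GraphChi-DB's on-disk format):

    import bisect

    sources = [1, 3, 7]                  # sorted source ids that have edges in this shard
    offsets = [0, 3, 5]                  # offset of each source's first edge in the edge-array
    edge_array = [8, 193, 76420,         # destinations of source 1
                  12, 872,               # destinations of source 3
                  193, 212, 89139]       # destinations of source 7

    def out_edges_in_shard(src):
        i = bisect.bisect_left(sources, src)
        if i == len(sources) or sources[i] != src:
            return []                    # this shard holds no out-edges of src
        end = offsets[i + 1] if i + 1 < len(sources) else len(edge_array)
        return edge_array[offsets[i]:end]

    print(out_edges_in_shard(3))         # [12, 872]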

51

PAL In-edge Linkage

(figure: the same pointer-array and edge-array as on the previous slide)

52

PAL In-edge Linkage

Pointer-array:
Source   File offset
1        0
3        3
7        5
…        …

Edge-array (destination, link to next in-edge of the same destination):
8: 3339;  193: 3;  76420: 1092;  12: 289;  872: 40;  193: 2002;  212: 12;  89139: 22;  …

+ Index to the first in-edge of each vertex in the interval.

Augmented linked list for in-edges.

Problem 2: How to find out-edges quickly?
Note: sorted inside a shard, but partitioned across all shards.
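A hedged sketch of how those links are followed (Python; the link values below are made up for the example and are not the numbers shown in the figure):

    edge_src = [1, 1, 1, 3, 3, 7, 7, 7]                   # source of each edge, as above
    edge_dst = [8, 193, 76420, 12, 872, 193, 212, 89139]  # destination of each edge
    next_in  = [-1, 5, -1, -1, -1, -1, -1, -1]            # offset of next in-edge of same dst, -1 = end
    first_in = {8: 0, 193: 1, 76420: 2, 12: 3, 872: 4, 212: 6, 89139: 7}  # first in-edge per vertex

    def in_edges(dst):
        """Collect the sources of dst by walking its linked list of in-edges."""
        result, offset = [], first_in.get(dst, -1)
        while offset != -1:
            result.append(edge_src[offset])
            offset = next_in[offset]
        return result

    print(in_edges(193))   # [1, 7]: both in-edges found without scanning the shard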

53

PAL: Out-edge Queries

(figure: the pointer-array [source, file offset] and the edge-array [destination, next-in-offset] from the previous slides)

Problem: the pointer-array can be big, O(V), and binary search on disk is slow.

Option 1: Sparse index (in memory); the figure shows sampled entries 256, 512, 1024, …
Option 2: Delta-coding with a unary code (Elias-Gamma), kept completely in memory.
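For Option 2, a small sketch of Elias-Gamma coding applied to the gaps between consecutive source ids (Python; the slide does not give GraphChi-DB's exact bit layout, so this only illustrates the idea of keeping the pointer-array compressed in memory):

    def elias_gamma(n):
        """Unary length prefix followed by the binary form of a positive integer."""
        b = bin(n)[2:]
        return "0" * (len(b) - 1) + b

    def encode_gaps(sorted_ids):
        bits, prev = [], 0
        for x in sorted_ids:
            bits.append(elias_gamma(x - prev))   # gaps in a sorted id list are >= 1
            prev = x
        return "".join(bits)

    print(encode_gaps([1, 3, 7]))   # gaps 1, 2, 4 -> '1' + '010' + '00100'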

54

Experiment: Indices

Median latency, Twitter graph, 1.5B edges.

55

Queries: I/O costs

In-edge query: only one shard.
Out-edge query: each shard that has edges.

Trade-off: more shards means better locality for in-edge queries, worse for out-edge queries.

Edge Data & Searches

(figure) Shard X: the adjacency file plus a separate column file per edge attribute, e.g. 'weight' [float], 'timestamp' [long], 'belief' (factor); edge (i) sits at the same position i in every column.

No foreign key is required to find edge data (fixed size).

Note: vertex values are stored similarly.

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merging requires loading the existing shard from disk; every edge is rewritten on each merge.

Does not scale: number of rewrites ≈ data size / buffer size = O(E).

60

Log-Structured Merge-tree (LSM)
Ref: O'Neil, Cheng et al. (1996)

(figure) New edges go to the top shards with in-memory buffers (LEVEL 1, youngest; one shard per interval 1 … P). Full shards are merged downstream into LEVEL 2 (shards covering intervals 1-4, …, (P-4)-P) and eventually into LEVEL 3 (oldest; intervals 1-16, …).

In-edge query: one shard on each level.
Out-edge query: all shards.
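A toy sketch of the downstream merge (Python; plain sorted lists stand in for shard files, and the capacity and growth factor are invented for the example):

    BUFFER_CAPACITY = 4
    GROWTH = 4                       # each level holds GROWTH times more than the previous one
    levels = [[], [], []]            # level 0 = youngest; a real level is a set of P shards
    buffer = []

    def insert_edge(edge):
        buffer.append(edge)
        if len(buffer) >= BUFFER_CAPACITY:
            flush(0, sorted(buffer))
            buffer.clear()

    def flush(level, run):
        merged = sorted(levels[level] + run)      # merge with what is already on this level
        if level + 1 < len(levels) and len(merged) > BUFFER_CAPACITY * GROWTH ** (level + 1):
            levels[level] = []
            flush(level + 1, merged)              # push the merged run further downstream
        else:
            levels[level] = merged

    for e in [(5, 3), (1, 2), (9, 7), (2, 8), (4, 4)]:
        insert_edge(e)
    print(levels, buffer)            # four edges flushed to level 0, one still buffered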

61

Experiment Ingest

(chart: number of edges ingested, up to 1.5e9, vs. time in seconds, up to ~16,000, for GraphChi-DB with LSM vs. GraphChi-No-LSM)

62

Advantages of PAL

• Only sparse and implicit indices
– The pointer-array usually fits in RAM with Elias-Gamma coding; small database size.

• Columnar data model
– Load only the data you need.
– Graph structure is separate from data.
– Property graph model.

• Great insertion throughput with LSM
– The tree can be adjusted to match the workload.

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

• Written in Scala
• Queries & computation
• Online database

All experiments shown in this talk were done on a Mac Mini (8 GB RAM, SSD).

Source code and examples: http://github.com/graphchi

65

Comparison Database Size

Baseline: 4 + 4 bytes / edge.

(chart: database file size for the twitter-2010 graph, 1.5B edges; compared: MySQL (data + indices), Neo4j, GraphChi-DB, and the baseline; axis 0-70)

66

Comparison Ingest

System                  Time to ingest 1.5B edges
GraphChi-DB (ONLINE)    1 hour 45 mins
Neo4j (batch)           45 hours
MySQL (batch)           3 hours 30 minutes (including index creation)

If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes.

Comparison: Friends-of-Friends Query

See thesis for the shortest-path comparison.

(charts: latency percentiles over 100K random queries)

Small graph (68M edges), 50th percentile: GraphChi-DB 0.379 ms vs. Neo4j 0.127 ms.
Small graph (68M edges), 99th percentile: GraphChi-DB 6.653 ms vs. Neo4j 8.078 ms.
Big graph (1.5B edges), 50th percentile: GraphChi-DB 22.4 ms, Neo4j 759.8 ms, MySQL 5.9 ms.
Big graph (1.5B edges), 99th percentile: GraphChi-DB 1264 ms, Neo4j 1631 ms, MySQL 4776 ms.

68

LinkBench: Online Graph DB Benchmark by Facebook

• Concurrent read/write workload
– But only single-hop queries ("friends")
– 8 different operations, mixed workload
– Best performance with 64 parallel threads

• Each edge and vertex has:
– Version, timestamp, type, random string payload

                           GraphChi-DB (Mac Mini)   MySQL + FB patch, server, SSD-array, 144 GB RAM
Edge-update (95p)          22 ms                    25 ms
Edge-get-neighbors (95p)   18 ms                    9 ms
Avg throughput             2,487 req/s              11,029 req/s
Database size              350 GB                   1.4 TB

See full results in the thesis.

69

Summary of Experiments

• Efficient for mixed read/write workloads
– See the Facebook LinkBench experiments in the thesis.
– The LSM-tree trades off read performance (but can be adjusted).

• State-of-the-art performance for graphs that are much larger than RAM
– Neo4j's linked-list data structure is good for RAM.
– DEX performs poorly in practice.

More experiments in the thesis.

70

Discussion

Greater Impact / Hindsight

Future Research Questions

71

GREATER IMPACT

72

Impact: "Big Data" Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar).
– Two major direct descendant papers in top conferences:
• X-Stream, SOSP 2013
• TurboGraph, KDD 2013

• Challenging the mainstream:
– You can do a lot on just a PC; focus on the right data structures and computational models.

73

Impact: Users

• The GraphChi variants (C++, Java, -DB) have gained a lot of users.
– Currently ~50 unique visitors / day.

• Enables 'everyone' to tackle big graph problems.
– Especially the recommender toolkit (by Danny Bickson) has been very popular.
– Typical users: students, non-systems researchers, small companies…

74

Impact: Users (cont.)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research: it is too expensive.

Using GraphChi, I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research."

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for?

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models.

1. Changing the value of an edge (both in- and out-edges).
2. Computation processes the whole graph, or most of it, on each iteration.
3. Random access to all of a vertex's edges.
– Vertex-centric vs. edge-centric.
4. Async / Gauss-Seidel execution.

(figure: edge between vertices u and v)

77

GraphChi Not Good For

• Very large vertex state.
• Traversals and two-hop dependencies.
– Or dynamic scheduling (such as Splash BP).
• High-diameter graphs, such as planar graphs.
– Unless the computation itself has short-range interactions.
• Very large number of iterations.
– Neural networks.
– LDA with Collapsed Gibbs sampling.
• No support for implicit graph structure.

+ Single-PC performance is limited.

78

(figure: the full overview diagram once more, summarizing everything built on PSW and PAL: GraphChi, GraphChi-DB, DrunkardMob, GraphChi^2, the graph-computation applications, and the graph-query applications listed earlier)

Versatility of PSW and PAL

79

Future Research Directions

• Distributed setting
1. Distributed PSW (one shard / node)?
   1. PSW is inherently sequential.
   2. Low bandwidth in the Cloud.
2. Co-operating GraphChi(-DB)'s connected with a Parameter Server?

• New graph programming models and tools
– Vertex-centric programming is sometimes too local; for example, two-hop interactions and many traversals are cumbersome.
– Abstractions for learning graph structure? Implicit graphs?
– Hard to debug, especially async. Better tools needed.

• Graph-aware optimizations to GraphChi-DB
– Buffer management
– Smart caching
– Learning the configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY?

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM.
– Good: works with very little memory.
– Bad: does not benefit from more memory.

• Simple trick: cache some data.

• Many graph algorithms have O(V) state.
– The update function accesses neighbor vertex state.
• Standard PSW: 'broadcast' the vertex value via edges.
• Semi-external: store vertex values in memory.
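A small sketch of the semi-external idea (Python; illustrative only, with an assumed min-label style update): the per-vertex state is a plain O(V) array in RAM, and only the edges are streamed from disk.

    from array import array

    num_vertices = 1_000_000
    value = array("f", range(num_vertices))   # O(V) state: ~4 MB of vertex values kept in RAM

    def apply_edge(src, dst):
        # Neighbor state is read straight from the in-memory array instead of
        # being "broadcast" through edge values as in standard PSW.
        if value[src] < value[dst]:
            value[dst] = value[src]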

82

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
– But not enough to store the whole graph.

(figure: the graph and the PSW working memory live on disk; the computational state, e.g. Computation 1 state and Computation 2 state, lives in RAM)

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
– Store billions of random-walk states in RAM.

• Multiple Breadth-First Searches
– Analyze neighborhood sizes by starting hundreds of random BFSes.

• Compute many different recommender algorithms in parallel (or with different parameterizations)
– See Mayank Mohta and Shu-Hao Yu's Master's project.

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin) – UAI 2010
Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors) – VLDB 2012
GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch) – OSDI 2012
DrunkardMob: Billions of Random Walks on Just a PC – ACM RecSys 2013
Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch) – SEA 2014
GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin) – submitted, arXiv
Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin) – ICML 2011

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation: GraphChi.
• Extended PSW to design Partitioned Adjacency Lists, to build a scalable graph database: GraphChi-DB.
• Proposed DrunkardMob for simulating billions of random walks in parallel.
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms: a new approach for external-memory graph algorithms research.

Thank You

87

ADDITIONAL SLIDES

88

Economics

                    GraphChi (40 Mac Minis)   PowerGraph (64 EC2 cc1.4xlarge)
Investment          67,320 $                  -
Operating costs:
  Per node / hour   0.03 $                    1.30 $
  Cluster / hour    1.19 $                    52.00 $
  Daily             28.56 $                   1,248.00 $

It takes about 56 days to recoup the Mac Mini investment.

Assumptions: Mac Mini 85 W (typical servers: 500-1000 W); most expensive US energy, 35c / kWh.

Equal-throughput configurations (based on OSDI'12).
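A quick check of the payback figure, using only the numbers in the table above:

    investment  = 67_320.00   # 40 Mac Minis, USD
    daily_minis = 28.56       # operating cost per day, USD
    daily_ec2   = 1_248.00    # equal-throughput EC2 cluster per day, USD
    print(investment / (daily_ec2 - daily_minis))   # ~55 days, matching the quoted ~56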

89

PSW for In-memory Computation

• External-memory setting
– Slow memory = hard disk / SSD
– Fast memory = RAM

• In-memory
– Slow = RAM
– Fast = CPU caches

(figure: small (fast) memory of size M, large (slow) memory of 'unlimited' size, block transfers of size B between them, CPU attached to the fast memory)

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

Julian Shun, Guy Blelloch: PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
– But needs twice the amount of space.
• Async / Gauss-Seidel helps with high-diameter graphs.
• Some algorithms converge much better asynchronously.
– Loopy BP: see Gonzalez et al. (2009)
– Also Bertsekas & Tsitsiklis, Parallel and Distributed Computation (1989)
• Asynchronous is sometimes difficult to reason about and debug.
• Asynchronous can be used to implement BSP.

I/O Complexity

• See the paper for a theoretical analysis in the Aggarwal-Vitter I/O model.
– Worst case is only 2x the best case.

• Intuition:

(figure: intervals 1 … P and shards 1 … P over vertex ids 1 … |V|, with example vertices v1 and v2)

An edge inside one interval is loaded from disk only once per iteration; an edge spanning two intervals is loaded twice per iteration.

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise they would require too many iterations.
• Per-iteration cost is not much affected.
• Can we optimize partitioning?
– Could help thanks to Gauss-Seidel (faster convergence inside "groups"); topological sort.
– Likely too expensive to do on a single PC.

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., the WebGraph Framework) can compress edges to 3-4 bits/edge (web), ~10 bits/edge (social).
– But they require graph partitioning, which requires a lot of memory.
– Compression of large graphs can take days (personal communication).
• Compression is problematic for evolving graphs and associated data.
• GraphChi can be used to compress graphs.
– Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases.
– GOOD, GraphDB, HyperGraphDB…
– Focus was on modeling; graph storage on top of a relational DB or key-value store.
• RDF databases
– Most do not use graph storage, but store triples as relations + use indexing.
• Modern solutions have proposed graph-specific storage:
– Neo4j: doubly linked list
– TurboGraph: adjacency list chopped into pages
– DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont.)

• GraphChi load time 9 hours; FB's 12 hours.
• GraphChi database about 250 GB; FB > 1.4 terabytes.
– However, about 100 GB is explained by different variable-data (payload) sizes.
• Facebook: MySQL via JDBC; GraphChi: embedded.
– But MySQL is native code, GraphChi-DB is Scala (JVM).
• Important: CPU-bound bottleneck in sorting the results for high-degree vertices.

LinkBench: GraphChi-DB performance vs. size of DB

(chart: requests / sec, 0-10,000, vs. number of nodes, 1e7 to 1e9)

Why not even faster when everything fits in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory extension. [SSDAlloc, NSDI'11]
   – Too many small objects; need millions / sec.
2. Compress the graph structure to fit into RAM. [WebGraph framework]
   – Associated values do not compress well and are mutated.
3. Cluster the graph and handle each cluster separately in RAM.
   – Expensive; the number of inter-cluster edges is big.
4. Caching of hot nodes.
   – Unpredictable performance.

Number of Shards

• If P is in the "dozens", there is not much effect on performance.

Multiple hard drives (RAID-ish)

• GraphChi supports striping shards across multiple disks: parallel I/O.

(chart: seconds / iteration for Pagerank and Connected Components with 1, 2, and 3 hard drives; y-axis 0-3000)

Experiment on a 16-core AMD server (from year 2007).

Bottlenecks

(chart: Connected Components on Mac Mini SSD, runtime with 1, 2, and 4 threads, split into disk I/O, graph construction, and executing updates; y-axis 0-2500 seconds)

• The cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD.
– Graph construction requires a lot of random access in RAM; memory bandwidth becomes a bottleneck.

Bottlenecks: Multicore

Experiment on a MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution.
• GraphChi saturates SSD I/O with 2 threads.

(charts: runtime in seconds, split into loading and computation, vs. number of threads (1, 2, 4), for Matrix Factorization (ALS), y-axis 0-160, and Connected Components, y-axis 0-1000)

104

In-memory vs Disk

105

Experiment: Query latency

See thesis for an I/O cost analysis of in/out queries.

106

Example Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S.
• Very fast in GraphChi-DB.
– Sufficient to query for out-edges.
– Parallelizes well: multi-out-edge-query.
• Can be used for statistical graph analysis.
– Sample induced neighborhoods, induced FoF neighborhoods from the graph.
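A hedged sketch of that query pattern (Python; query_out_edges is a hypothetical stand-in for the database's out-edge query, here backed by a toy dictionary):

    def induced_subgraph(S, query_out_edges):
        """Keep exactly the edges whose endpoints are both in S."""
        S = set(S)
        return [(v, w) for v in S for w in query_out_edges(v) if w in S]

    adjacency = {1: [2, 5], 2: [3], 3: [1], 5: [6]}
    print(induced_subgraph({1, 2, 3}, lambda v: adjacency.get(v, [])))
    # edges (1, 2), (2, 3) and (3, 1), in some order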

Vertices / Nodes

• Vertices are partitioned similarly to edges.
– Similar "data shards" for columns.
• Lookup / update of vertex data is O(1).
• No merge tree here; vertex files are "dense".
– A sparse structure could be supported.

(figure: vertex-id ranges 1 … a, a+1 … 2a, …, up to Max-id)

ID-mapping

• Vertex IDs are mapped to internal IDs to balance the shards.
– Interval length: constant a.

(figure: original IDs 0, 1, 2, …, 255 map to internal IDs 0, a, 2a, …; original IDs 256, 257, 258, …, 511 map to 1, a+1, 2a+1, …)
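One way to read the figure is the round-robin map below (Python); the formula is this editor's interpretation of the slide, not taken from the GraphChi-DB source, with P the number of intervals and a the interval length:

    def to_internal(orig, P, a):
        """Spread consecutive original ids across the P intervals."""
        return (orig % P) * a + orig // P

    def to_original(internal, P, a):
        return (internal % a) * P + internal // a

    P, a = 256, 1000
    print(to_internal(0, P, a), to_internal(1, P, a), to_internal(256, P, a))  # 0 1000 1
    assert to_original(to_internal(4711, P, a), P, a) == 4711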

109

What if we have a Cluster?

(figure: the same setup as before, the graph and PSW working memory on disk and the computation states in RAM, on each machine of the cluster)

Trade latency for throughput!

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications…
2. … which is caused by a lack of good data available to academics: big graphs with metadata.
– Industry co-operation? But a problem with reproducibility.
– Also, it is hard to ask good questions about graphs (especially with just the structure).
3. Too much focus on performance. More important to enable "extracting value".

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())


Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java, 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems (this talk): "paging" from disk is costly.


Random walks in GraphChi

• DrunkardMob algorithm
– Reverse thinking:

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk.

Page 2: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

2

Research Fields

This thesis

Machine Learning

Data Mining

Graph analysis

and mining

External memory

algorithms research

Systems research

Databases

Motivation and applications

Research contributions

3

Why Graphs

Large-Scale Graph Computation on Just a PC

4

BigData with Structure BigGraph

social graph social graph follow-graph consumer-products graph

user-movie ratings graph

DNA interaction

graph

WWWlink graph

Communication networks (but ldquoonly 3 hopsrdquo)

5

Why on a single machine

Canrsquot we just use the Cloud

Large-Scale Graph Computation on a Just a PC

6

Why use a cluster

Two reasons1 One computer cannot handle my graph problem

in a reasonable time

2 I need to solve the problem very fast

7

Why use a cluster

Two reasons1 One computer cannot handle my graph problem

in a reasonable time

2 I need to solve the problem very fast

Our work expands the space of feasible problems on one machine (PC)- Our experiments use the same graphs or bigger than previous papers on distributed graph computation (+ we can do Twitter graph on a laptop)

Our work raises the bar on required performance for a ldquocomplicatedrdquo system

8

Benefits of single machine systems

Assuming it can handle your big problemshellip1 Programmer productivityndash Global state debuggershellip

2 Inexpensive to install administer less power

3 Scalabilityndash Use cluster of single-machine systems

to solve many tasks in parallel Idea Trade latency

for throughputlt 32K bitssec

9

Computing on Big Graphs

Large-Scale Graph Computation on Just a PC

10

Big Graphs = Big Data

Data size 140 billion connections

asymp 1 TB

Not a problem

Computation

Hard to scale

Twitter network visualization by Akshay Java 2009

11

Research Goal

Compute on graphs with billions of edges in a reasonable time on a single PC

ndash Reasonable = close to numbers previously reported for distributed systems in the literature

Experiment PC Mac Mini (2012)

12

Terminology

bull (Analytical) Graph ComputationndashWhole graph is processed typically for

several iterations vertex-centric computation

ndash Examples Belief Propagation Pagerank Community detection Triangle Counting Matrix Factorization Machine Learninghellip

bull Graph Queries (database)ndash Selective graph queries (compare to SQL

queries)ndash Traversals shortest-path friends-of-

friendshellip

13

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Thesis statement

The Parallel Sliding Windows algorithm and the Partitioned Adjacency Lists data structure enable

computation on very large graphs in external memory on just a personal computer

14

DISK-BASED GRAPH COMPUTATION

15

GraphChi (Parallel Sliding Windows)

Batchcomp

Evolving graph

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Matrix Factorization

Loopy Belief PropagationCo-EM

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

DrunkardMob Parallel Random walk simulation

GraphChi^2

Incremental comp

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Graph sampling

Graph Queries

Graph traversal

16

Computational Model

bull Graph G = (V E) ndash directed edges e =

(source destination)ndash each edge and vertex

associated with a value (user-defined type)

ndash vertex and edge values can be modifiedbull (structure modification also

supported) Data

Data Data

Data

DataDataData

DataData

Data

A Be

Terms e is an out-edge of A and in-edge of B

17

Data

Data Data

Data

DataDataData

Data

Data

Data

Vertex-centric Programming

bull ldquoThink like a vertexrdquobull Popularized by the Pregel and GraphLab

projectsndash Historically systolic computation and the Connection Machine

MyFunc(vertex) modify neighborhood Data

Data Data

Data

Data

function Pagerank(vertex) insum = sum(edgevalue for edge in vertexinedges) vertexvalue = 085 + 015 insum foreach edge in vertexoutedges edgevalue = vertexvalue vertexnum_outedges

18

Computational Setting

ConstraintsA Not enough memory to store the whole

graph in memory nor all the vertex values

B Enough memory to store one vertex and its edges w associated values

19

The Main Challenge of Disk-based Graph Computation

Random Access

~ 100K reads sec (commodity)~ 1M reads sec (high-end arrays)

ltlt 5-10 M random edges sec to achieve ldquoreasonable performancerdquo

20

Random Access Problem

File edge-values

A in-edges A out-edges

A B

B in-edges B out-edges

x

Moral You can either access in- or out-edges sequentially but not both

Random write

Random readProcessing sequentially

21

Our Solution

Parallel Sliding Windows (PSW)

22

Parallel Sliding Windows Phases

bull PSW processes the graph one sub-graph a time

bull In one iteration the whole graph is processedndash And typically next iteration is started

1 Load

2 Compute

3 Write

23

bull Vertices are numbered from 1 to nndash P intervalsndash sub-graph = interval of vertices

PSW Shards and Intervals

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 10000100 700

1 Load

2 Compute

3 Write

ltltpartition-bygtgt

source destinationedge

In shards edges sorted by source

24

Example Layout

Shard 1

Shards small enough to fit in memory balance size of shards

Shard in-edges for interval of vertices sorted by source-id

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Shard 2 Shard 3 Shard 4Shard 1

1 Load

2 Compute

3 Write

25

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Load all in-edgesin memory

Load subgraph for vertices 1100

What about out-edges Arranged in sequence in other shards

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Shard 1

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

bull Bulk-Synchronous Parallel (Jacobi iterations)ndash Updates see neighborsrsquo values from

previous iteration [Most systems are synchronous]

bull Asynchronous (Gauss-Seidel iterations)ndash Updates see most recent valuesndash GraphLab is asynchronous

30

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW runs Gauss-Seidel

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Fresh values in this ldquowindowrdquo from previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph diameter ndashmany iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N - 1) (e ndash 1) asymp 60 of synchronous

Each vertex chooses minimum label of neighbor

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for?

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models
1. Changing value of an edge (both in- and out)
2. Computation: process whole or most of the graph on each iteration
3. Random access to all of a vertex's edges
– Vertex-centric vs. edge-centric
4. Async / Gauss-Seidel execution

[Diagram: edge between vertices u and v]

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
– Or dynamic scheduling (such as Splash BP)
• High-diameter graphs, such as planar graphs
– Unless the computation itself has short-range interactions
• Very large number of iterations
– Neural networks
– LDA with Collapsed Gibbs sampling
• No support for implicit graph structure

+ Single PC performance is limited

78

Versatility of PSW and PAL

[Diagram: GraphChi (Parallel Sliding Windows) and GraphChi-DB (Partitioned Adjacency Lists), with batch, incremental and evolving-graph computation, online graph updates, DrunkardMob parallel random walk simulation, and GraphChi^2. Graph computation: Triangle Counting, Item-Item Similarity, PageRank, SALSA, HITS, Graph Contraction, Minimum Spanning Forest, k-Core, Weakly/Strongly Connected Components, Community Detection, Label Propagation, Multi-BFS, Matrix Factorization, Loopy Belief Propagation, Co-EM. Graph queries: Shortest Path, Friends-of-Friends, Induced Subgraphs, Neighborhood query, Edge and vertex properties, Link prediction, Graph sampling, Graph traversal.]

79

Future Research Directions

• Distributed setting
1. Distributed PSW (one shard / node)
   1. PSW is inherently sequential
   2. Low bandwidth in the Cloud
2. Co-operating GraphChi(-DB)'s connected with a Parameter Server
• New graph programming models and tools
– Vertex-centric programming sometimes too local: for example, two-hop interactions and many traversals are cumbersome
– Abstractions for learning graph structure, implicit graphs
– Hard to debug, especially async; better tools needed
• Graph-aware optimizations to GraphChi-DB
– Buffer management
– Smart caching
– Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY?

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM
– Good: works with very little memory
– Bad: does not benefit from more memory
• Simple trick: cache some data
• Many graph algorithms have O(V) state
– Update function accesses neighbor vertex state
  • Standard PSW: 'broadcast' vertex value via edges
  • Semi-external: store vertex values in memory (see the sketch below)
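As a concrete illustration of the semi-external idea, here is a sketch (illustrative only; stream_shard_edges is a hypothetical helper that streams (source, destination) pairs from the on-disk shards, not part of GraphChi's API): the per-vertex labels of a connected-components computation fit in one in-memory array, while the O(E) edge data stays on disk.

import numpy as np

def semi_external_min_label(num_vertices, shard_files, max_iters=20):
    # O(V) state in RAM: one label per vertex; edges are streamed from disk.
    labels = np.arange(num_vertices, dtype=np.int64)
    for _ in range(max_iters):
        changed = False
        for src, dst in stream_shard_edges(shard_files):   # hypothetical helper
            m = min(labels[src], labels[dst])
            if labels[src] != m:
                labels[src] = m; changed = True            # in-place update, Gauss-Seidel style
            if labels[dst] != m:
                labels[dst] = m; changed = True
        if not changed:
            break      # converged: labels identify weakly connected components
    return labels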

82

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
– But not enough to store the whole graph

[Diagram: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state) in RAM]

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
– Store billions of random walk states in RAM
• Multiple Breadth-First-Searches
– Analyze neighborhood sizes by starting hundreds of random BFSes (see the sketch below)
• Compute in parallel many different recommender algorithms (or with different parameterizations)
– See Mayank Mohta, Shu-Hao Yu's Master's project
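For the multi-BFS example, a sketch along the same lines (same hypothetical stream_shard_edges helper as above; not GraphChi code): one boolean per (vertex, BFS source) is the O(V) state kept in RAM, and each pass over the on-disk edges advances all BFS frontiers by one hop.

import numpy as np

def multi_bfs_reach(num_vertices, shard_files, sources, num_hops=3):
    # reached[v, k] is True once vertex v is within num_hops of sources[k].
    reached = np.zeros((num_vertices, len(sources)), dtype=bool)
    for k, s in enumerate(sources):
        reached[s, k] = True
    for _ in range(num_hops):
        for src, dst in stream_shard_edges(shard_files):   # hypothetical helper
            reached[dst] |= reached[src]                   # advance every BFS at once
    return reached.sum(axis=0)                             # neighborhood size per source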

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin) - UAI 2010

Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors) - VLDB 2012

GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch) - OSDI 2012

DrunkardMob: Billions of Random Walks on Just a PC - ACM RecSys 2013

Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch) - SEA 2014

GraphChi-DB: Simple Design for a Scalable Graph Database - on Just a PC (with C. Guestrin) - (submitted, arXiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin) - ICML 2011

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external memory graph computation: GraphChi
• Extended PSW to design Partitioned Adjacency Lists, to build a scalable graph database: GraphChi-DB
• Proposed DrunkardMob for simulating billions of random walks in parallel
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms: a new approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                   GraphChi (40 Mac Minis)    PowerGraph (64 EC2 cc1.4xlarge)
Investments        67,320 $                   -
Operating costs:
  Per node / hour  0.03 $                     1.30 $
  Cluster / hour   1.19 $                     52.00 $
  Daily            28.56 $                    1,248.00 $

It takes about 56 days to recoup the Mac Mini investments.

Assumptions:
- Mac Mini: 85 W (typical servers: 500-1000 W)
- Most expensive US energy: 35c / kWh

Equal-throughput configurations (based on OSDI'12)
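As a quick sanity check (my arithmetic from the table above, not from the slides), the payback time is the investment divided by the difference in daily operating cost, which is consistent with the roughly 56 days quoted:

\[
\frac{67{,}320\ \$}{(1{,}248.00 - 28.56)\ \$/\text{day}} \approx 55\ \text{days}
\]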

89

PSW for In-memory Computation

• External memory setting
– Slow memory = hard disk / SSD
– Fast memory = RAM
• In-memory
– Slow = RAM
– Fast = CPU caches

[Diagram: small (fast) memory of size M attached to the CPU; large (slow) memory of size 'unlimited'; data moves in block transfers of size B]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra*, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including preprocessing steps of both systems.

* Julian Shun, Guy Blelloch, PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
– But needs twice the amount of space
• Async / Gauss-Seidel helps with high-diameter graphs
• Some algorithms converge much better asynchronously
– Loopy BP, see Gonzalez et al. (2009)
– Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)
• Asynchronous sometimes difficult to reason about and debug
• Asynchronous can be used to implement BSP

I/O Complexity

• See the paper for a theoretical analysis in the Aggarwal-Vitter I/O model
– Worst case is only 2x the best case
• Intuition:

[Diagram: intervals 1..P and shards 1..P over vertices v1, v2, …, |V|]

An edge inside one interval is loaded from disk only once per iteration; an edge spanning two intervals is loaded twice per iteration.
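Stated as a rough formula (a sketch consistent with the slide's intuition, not the paper's exact theorem; B is the block size and P the number of intervals/shards):

\[
\frac{|E|}{B} \;\lesssim\; \text{edge-block transfers per iteration} \;\lesssim\; \frac{2|E|}{B}, \quad \text{plus } O(P^2) \text{ non-sequential reads from the sliding windows.}
\]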

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise too many iterations would be required
• Per-iteration cost not much affected
• Can we optimize partitioning?
– Could help thanks to Gauss-Seidel (faster convergence inside "groups"), topological sort
– Likely too expensive to do on a single PC

94

Graph Compression: Would it Help?

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
– But they require graph partitioning, which requires a lot of memory
– Compression of large graphs can take days (personal communication)
• Compression problematic for evolving graphs and associated data
• GraphChi can be used to compress graphs
– Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• 1990s and 2000s saw interest in object-oriented and graph databases
– GOOD, GraphDB, HyperGraphDB…
– Focus was on modeling, graph storage on top of relational DB or key-value store
• RDF databases
– Most do not use graph storage but store triples as relations + use indexing
• Modern solutions have proposed graph-specific storage
– Neo4j: doubly linked list
– TurboGraph: adjacency list chopped into pages
– DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time 9 hours, FB's 12 hours
• GraphChi database about 250 GB, FB > 1.4 terabytes
– However, about 100 GB is explained by different variable data (payload) size
• Facebook: MySQL via JDBC; GraphChi: embedded
– But MySQL is native code, GraphChi-DB is Scala (JVM)
• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance vs. size of DB

[Chart: requests / sec (0 to 10,000) as a function of the number of nodes (1.0E+07 to 1.0E+09).]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory-extension [SSDAlloc, NSDI'11]
   – Too many small objects; need millions / sec
2. Compress the graph structure to fit into RAM [WebGraph framework]
   – Associated values do not compress well, and are mutated
3. Cluster the graph and handle each cluster separately in RAM
   – Expensive! The number of inter-cluster edges is big
4. Caching of hot nodes
   – Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance

Multiple hard-drives ("RAID-ish")

• GraphChi supports striping shards to multiple disks: parallel I/O

[Bar chart: seconds / iteration (0 to 3000) for Pagerank and Connected Components, with 1, 2, and 3 hard drives.]

Experiment on a 16-core AMD server (from year 2007).

Bottlenecks

[Stacked bar chart: runtime (0 to 2500 seconds) of Connected Components on Mac Mini / SSD with 1, 2, and 4 threads, broken into Disk IO, Graph construction, and Exec. updates.]

• Cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
– Graph construction requires a lot of random access in RAM; memory bandwidth becomes a bottleneck

Bottlenecks: Multicore

Experiment on MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD I/O with 2 threads

[Bar charts: runtime (seconds), split into Loading and Computation, for 1, 2, and 4 threads. Left: Matrix Factorization (ALS); right: Connected Components.]

104

In-memory vs Disk

105

Experiment: Query latency

See thesis for I/O cost analysis of in/out queries.

106

Example: Induced Subgraph Queries

• Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S
• Very fast in GraphChi-DB
– Sufficient to query for out-edges (see the sketch below)
– Parallelizes well: multi-out-edge-query
• Can be used for statistical graph analysis
– Sample induced neighborhoods, induced FoF neighborhoods from graph
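A minimal sketch of such a query (using the same hypothetical out_neighbors API as in the friends-of-friends example earlier, not the actual GraphChi-DB interface):

def induced_subgraph(db, S):
    # Keep only edges with both endpoints in S; one out-edge query per vertex in S.
    S = set(S)
    edges = []
    for v in S:
        for w in db.out_neighbors(v):
            if w in S:
                edges.append((v, w))
    return edges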

Vertices / Nodes

• Vertices are partitioned similarly as edges
– Similar "data shards" for columns
• Lookup / update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
– Sparse structure could be supported

[Diagram: vertex id ranges 1..a, a+1..2a, …, up to Max-id]

ID-mapping

• Vertex IDs mapped to internal IDs to balance shards
– Interval length constant a

[Diagram: original IDs 0, 1, 2, …, 255 map to internal IDs 0, a, 2a, …; original IDs 256, 257, 258, …, 511 map to internal IDs 1, a+1, 2a+1, …; a sketch of such a mapping follows below.]
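A sketch of one mapping consistent with the diagram (my reading, not necessarily GraphChi-DB's exact scheme): consecutive original IDs are dealt round-robin across the intervals of length a, so every interval, and hence every shard, receives roughly the same number of vertices.

def to_internal(original_id, num_intervals, a):
    # Original IDs 0, 1, 2, ... go to internal IDs 0, a, 2a, ... (round-robin over intervals).
    interval = original_id % num_intervals
    offset = original_id // num_intervals
    return interval * a + offset

def to_original(internal_id, num_intervals, a):
    interval, offset = divmod(internal_id, a)
    return offset * num_intervals + interval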

109

What if we have a Cluster?

[Diagram: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state) in RAM.]

Trade latency for throughput!

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications
2. … which is caused by lack of good data available for academics: big graphs with metadata
– Industry co-operation? But problem with reproducibility
– Also, it is hard to ask good questions about graphs (especially with just structure)
3. Too much focus on performance. More important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems: "paging" from disk is costly.

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
– Reverse thinking:

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk!

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • "Gauss-Seidel" Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM), Ref: O'Neil, Cheng et al. (1996)
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 3: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

3

Why Graphs

Large-Scale Graph Computation on Just a PC

4

BigData with Structure BigGraph

social graph social graph follow-graph consumer-products graph

user-movie ratings graph

DNA interaction

graph

WWWlink graph

Communication networks (but ldquoonly 3 hopsrdquo)

5

Why on a single machine

Canrsquot we just use the Cloud

Large-Scale Graph Computation on a Just a PC

6

Why use a cluster

Two reasons1 One computer cannot handle my graph problem

in a reasonable time

2 I need to solve the problem very fast

7

Why use a cluster

Two reasons1 One computer cannot handle my graph problem

in a reasonable time

2 I need to solve the problem very fast

Our work expands the space of feasible problems on one machine (PC)- Our experiments use the same graphs or bigger than previous papers on distributed graph computation (+ we can do Twitter graph on a laptop)

Our work raises the bar on required performance for a ldquocomplicatedrdquo system

8

Benefits of single machine systems

Assuming it can handle your big problemshellip1 Programmer productivityndash Global state debuggershellip

2 Inexpensive to install administer less power

3 Scalabilityndash Use cluster of single-machine systems

to solve many tasks in parallel Idea Trade latency

for throughputlt 32K bitssec

9

Computing on Big Graphs

Large-Scale Graph Computation on Just a PC

10

Big Graphs = Big Data

Data size 140 billion connections

asymp 1 TB

Not a problem

Computation

Hard to scale

Twitter network visualization by Akshay Java 2009

11

Research Goal

Compute on graphs with billions of edges in a reasonable time on a single PC

ndash Reasonable = close to numbers previously reported for distributed systems in the literature

Experiment PC Mac Mini (2012)

12

Terminology

bull (Analytical) Graph ComputationndashWhole graph is processed typically for

several iterations vertex-centric computation

ndash Examples Belief Propagation Pagerank Community detection Triangle Counting Matrix Factorization Machine Learninghellip

bull Graph Queries (database)ndash Selective graph queries (compare to SQL

queries)ndash Traversals shortest-path friends-of-

friendshellip

13

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Thesis statement

The Parallel Sliding Windows algorithm and the Partitioned Adjacency Lists data structure enable

computation on very large graphs in external memory on just a personal computer

14

DISK-BASED GRAPH COMPUTATION

15

GraphChi (Parallel Sliding Windows)

Batchcomp

Evolving graph

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Matrix Factorization

Loopy Belief PropagationCo-EM

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

DrunkardMob Parallel Random walk simulation

GraphChi^2

Incremental comp

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Graph sampling

Graph Queries

Graph traversal

16

Computational Model

bull Graph G = (V E) ndash directed edges e =

(source destination)ndash each edge and vertex

associated with a value (user-defined type)

ndash vertex and edge values can be modifiedbull (structure modification also

supported) Data

Data Data

Data

DataDataData

DataData

Data

A Be

Terms e is an out-edge of A and in-edge of B

17

Data

Data Data

Data

DataDataData

Data

Data

Data

Vertex-centric Programming

bull ldquoThink like a vertexrdquobull Popularized by the Pregel and GraphLab

projectsndash Historically systolic computation and the Connection Machine

MyFunc(vertex) modify neighborhood Data

Data Data

Data

Data

function Pagerank(vertex) insum = sum(edgevalue for edge in vertexinedges) vertexvalue = 085 + 015 insum foreach edge in vertexoutedges edgevalue = vertexvalue vertexnum_outedges

18

Computational Setting

ConstraintsA Not enough memory to store the whole

graph in memory nor all the vertex values

B Enough memory to store one vertex and its edges w associated values

19

The Main Challenge of Disk-based Graph Computation

Random Access

~ 100K reads sec (commodity)~ 1M reads sec (high-end arrays)

ltlt 5-10 M random edges sec to achieve ldquoreasonable performancerdquo

20

Random Access Problem

File edge-values

A in-edges A out-edges

A B

B in-edges B out-edges

x

Moral You can either access in- or out-edges sequentially but not both

Random write

Random readProcessing sequentially

21

Our Solution

Parallel Sliding Windows (PSW)

22

Parallel Sliding Windows Phases

bull PSW processes the graph one sub-graph a time

bull In one iteration the whole graph is processedndash And typically next iteration is started

1 Load

2 Compute

3 Write

23

bull Vertices are numbered from 1 to nndash P intervalsndash sub-graph = interval of vertices

PSW Shards and Intervals

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 10000100 700

1 Load

2 Compute

3 Write

ltltpartition-bygtgt

source destinationedge

In shards edges sorted by source

24

Example Layout

Shard 1

Shards small enough to fit in memory balance size of shards

Shard in-edges for interval of vertices sorted by source-id

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Shard 2 Shard 3 Shard 4Shard 1

1 Load

2 Compute

3 Write

25

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Load all in-edgesin memory

Load subgraph for vertices 1100

What about out-edges Arranged in sequence in other shards

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Shard 1

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

bull Bulk-Synchronous Parallel (Jacobi iterations)ndash Updates see neighborsrsquo values from

previous iteration [Most systems are synchronous]

bull Asynchronous (Gauss-Seidel iterations)ndash Updates see most recent valuesndash GraphLab is asynchronous

30

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW runs Gauss-Seidel

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Fresh values in this ldquowindowrdquo from previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph diameter ndashmany iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N - 1) (e ndash 1) asymp 60 of synchronous

Each vertex chooses minimum label of neighbor

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 4: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

4

BigData with Structure BigGraph

social graph social graph follow-graph consumer-products graph

user-movie ratings graph

DNA interaction

graph

WWWlink graph

Communication networks (but ldquoonly 3 hopsrdquo)

5

Why on a single machine

Canrsquot we just use the Cloud

Large-Scale Graph Computation on a Just a PC

6

Why use a cluster

Two reasons1 One computer cannot handle my graph problem

in a reasonable time

2 I need to solve the problem very fast

7

Why use a cluster

Two reasons1 One computer cannot handle my graph problem

in a reasonable time

2 I need to solve the problem very fast

Our work expands the space of feasible problems on one machine (PC)- Our experiments use the same graphs or bigger than previous papers on distributed graph computation (+ we can do Twitter graph on a laptop)

Our work raises the bar on required performance for a ldquocomplicatedrdquo system

8

Benefits of single machine systems

Assuming it can handle your big problemshellip1 Programmer productivityndash Global state debuggershellip

2 Inexpensive to install administer less power

3 Scalabilityndash Use cluster of single-machine systems

to solve many tasks in parallel Idea Trade latency

for throughputlt 32K bitssec

9

Computing on Big Graphs

Large-Scale Graph Computation on Just a PC

10

Big Graphs = Big Data

Data size 140 billion connections

asymp 1 TB

Not a problem

Computation

Hard to scale

Twitter network visualization by Akshay Java 2009

11

Research Goal

Compute on graphs with billions of edges in a reasonable time on a single PC

ndash Reasonable = close to numbers previously reported for distributed systems in the literature

Experiment PC Mac Mini (2012)

12

Terminology

bull (Analytical) Graph ComputationndashWhole graph is processed typically for

several iterations vertex-centric computation

ndash Examples Belief Propagation Pagerank Community detection Triangle Counting Matrix Factorization Machine Learninghellip

bull Graph Queries (database)ndash Selective graph queries (compare to SQL

queries)ndash Traversals shortest-path friends-of-

friendshellip

13

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Thesis statement

The Parallel Sliding Windows algorithm and the Partitioned Adjacency Lists data structure enable

computation on very large graphs in external memory on just a personal computer

14

DISK-BASED GRAPH COMPUTATION

15

GraphChi (Parallel Sliding Windows)

Batchcomp

Evolving graph

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Matrix Factorization

Loopy Belief PropagationCo-EM

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

DrunkardMob Parallel Random walk simulation

GraphChi^2

Incremental comp

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Graph sampling

Graph Queries

Graph traversal

16

Computational Model

bull Graph G = (V E) ndash directed edges e =

(source destination)ndash each edge and vertex

associated with a value (user-defined type)

ndash vertex and edge values can be modifiedbull (structure modification also

supported) Data

Data Data

Data

DataDataData

DataData

Data

A Be

Terms e is an out-edge of A and in-edge of B

17

Data

Data Data

Data

DataDataData

Data

Data

Data

Vertex-centric Programming

bull ldquoThink like a vertexrdquobull Popularized by the Pregel and GraphLab

projectsndash Historically systolic computation and the Connection Machine

MyFunc(vertex) modify neighborhood Data

Data Data

Data

Data

function Pagerank(vertex) insum = sum(edgevalue for edge in vertexinedges) vertexvalue = 085 + 015 insum foreach edge in vertexoutedges edgevalue = vertexvalue vertexnum_outedges

18

Computational Setting

ConstraintsA Not enough memory to store the whole

graph in memory nor all the vertex values

B Enough memory to store one vertex and its edges w associated values

19

The Main Challenge of Disk-based Graph Computation

Random Access

~ 100K reads sec (commodity)~ 1M reads sec (high-end arrays)

ltlt 5-10 M random edges sec to achieve ldquoreasonable performancerdquo

20

Random Access Problem

File edge-values

A in-edges A out-edges

A B

B in-edges B out-edges

x

Moral You can either access in- or out-edges sequentially but not both

Random write

Random readProcessing sequentially

21

Our Solution

Parallel Sliding Windows (PSW)

22

Parallel Sliding Windows Phases

bull PSW processes the graph one sub-graph a time

bull In one iteration the whole graph is processedndash And typically next iteration is started

1 Load

2 Compute

3 Write

23

bull Vertices are numbered from 1 to nndash P intervalsndash sub-graph = interval of vertices

PSW Shards and Intervals

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 10000100 700

1 Load

2 Compute

3 Write

ltltpartition-bygtgt

source destinationedge

In shards edges sorted by source

24

Example Layout

Shards are small enough to fit in memory; balance the size of the shards.
Shard = the in-edges for an interval of vertices, sorted by source-id.

[Figure: Shard 1 holds vertices 1…100, Shard 2 vertices 101…700, Shard 3 vertices 701…1000, Shard 4 vertices 1001…10000; each shard contains the in-edges for its interval, sorted by source_id]

1 Load

2 Compute

3 Write

25

PSW: Loading Sub-graph

Load the subgraph for vertices 1…100: load all of Shard 1's in-edges into memory.
What about out-edges? They are arranged in sequence in the other shards (Shards 2, 3, 4).

[Figure: Shard 1 (in-edges for vertices 1…100, sorted by source_id) is loaded fully; a window at the start of Shards 2-4 supplies the out-edges of vertices 1…100]

1 Load

2 Compute

3 Write

26

PSW: Loading Sub-graph (next interval)

Load the subgraph for vertices 101…700: load all of Shard 2's in-edges into memory.
Out-edge blocks for this interval are read into memory from the other shards.

[Figure: the intervals 1…100, 101…700, 701…1000, 1001…10000 as before; the window over each shard slides forward as the computation moves to the next interval]

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and P large writes for each interval
= P² random accesses over one full pass.

Works well on both SSD and magnetic hard disks! (See the PSW loop sketch below.)

28
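A minimal, self-contained sketch of the PSW load / compute / write schedule. This is a toy in-memory illustration, not GraphChi's actual API; the shard layout, helper names and the update signature are assumptions.

from collections import defaultdict

# Each "shard" p holds the in-edges of interval p, sorted by source id.
def make_shards(edges, intervals):
    """edges: list of [src, dst, value]; intervals: list of (lo, hi) vertex ranges."""
    shards = [[] for _ in intervals]
    for src, dst, val in edges:
        p = next(i for i, (lo, hi) in enumerate(intervals) if lo <= dst <= hi)
        shards[p].append([src, dst, val])
    for s in shards:
        s.sort(key=lambda e: e[0])          # inside a shard, edges sorted by source
    return shards

def psw_iteration(shards, intervals, update):
    for p, (lo, hi) in enumerate(intervals):
        # 1. Load: interval p's in-edges = the whole shard p; its out-edges =
        #    one contiguous "window" of every shard (edges whose src falls in [lo, hi]).
        in_edges, out_edges = defaultdict(list), defaultdict(list)
        for e in shards[p]:
            in_edges[e[1]].append(e)
        for shard in shards:
            for e in shard:                  # a contiguous range, since sorted by src
                if lo <= e[0] <= hi:
                    out_edges[e[0]].append(e)
        # 2. Compute: vertex-centric updates see and modify edge values in place.
        for v in range(lo, hi + 1):
            update(v, in_edges[v], out_edges[v])
        # 3. Write: in the real system the modified blocks are written back to disk;
        #    here the edge lists are shared objects, so the updates are already visible.

With update set to, for example, a Pagerank-style function like the one on the earlier slide, one call to psw_iteration performs one full pass over the graph.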

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

• Bulk-Synchronous Parallel (Jacobi iterations)
  – Updates see neighbors' values from the previous iteration [Most systems are synchronous]
• Asynchronous (Gauss-Seidel iterations)
  – Updates see the most recent values
  – GraphLab is asynchronous

30

PSW runs Gauss-Seidel

[Figure: loading the subgraph for vertices 101…700; Shard 2's in-edges are fully in memory, and the out-edge blocks of the other shards are in memory via sliding windows]

Fresh values in this "window" come from the previous phase.

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph-diameter-many iterations to propagate the minimum label.

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel: expected number of iterations on a random schedule, on a chain graph,
= (N - 1) / (e - 1) ≈ 60% of synchronous

Each vertex chooses minimum label of neighbor

33
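A small self-contained experiment (a toy in Python, not GraphChi code) that shows the effect on a chain of 100 vertices: synchronous (Jacobi) min-label propagation needs about N sweeps, while a Gauss-Seidel sweep that happens to go in the propagation direction finishes in a single pass; a random schedule, as stated above, lands in between at roughly (N - 1) / (e - 1).

def jacobi_iterations(labels):
    labels, n, iters = labels[:], len(labels), 0
    while True:
        new = [min([labels[i]] + ([labels[i - 1]] if i > 0 else []) +
                   ([labels[i + 1]] if i < n - 1 else [])) for i in range(n)]
        iters += 1
        if new == labels:          # converged: no label changed
            return iters
        labels = new

def gauss_seidel_iterations(labels):
    labels, n, iters = labels[:], len(labels), 0
    while True:
        changed = False
        for i in range(n):         # in-order sweep: uses freshly updated values
            m = min([labels[i]] + ([labels[i - 1]] if i > 0 else []) +
                    ([labels[i + 1]] if i < n - 1 else []))
            if m != labels[i]:
                labels[i], changed = m, True
        iters += 1
        if not changed:
            return iters

chain = list(range(1, 101))        # labels 1..100 on a chain; the minimum sits at one end
print(jacobi_iterations(chain), gauss_seidel_iterations(chain))   # ~100 vs. 2 sweeps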

Open theoretical question

Label Propagation (side length = 100)

                                   Synchronous           PSW Gauss-Seidel (average, random schedule)
                                   199                   ~57
                                   100                   ~29
                                   298 iterations        ~52
Natural graphs (web, social):      graph diameter - 1    ~0.6 × diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

• PSW is a new technique for implementing many fundamental graph algorithms
  – Especially simple (compared to previous work) for directed graph problems: PSW handles both in- and out-edges
• We propose a new graph contraction algorithm based on PSW
  – Minimum Spanning Forest & Connected Components
• … utilizing the Gauss-Seidel "acceleration"

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluation:
• HD vs. SSD
• Striping data across multiple hard drives
• Comparison to an in-memory version
• Bottlenecks analysis
• Effect of the number of shards
• Block size and performance

37

GraphChi

• C++ implementation: 8,000 lines of code
  – Java implementation also available
• Several optimizations to PSW (see paper)

Source code and examples: http://github.com/graphchi

38

Experiment Setting

• Mac Mini (Apple Inc.)
  – 8 GB RAM
  – 256 GB SSD, 1 TB hard drive
  – Intel Core i5, 2.5 GHz
• Experiment graphs:

Graph          Vertices   Edges   P (shards)   Preprocessing
live-journal   4.8M       69M     3            0.5 min
netflix        0.5M       99M     20           1 min
twitter-2010   42M        1.5B    20           2 min
uk-2007-05     106M       3.7B    40           31 min
uk-union       133M       5.4B    50           33 min
yahoo-web      1.4B       6.6B    50           37 min

39

Comparison to Existing Systems

Notes: comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all but GraphLab compute synchronously.

See the paper for more comparisons.

[Charts: runtime in minutes, GraphChi on a Mac Mini vs. distributed systems]
• Matrix Factorization (Alt. Least Sqr.), Netflix (0.99B edges): GraphLab v1 (8 cores) vs. GraphChi (Mac Mini); axis 0-12 minutes
• PageRank, Twitter-2010 (1.5B edges): Spark (50 machines) vs. GraphChi (Mac Mini); axis 0-14 minutes
• WebGraph Belief Propagation (U. Kang et al.), Yahoo-web (6.7B edges): Pegasus / Hadoop (100 machines) vs. GraphChi (Mac Mini); axis 0-30 minutes
• Triangle Counting, twitter-2010 (1.5B edges): Hadoop (1636 machines) vs. GraphChi (Mac Mini); axis 0-450 minutes

On a Mac Mini, GraphChi can solve as big problems as existing large-scale systems, with comparable performance.

40

PowerGraph Comparison
• PowerGraph / GraphLab 2 outperforms previous systems by a wide margin on natural graphs
• With 64× more machines, 512× more CPUs:
  – Pagerank: 40x faster than GraphChi
  – Triangle counting: 30x faster than GraphChi

(OSDI'12)

GraphChi has good performance per CPU.

[Chart: GraphLab 2 vs. GraphChi]

41

In-memory Comparison

• Total runtime comparison to 1-shard GraphChi, with initial load + output write taken into account
• Comparison to state-of-the-art: 5 iterations of Pagerank, Twitter (1.5B edges)

  GraphChi, Mac Mini, SSD:                              790 secs
  Ligra (J. Shun, Blelloch), 40-core Intel E7-8870:     15 secs

However, sometimes a better algorithm is available for in-memory than for external-memory / distributed settings:

  Ligra (J. Shun, Blelloch), 8-core Xeon 5550:                            230 s + preproc 144 s
  PSW, in-mem version, 700 shards (see Appendix), 8-core Xeon 5550:       100 s + preproc 210 s

42

Scalability Input Size [SSD]

• Throughput: number of edges processed / second

Conclusion: the throughput remains roughly constant when the graph size is increased.
GraphChi with a hard drive is ~2x slower than with an SSD (if the computational cost is low).

[Chart: PageRank throughput on a Mac Mini with SSD; performance (edges/sec, up to ~2.5e7) vs. graph size (up to ~7e9 edges) for the graphs domain, twitter-2010, uk-2007-05, uk-union, yahoo-web]

Paper: scalability of other applications.

43

GRAPHCHI-DB: New work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

• What if there is a lot of metadata associated with edges and vertices?
• How to do graph queries efficiently, while retaining computational capabilities?
• How to add edges efficiently to the graph?

Can we design a graph database based on GraphChi?

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problems:
• Poor performance with data >> memory
• No/weak support for analytical computation

2) Relational / key-value databases as graph storage

Problems:
• Large indices
• In-edge / out-edge dilemma
• No/weak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

[Figure: Shard 1, Shard 2, Shard 3, …, Shard P over vertex ranges 0…100, 101…700, …, N; inside each shard, edges are sorted by source_id]

Edge = (src, dst). Partition by dst.

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

…         …

50

Shard Structure (Basic)

Destination: 8, 193, 76420, 12, 872, 193, 212, 89139, …          (Edge-array)

Source   File offset
1        0
3        3
7        5
…        …
(Pointer-array)

Compressed Sparse Row (CSR)

Problem 1: How to find the in-edges of a vertex quickly?
Note: We know the shard, but its edges are in random order with respect to destination (they are sorted by source).

51
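A small sketch of how the pointer-array (the CSR offsets) can be built from a shard whose edges are already sorted by source. This is illustrative only; field names are assumptions, not GraphChi-DB's format.

def build_pointer_array(edges_sorted_by_src):
    """edges_sorted_by_src: list of (src, dst). Returns [(src, first_offset), ...]."""
    pointers = []
    for offset, (src, _dst) in enumerate(edges_sorted_by_src):
        if not pointers or pointers[-1][0] != src:
            pointers.append((src, offset))   # offset of this source's first edge
    return pointers

edges = [(1, 8), (1, 193), (1, 76420), (3, 12), (3, 872), (7, 193), (7, 212), (7, 89139)]
print(build_pointer_array(edges))            # [(1, 0), (3, 3), (7, 5)], as in the slide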

PAL In-edge Linkage

Destination: 8, 193, 76420, 12, 872, 193, 212, 89139, …          (Edge-array)

Source   File offset
1        0
3        3
7        5
…        …
(Pointer-array)

52

PAL In-edge Linkage

Destination   Link
8             3339
193           3
76420         1092
12            289
872           40
193           2002
212           12
89139         22
…             …
(Edge-array)

Source   File offset
1        0
3        3
7        5
…        …
(Pointer-array)

+ Index to the first in-edge for each vertex in the interval.
Augmented linked list for in-edges: each edge stores a link to the next in-edge of the same destination (see the sketch below).

Problem 2: How to find out-edges quickly?
Note: Sorted inside a shard, but partitioned across all shards.

53
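A sketch of how an in-edge query can follow this augmented linked list inside one shard, assuming an in-memory index that gives the offset of the first in-edge of each vertex; the names first_in_offset and next_in_offset, and the toy values, are illustrative only.

# edge_array[i] = (destination, next_in_offset); next_in_offset = -1 ends the chain.
def in_edges(edge_array, first_in_offset, vertex):
    offset = first_in_offset.get(vertex, -1)   # index to the first in-edge (kept in memory)
    result = []
    while offset != -1:
        dst, nxt = edge_array[offset]
        assert dst == vertex                   # every edge on the chain shares the destination
        result.append(offset)                  # offsets identify edges; data columns share them
        offset = nxt
    return result

# Toy shard with two in-edges for vertex 193, chained 5 -> 1 -> end.
edge_array = [(8, -1), (193, -1), (76420, -1), (12, -1), (872, -1), (193, 1), (212, -1), (89139, -1)]
first_in = {8: 0, 193: 5, 76420: 2, 12: 3, 872: 4, 212: 6, 89139: 7}
print(in_edges(edge_array, first_in, 193))     # [5, 1]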

Source   File offset
1        0
3        3
7        5
…        …
(Pointer-array)

PAL Out-edge Queries

Problem: the pointer-array can be big, O(V), and binary search on disk is slow.

[Figure: edge-array with (Destination, Next-in-offset) columns as before; a sparse index keeps only some pointer entries, e.g. at positions 256, 512, 1024, …]

Option 1: Sparse index (in-memory).
Option 2: Delta-coding with a unary code (Elias-Gamma); completely in-memory. (See the out-edge query sketch below.)

54
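A sketch of Option 1 for locating a vertex's out-edges within one shard: binary-search a sparse, in-memory sample of the pointer-array, then scan a short range of the full pointer-array. The function names and the sampling step k are assumptions for illustration, not GraphChi-DB's exact scheme.

import bisect

# pointer_array: full list of (source, first_offset), sorted by source (conceptually on disk).
# sparse_index: every k-th entry of it, kept in memory.
def out_edge_range(pointer_array, sparse_index, vertex, k):
    sources = [s for s, _ in sparse_index]
    i = max(bisect.bisect_right(sources, vertex) - 1, 0)    # nearest sampled source <= vertex
    lo = i * k                                               # that sample's position in the full array
    for j in range(lo, min(lo + k, len(pointer_array))):     # short scan instead of a full binary search
        src, first = pointer_array[j]
        if src == vertex:
            # out-edges of `vertex` in this shard: offsets [first, next source's first offset)
            end = pointer_array[j + 1][1] if j + 1 < len(pointer_array) else None
            return (first, end)
    return None                                              # vertex has no out-edges in this shard

pointer_array = [(1, 0), (3, 3), (7, 5), (9, 8), (12, 9)]
sparse_index = pointer_array[::2]                            # keep every 2nd pointer in RAM
print(out_edge_range(pointer_array, sparse_index, 7, 2))     # (5, 8)

A full out-edge query repeats this in every shard, since a vertex's out-edges are spread across all of them; this is the trade-off described on the next slide.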

Experiment Indices

Median latency, Twitter graph, 1.5B edges.

55

Queries: I/O costs
In-edge query: only one shard.
Out-edge query: each shard that has edges.
Trade-off: more shards means better locality for in-edge queries, worse for out-edge queries.

Edge Data amp Searches

[Figure: Shard X consists of an adjacency file plus separate column files for edge data: 'weight' [float], 'timestamp' [long], 'belief' (factor); edge (i) is found at the same position i in each column]

No foreign key is required to find edge data (fixed size).
Note: vertex values are stored similarly.

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merging requires loading the existing shard from disk, and every edge is rewritten on each merge.
Does not scale: the number of rewrites per edge is proportional to data size / buffer size = O(E).

60

Log-Structured Merge-tree

(LSM). Ref: O'Neil, Cheng et al. (1996)

[Figure: LEVEL 1 (youngest) holds the top shards with in-memory buffers, one per interval (interval 1 … interval P); new edges enter here. LEVEL 2 holds on-disk shards covering intervals 1-4, …, (P-4)-P. LEVEL 3 (oldest) holds shards covering intervals 1-16, and so on. Merges flow downstream.]

In-edge query: one shard on each level.
Out-edge query: all shards. (See the LSM ingest sketch below.)

61
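A compact sketch of the log-structured merge idea used for ingest. The buffer size, fan-out of 4 and the merge policy here are illustrative assumptions, not GraphChi-DB's exact parameters.

# Simplified LSM ingest: new edges go to an in-memory buffer; when a level
# fills up, its (sorted) edges are merged downstream into the next level.
class SimpleLSM:
    def __init__(self, buffer_capacity=4, fanout=4, levels=3):
        self.buffer_capacity = buffer_capacity
        self.fanout = fanout
        # level 0 = in-memory buffer; deeper levels hold ever larger sorted runs
        self.levels = [[] for _ in range(levels)]

    def add_edge(self, src, dst):
        self.levels[0].append((src, dst))
        if len(self.levels[0]) >= self.buffer_capacity:
            self._merge_down(0)

    def _merge_down(self, i):
        if i + 1 == len(self.levels):
            return                                   # the oldest level just keeps growing
        self.levels[i + 1].extend(self.levels[i])
        self.levels[i + 1].sort()                    # on disk this is a sequential rewrite
        self.levels[i] = []
        # each level holds roughly fanout times more edges than the one above it
        if len(self.levels[i + 1]) >= self.buffer_capacity * (self.fanout ** (i + 1)):
            self._merge_down(i + 1)

    def edges(self):                                 # a query must consult every level
        return [e for level in self.levels for e in level]

lsm = SimpleLSM()
for e in [(1, 2), (3, 4), (2, 1), (5, 6), (1, 7)]:
    lsm.add_edge(*e)
print(lsm.levels)

Because an edge is only rewritten when its level is merged downstream, the number of rewrites per edge grows logarithmically rather than linearly with the data size, which is what the ingest experiment below measures.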

Experiment Ingest

[Chart: edges ingested (up to 1.5e9) vs. time in seconds (up to ~16,000), for GraphChi-DB with LSM and GraphChi-No-LSM]

62

Advantages of PAL

• Only sparse and implicit indices
  – The pointer-array usually fits in RAM with Elias-Gamma coding; small database size
• Columnar data model
  – Load only the data you need
  – Graph structure is separate from data
  – Property graph model
• Great insertion throughput with LSM
  – The tree can be adjusted to match the workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

• Written in Scala
• Queries & Computation
• Online database

All experiments shown in this talk were done on a Mac Mini (8 GB RAM, SSD).

Source code and examples: http://github.com/graphchi

65

Comparison Database Size

Baseline: 4 + 4 bytes / edge.

[Chart: database file size for the twitter-2010 graph (1.5B edges), axis 0-70, comparing MySQL (data + indices), Neo4j, GraphChi-DB, and the baseline]

66

Comparison Ingest

System                  Time to ingest 1.5B edges
GraphChi-DB (ONLINE)    1 hour 45 mins
Neo4j (batch)           45 hours
MySQL (batch)           3 hours 30 minutes (including index creation)

If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes.

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

[Charts: latency percentiles over 100K random queries]

Small graph (68M edges):
  50th percentile (ms):  GraphChi-DB 0.379,  Neo4j 0.127
  99th percentile (ms):  GraphChi-DB 6.653,  Neo4j 8.078

Big graph (1.5B edges):
  50th percentile (ms):  GraphChi-DB 22.4,   Neo4j 759.8,  MySQL 5.9
  99th percentile (ms):  GraphChi-DB 1264,   Neo4j 1631,   MySQL 4776

68

LinkBench Online Graph DB Benchmark by Facebook

• Concurrent read/write workload
  – But only single-hop queries ("friends")
  – 8 different operations, mixed workload
  – Best performance with 64 parallel threads
• Each edge and vertex has:
  – Version, timestamp, type, random string payload

                           GraphChi-DB (Mac Mini)    MySQL + FB patch, server, SSD-array, 144 GB RAM
Edge-update (95p)          22 ms                     25 ms
Edge-get-neighbors (95p)   18 ms                     9 ms
Avg throughput             2,487 req/s               11,029 req/s
Database size              350 GB                    1.4 TB

See full results in the thesis

69

Summary of Experiments

• Efficient for mixed read/write workloads
  – See the Facebook LinkBench experiments in the thesis
  – The LSM-tree trades off read performance (but can be adjusted)
• State-of-the-art performance for graphs that are much larger than RAM
  – Neo4J's linked-list data structure is good for RAM
  – DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)
  – Two major direct descendant papers in top conferences:
    • X-Stream: SOSP 2013
    • TurboGraph: KDD 2013
• Challenging the mainstream
  – You can do a lot on just a PC: focus on the right data structures and computational models

73

Impact Users

• GraphChi's implementations (C++, Java, -DB) have gained a lot of users
  – Currently ~50 unique visitors / day
• Enables 'everyone' to tackle big graph problems
  – Especially the recommender toolkit (by Danny Bickson) has been very popular
  – Typical users: students, non-systems researchers, small companies…

74

Impact Users (cont)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research: it is too expensive.
Using GraphChi, I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research. (…)"

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models
1. Changing the value of an edge (both in- and out-)
2. Computation: process the whole (or most of the) graph on each iteration
3. Random access to all of a vertex's edges
  – Vertex-centric vs. edge-centric
4. Async / Gauss-Seidel execution

[Figure: edge between vertices u and v]

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
  – Or dynamic scheduling (such as Splash BP)
• High-diameter graphs, such as planar graphs
  – Unless the computation itself has short-range interactions
• Very large number of iterations
  – Neural networks
  – LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

• Distributed setting
  1. Distributed PSW (one shard / node)?
     1. PSW is inherently sequential
     2. Low bandwidth in the Cloud
  2. Co-operating GraphChi(-DB)'s connected with a Parameter Server
• New graph programming models and tools
  – Vertex-centric programming is sometimes too local: for example, two-hop interactions and many traversals are cumbersome
  – Abstractions for learning graph structure? Implicit graphs?
  – Hard to debug, especially async; better tools needed
• Graph-aware optimizations to GraphChi-DB
  – Buffer management
  – Smart caching
  – Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM
  – Good: works with very little memory
  – Bad: does not benefit from more memory
• Simple trick: cache some data
• Many graph algorithms have O(V) state
  – The update function accesses neighbor vertex state
    • Standard PSW: 'broadcast' the vertex value via edges
    • Semi-external: store vertex values in memory

82

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
  – But not enough to store the whole graph

[Figure: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state, …) in RAM]

83
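A sketch of the semi-external trick: keep the O(V) vertex-value array in RAM while the edges are still streamed from disk, shard by shard. The file format and helper names below are illustrative assumptions, not GraphChi's on-disk format.

import struct

# Semi-external PageRank-style pass: vertex values live in a RAM array,
# edges are streamed from shard files as packed (src, dst) uint32 pairs.
def semi_external_pass(shard_files, num_vertices, damping=0.85):
    values = [1.0] * num_vertices                    # O(V) state held in memory
    sums = [0.0] * num_vertices
    degree = [0] * num_vertices
    for path in shard_files:                         # first sweep: out-degrees
        with open(path, "rb") as f:
            while chunk := f.read(8):
                src, _dst = struct.unpack("<II", chunk)
                degree[src] += 1
    for path in shard_files:                         # second sweep: accumulate contributions
        with open(path, "rb") as f:
            while chunk := f.read(8):
                src, dst = struct.unpack("<II", chunk)
                sums[dst] += values[src] / max(degree[src], 1)
    return [(1 - damping) + damping * s for s in sums]

Because neighbor values are read from the in-memory array, there is no need to 'broadcast' them over edge files, which is exactly the semi-external variant described above.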

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
  – Store billions of random walk states in RAM
• Multiple Breadth-First Searches
  – Analyze neighborhood sizes by starting hundreds of random BFSes
• Compute in parallel many different recommender algorithms (or with different parameterizations)
  – See Mayank Mohta and Shu-Hao Yu's Master's project

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin) – UAI 2010

Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors) – VLDB 2012

GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch) – OSDI 2012

DrunkardMob: Billions of Random Walks on Just a PC – ACM RecSys 2013

Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch) – SEA 2014

GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin) – (submitted, arXiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin) – ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation: GraphChi
• Extended PSW to design Partitioned Adjacency Lists, to build a scalable graph database: GraphChi-DB
• Proposed DrunkardMob for simulating billions of random walks in parallel
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms: a new approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                     GraphChi (40 Mac Minis)    PowerGraph (64 EC2 cc1.4xlarge)
Investments          67,320 $                   -
Operating costs:
  Per node / hour    0.03 $                     1.30 $
  Cluster / hour     1.19 $                     52.00 $
  Daily              28.56 $                    1,248.00 $

It takes about 56 days to recoup the Mac Mini investments.

Assumptions:
- Mac Mini: 85 W (typical servers: 500-1000 W)
- Most expensive US energy: 35c / kWh

Equal-throughput configurations (based on OSDI'12).

89
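As a quick sanity check on the payback figure (simple arithmetic on the numbers above): the daily operating-cost difference is about 1,248.00 - 28.56 ≈ 1,219 $, and 67,320 $ / 1,219 $ per day ≈ 55 days, i.e. roughly the quoted 56 days.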

PSW for In-memory Computation

• External-memory setting
  – Slow memory = hard disk / SSD
  – Fast memory = RAM
• In-memory
  – Slow = RAM
  – Fast = CPU caches

[Figure: external-memory model; small (fast) memory of size M, large (slow) memory of size 'unlimited', block transfers of size B between them, CPU attached to the fast memory]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.
(Ligra: Julian Shun, Guy Blelloch, PPoPP 2013)

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
  – But needs twice the amount of space
• Async / G-S helps with high-diameter graphs
• Some algorithms converge much better asynchronously
  – Loopy BP: see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

• See the paper for a theoretical analysis in the Aggarwal-Vitter I/O model
  – Worst case is only 2x the best case
• Intuition:

[Figure: vertices 1…|V| split into interval(1), interval(2), …, interval(P), with shard(1), shard(2), …, shard(P); an example edge (v1, v2)]

An edge within a single interval is loaded from disk only once per iteration.
An edge spanning two intervals is loaded twice per iteration.

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise they would require too many iterations
• Per-iteration cost is not much affected
• Can we optimize partitioning?
  – Could help thanks to Gauss-Seidel (faster convergence inside "groups"); topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression Would it help

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
  – But they require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)
• Compression is problematic for evolving graphs and associated data
• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al., 2011)

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases
  – GOOD, GraphDB, HyperGraphDB…
  – Focus was on modeling; graph storage was built on top of a relational DB or key-value store
• RDF databases
  – Most do not use graph storage but store triples as relations + use indexing
• Modern solutions have proposed graph-specific storage
  – Neo4j: doubly linked list
  – TurboGraph: adjacency list chopped into pages
  – DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time 9 hours; FB's 12 hours
• GraphChi database about 250 GB; FB > 1.4 terabytes
  – However, about 100 GB is explained by different variable-data (payload) sizes
• Facebook: MySQL via JDBC; GraphChi embedded
  – But MySQL is native code, GraphChi-DB is Scala (JVM)
• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance / size of DB

[Chart: requests / sec (up to ~10,000) vs. number of nodes (1.0e7, 1.0e8, 2.0e8, 1.0e9)]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory extension [SSDAlloc, NSDI'11]
   → Too many small objects; need millions / sec
2. Compress the graph structure to fit into RAM [WebGraph framework]
   → Associated values do not compress well and are mutated
3. Cluster the graph and handle each cluster separately in RAM
   → Expensive: the number of inter-cluster edges is big
4. Caching of hot nodes
   → Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance.

Multiple hard drives (RAID-ish)

• GraphChi supports striping shards to multiple disks: parallel I/O.

[Chart: seconds / iteration (up to ~3000) for Pagerank and Connected Components, with 1, 2, and 3 hard drives]

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

[Chart: Connected Components on a Mac Mini with SSD; runtime (up to ~2500 s) with 1, 2, and 4 threads, broken down into Disk I/O, Graph construction, and Exec. updates]

• The cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
  – Graph construction requires a lot of random access in RAM; memory bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on a MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime in seconds vs. number of threads (1, 2, 4), split into Loading and Computation, for Matrix Factorization (ALS, up to ~160 s) and Connected Components (up to ~1000 s)]

104

In-memory vs Disk

105

Experiment Query latency

See the thesis for an I/O cost analysis of in/out queries.

106

Example Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S
• Very fast in GraphChi-DB
  – Sufficient to query for out-edges (see the sketch below)
  – Parallelizes well: multi-out-edge query
• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods from the graph
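A sketch of why out-edge queries suffice for induced subgraphs, over a toy in-memory out-edge map (in GraphChi-DB this would be the multi-out-edge query instead; names here are illustrative).

# Induced subgraph of vertex set S: query the out-edges of every v in S and keep
# only those whose endpoint is also in S. No in-edge queries are needed, because
# every edge of the induced subgraph has its source in S.
def induced_subgraph(out_edges, S):
    S = set(S)
    return [(v, w) for v in S for w in out_edges.get(v, []) if w in S]

out_edges = {1: [2, 3, 9], 2: [3], 3: [1, 4], 4: [5]}
print(induced_subgraph(out_edges, {1, 2, 3}))   # edges among {1,2,3}: (1,2), (1,3), (2,3), (3,1)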

Vertices / Nodes
• Vertices are partitioned similarly as edges
  – Similar "data shards" for the columns
• Lookup / update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
  – A sparse structure could be supported

[Figure: vertex id space divided into intervals 1…a, a+1…2a, …, up to Max-id]

ID-mapping
• Vertex IDs are mapped to internal IDs to balance the shards
  – Interval length: constant a

[Figure: internal ids 0, 1, 2, …; a, a+1, a+2, …; 2a, 2a+1, …; original IDs 0…255 map into the first interval, 256…511 into the second, and so on, one 256-id block per interval in round-robin order]

109
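One simple way to realize such a balancing translation, given purely as an illustrative assumption (the exact mapping used by GraphChi-DB may differ): consecutive blocks of 256 original IDs are dealt round-robin to the P intervals of length a, so that low, "hot" IDs spread across all shards.

BLOCK = 256

def to_internal(orig_id, a, P):
    """Map an original vertex id to an internal id; interval length a, P intervals."""
    block, pos = divmod(orig_id, BLOCK)
    interval = block % P                      # round-robin the 256-id blocks over intervals
    slot = (block // P) * BLOCK + pos         # where inside that interval the block lands
    return interval * a + slot

def to_original(internal_id, a, P):
    interval, slot = divmod(internal_id, a)
    block_in_interval, pos = divmod(slot, BLOCK)
    return (block_in_interval * P + interval) * BLOCK + pos

a, P = 1 << 20, 4
for v in (0, 255, 256, 511, 1000000):
    assert to_original(to_internal(v, a, P), a, P) == v   # the mapping is invertible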

What if we have a Cluster

[Figure: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state, …) in RAM]

Trade latency for throughput.

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications
2. … which is caused by the lack of good data available to academics: big graphs with metadata
  – Industry co-operation? But a problem with reproducibility
  – Also, it is hard to ask good questions about graphs (especially with just the structure)
3. Too much focus on performance. More important to enable "extracting value".

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems: each hop across a partition boundary is costly.
Disk-based "single-machine" graph systems: "paging" from disk is costly.

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
  – Reverse thinking:

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note Need to store only current position of each walk
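A self-contained toy version of the same reverse-thinking idea: walks are bucketed by the interval containing their current vertex, so each step only needs one interval's out-edges in memory at a time. All names here are illustrative, not the DrunkardMob API.

import random

def drunkardmob_step(walks, out_edges, intervals):
    buckets = [[] for _ in intervals]
    for w, v in enumerate(walks):                       # snapshot: walk w sits at vertex v
        p = next(i for i, (lo, hi) in enumerate(intervals) if lo <= v <= hi)
        buckets[p].append(w)
    for p, (lo, hi) in enumerate(intervals):            # process one interval at a time
        subgraph = {v: out_edges.get(v, []) for v in range(lo, hi + 1)}
        for w in buckets[p]:
            nbrs = subgraph[walks[w]]
            if nbrs:
                walks[w] = random.choice(nbrs)          # only the current position is stored

out_edges = {0: [1, 2], 1: [3], 2: [0, 3], 3: [1]}
walks = [0, 0, 2, 3, 1]                                 # five walks, identified by index
intervals = [(0, 1), (2, 3)]
for _ in range(10):
    drunkardmob_step(walks, out_edges, intervals)
print(walks)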

Page 5: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

5

Why on a single machine

Canrsquot we just use the Cloud

Large-Scale Graph Computation on a Just a PC

6

Why use a cluster

Two reasons1 One computer cannot handle my graph problem

in a reasonable time

2 I need to solve the problem very fast

7

Why use a cluster

Two reasons1 One computer cannot handle my graph problem

in a reasonable time

2 I need to solve the problem very fast

Our work expands the space of feasible problems on one machine (PC)- Our experiments use the same graphs or bigger than previous papers on distributed graph computation (+ we can do Twitter graph on a laptop)

Our work raises the bar on required performance for a ldquocomplicatedrdquo system

8

Benefits of single machine systems

Assuming it can handle your big problemshellip1 Programmer productivityndash Global state debuggershellip

2 Inexpensive to install administer less power

3 Scalabilityndash Use cluster of single-machine systems

to solve many tasks in parallel Idea Trade latency

for throughputlt 32K bitssec

9

Computing on Big Graphs

Large-Scale Graph Computation on Just a PC

10

Big Graphs = Big Data

Data size 140 billion connections

asymp 1 TB

Not a problem

Computation

Hard to scale

Twitter network visualization by Akshay Java 2009

11

Research Goal

Compute on graphs with billions of edges in a reasonable time on a single PC

ndash Reasonable = close to numbers previously reported for distributed systems in the literature

Experiment PC Mac Mini (2012)

12

Terminology

bull (Analytical) Graph ComputationndashWhole graph is processed typically for

several iterations vertex-centric computation

ndash Examples Belief Propagation Pagerank Community detection Triangle Counting Matrix Factorization Machine Learninghellip

bull Graph Queries (database)ndash Selective graph queries (compare to SQL

queries)ndash Traversals shortest-path friends-of-

friendshellip

13

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Thesis statement

The Parallel Sliding Windows algorithm and the Partitioned Adjacency Lists data structure enable

computation on very large graphs in external memory on just a personal computer

14

DISK-BASED GRAPH COMPUTATION

15

GraphChi (Parallel Sliding Windows)

Batchcomp

Evolving graph

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Matrix Factorization

Loopy Belief PropagationCo-EM

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

DrunkardMob Parallel Random walk simulation

GraphChi^2

Incremental comp

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Graph sampling

Graph Queries

Graph traversal

16

Computational Model

bull Graph G = (V E) ndash directed edges e =

(source destination)ndash each edge and vertex

associated with a value (user-defined type)

ndash vertex and edge values can be modifiedbull (structure modification also

supported) Data

Data Data

Data

DataDataData

DataData

Data

A Be

Terms e is an out-edge of A and in-edge of B

17

Data

Data Data

Data

DataDataData

Data

Data

Data

Vertex-centric Programming

bull ldquoThink like a vertexrdquobull Popularized by the Pregel and GraphLab

projectsndash Historically systolic computation and the Connection Machine

MyFunc(vertex) modify neighborhood Data

Data Data

Data

Data

function Pagerank(vertex) insum = sum(edgevalue for edge in vertexinedges) vertexvalue = 085 + 015 insum foreach edge in vertexoutedges edgevalue = vertexvalue vertexnum_outedges

18

Computational Setting

ConstraintsA Not enough memory to store the whole

graph in memory nor all the vertex values

B Enough memory to store one vertex and its edges w associated values

19

The Main Challenge of Disk-based Graph Computation

Random Access

~ 100K reads sec (commodity)~ 1M reads sec (high-end arrays)

ltlt 5-10 M random edges sec to achieve ldquoreasonable performancerdquo

20

Random Access Problem

File edge-values

A in-edges A out-edges

A B

B in-edges B out-edges

x

Moral You can either access in- or out-edges sequentially but not both

Random write

Random readProcessing sequentially

21

Our Solution

Parallel Sliding Windows (PSW)

22

Parallel Sliding Windows Phases

bull PSW processes the graph one sub-graph a time

bull In one iteration the whole graph is processedndash And typically next iteration is started

1 Load

2 Compute

3 Write

23

bull Vertices are numbered from 1 to nndash P intervalsndash sub-graph = interval of vertices

PSW Shards and Intervals

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 10000100 700

1 Load

2 Compute

3 Write

ltltpartition-bygtgt

source destinationedge

In shards edges sorted by source

24

Example Layout

Shard 1

Shards small enough to fit in memory balance size of shards

Shard in-edges for interval of vertices sorted by source-id

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Shard 2 Shard 3 Shard 4Shard 1

1 Load

2 Compute

3 Write

25

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Load all in-edgesin memory

Load subgraph for vertices 1100

What about out-edges Arranged in sequence in other shards

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Shard 1

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

bull Bulk-Synchronous Parallel (Jacobi iterations)ndash Updates see neighborsrsquo values from

previous iteration [Most systems are synchronous]

bull Asynchronous (Gauss-Seidel iterations)ndash Updates see most recent valuesndash GraphLab is asynchronous

30

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW runs Gauss-Seidel

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Fresh values in this ldquowindowrdquo from previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph diameter ndashmany iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N - 1) (e ndash 1) asymp 60 of synchronous

Each vertex chooses minimum label of neighbor

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time: 9 hours; FB's: 12 hours
• GraphChi database about 250 GB; FB's > 1.4 terabytes
  – However, about 100 GB is explained by different variable data (payload) sizes

• Facebook: MySQL via JDBC; GraphChi-DB embedded
  – But MySQL is native code, GraphChi-DB is Scala (JVM)

• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance vs. size of DB

[Chart: requests/sec (0-10,000) as a function of the number of nodes (1.00E+07, 1.00E+08, 2.00E+08, 1.00E+09).]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory-extension [SSDAlloc, NSDI'11]
   – Problem: too many small objects; would need millions of accesses/sec

2. Compress the graph structure to fit into RAM [WebGraph framework]
   – Problem: associated values do not compress well and are mutated

3. Cluster the graph and handle each cluster separately in RAM
   – Problem: expensive; the number of inter-cluster edges is big

4. Caching of hot nodes
   – Problem: unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance.

Multiple hard-drives (RAID-ish)

• GraphChi supports striping shards to multiple disks: parallel I/O.

[Chart: seconds per iteration (0-3000) for Pagerank and Connected Components with 1, 2, and 3 hard drives. Experiment on a 16-core AMD server (from year 2007).]

Bottlenecks

[Chart: Connected Components on Mac Mini SSD - runtime (0-2500 seconds) with 1, 2, and 4 threads, broken down into disk I/O, graph construction, and executing updates.]

• Cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
  – Graph construction requires a lot of random access in RAM; memory bandwidth becomes a bottleneck

Bottlenecks: Multicore

Experiment on MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution

• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime in seconds vs. number of threads (1, 2, 4) for Matrix Factorization (ALS) (0-160 s) and Connected Components (0-1000 s), split into loading and computation time.]

104

In-memory vs Disk

105

Experiment Query latency

See thesis for I/O cost analysis of in- and out-edge queries

106

Example: Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S

• Very fast in GraphChi-DB
  – Sufficient to query for out-edges (see the sketch below)
  – Parallelizes well: multi-out-edge-query

• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods from the graph
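A minimal sketch of the idea in Python; the DictGraph class and its out_edges method are hypothetical stand-ins for the GraphChi-DB query interface, not its actual API:

  class DictGraph:
      # Hypothetical stand-in for a graph store that answers out-edge queries.
      def __init__(self, edges):
          self.out = {}
          for u, v in edges:
              self.out.setdefault(u, []).append(v)
      def out_edges(self, u):
          return self.out.get(u, [])

  def induced_subgraph(db, S):
      # Only out-edge queries are needed: every edge (u, v) with both
      # endpoints in S is discovered while scanning u's out-edges.
      S = set(S)
      return [(u, v) for u in S for v in db.out_edges(u) if v in S]

  g = DictGraph([(1, 2), (2, 3), (3, 1), (1, 4)])
  print(induced_subgraph(g, {1, 2, 3}))   # [(1, 2), (2, 3), (3, 1)] in some order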

Vertices / Nodes

• Vertices are partitioned similarly to edges
  – Similar "data shards" for columns
• Lookup/update of vertex data is O(1) (see the sketch below)
• No merge tree here: vertex files are "dense"
  – Sparse structure could be supported

[Figure: dense vertex files partitioned at internal IDs 1, ..., a, a+1, ..., 2a, ..., Max-id.]
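A minimal sketch of why the dense layout gives constant-time access; the 8-byte value size and the function name are illustrative assumptions, not the GraphChi-DB file format:

  VALUE_SIZE = 8  # bytes per vertex value; an illustrative fixed-size type

  def vertex_value_offset(internal_id, interval_start):
      # The per-interval vertex file is dense, so a value is located by plain
      # offset arithmetic - an O(1) lookup/update with no index or merge tree.
      return (internal_id - interval_start) * VALUE_SIZE

  print(vertex_value_offset(internal_id=1_000_123, interval_start=1_000_000))  # 984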

ID-mapping

• Vertex IDs are mapped to internal IDs to balance shards
  – Interval length is a constant a

[Figure: mapping table from original to internal IDs - original IDs 0, 1, 2, ..., 255 map to internal IDs 0, 1, 2, ...; original IDs 256, 257, 258, ..., 511 map to a, a+1, a+2, ...; the next block maps to 2a, 2a+1, ..., and so on.]

109

What if we have a Cluster?

[Figure: the graph and PSW working memory stay on disk; the computational state of many parallel computations (Computation 1 state, Computation 2 state, ...) is kept in RAM.]

Trade latency for throughput.

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications

2. ... which is caused by the lack of good data available to academics: big graphs with metadata
   – Industry co-operation? But there is a problem with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just the structure)

3. Too much focus on performance. More important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

  parfor walk in walks:
      for i = 1 to numsteps:
          vertex = walk.atVertex()
          walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems: "paging" from disk is costly.

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm - reverse thinking:

  ForEach interval p:
      walkSnapshot = getWalksForInterval(p)
      ForEach vertex in interval(p):
          mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
          ForEach walk in mywalks:
              walkManager.addHop(walk, vertex.randomNeighbor())

Note: need to store only the current position of each walk.
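A self-contained toy version of the idea in Python (illustrative only; the real DrunkardMob keeps billions of walk positions in compact arrays and loads intervals with PSW):

  import random

  def drunkardmob_toy(adj, starts, num_hops, num_intervals):
      # Advance all walks num_hops steps, processing vertices interval by interval.
      n = len(adj)
      interval_of = lambda v: v * num_intervals // n
      walk_pos = list(starts)                   # only the current position per walk
      for _ in range(num_hops):
          # Snapshot the walks per interval at the start of the pass, mirroring
          # getWalksForInterval(p), so each walk takes exactly one hop per pass.
          by_interval = {}
          for i, v in enumerate(walk_pos):
              by_interval.setdefault(interval_of(v), []).append(i)
          for p in range(num_intervals):
              for i in by_interval.get(p, []):
                  v = walk_pos[i]
                  if adj[v]:
                      walk_pos[i] = random.choice(adj[v])
      return walk_pos

  # Tiny example graph: a directed cycle of 8 vertices.
  adj = [[(v + 1) % 8] for v in range(8)]
  print(drunkardmob_toy(adj, starts=[0, 2, 4], num_hops=5, num_intervals=2))  # [5, 7, 1]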

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • "Gauss-Seidel" Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW & External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref O'Neil, Cheng et al (1996)
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact "Big Data" Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 6: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

6

Why use a cluster

Two reasons1 One computer cannot handle my graph problem

in a reasonable time

2 I need to solve the problem very fast

7

Why use a cluster

Two reasons1 One computer cannot handle my graph problem

in a reasonable time

2 I need to solve the problem very fast

Our work expands the space of feasible problems on one machine (PC)- Our experiments use the same graphs or bigger than previous papers on distributed graph computation (+ we can do Twitter graph on a laptop)

Our work raises the bar on required performance for a ldquocomplicatedrdquo system

8

Benefits of single machine systems

Assuming it can handle your big problemshellip1 Programmer productivityndash Global state debuggershellip

2 Inexpensive to install administer less power

3 Scalabilityndash Use cluster of single-machine systems

to solve many tasks in parallel Idea Trade latency

for throughputlt 32K bitssec

9

Computing on Big Graphs

Large-Scale Graph Computation on Just a PC

10

Big Graphs = Big Data

Data size 140 billion connections

asymp 1 TB

Not a problem

Computation

Hard to scale

Twitter network visualization by Akshay Java 2009

11

Research Goal

Compute on graphs with billions of edges in a reasonable time on a single PC

ndash Reasonable = close to numbers previously reported for distributed systems in the literature

Experiment PC Mac Mini (2012)

12

Terminology

bull (Analytical) Graph ComputationndashWhole graph is processed typically for

several iterations vertex-centric computation

ndash Examples Belief Propagation Pagerank Community detection Triangle Counting Matrix Factorization Machine Learninghellip

bull Graph Queries (database)ndash Selective graph queries (compare to SQL

queries)ndash Traversals shortest-path friends-of-

friendshellip

13

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Thesis statement

The Parallel Sliding Windows algorithm and the Partitioned Adjacency Lists data structure enable

computation on very large graphs in external memory on just a personal computer

14

DISK-BASED GRAPH COMPUTATION

15

GraphChi (Parallel Sliding Windows)

Batchcomp

Evolving graph

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Matrix Factorization

Loopy Belief PropagationCo-EM

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

DrunkardMob Parallel Random walk simulation

GraphChi^2

Incremental comp

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Graph sampling

Graph Queries

Graph traversal

16

Computational Model

bull Graph G = (V E) ndash directed edges e =

(source destination)ndash each edge and vertex

associated with a value (user-defined type)

ndash vertex and edge values can be modifiedbull (structure modification also

supported) Data

Data Data

Data

DataDataData

DataData

Data

A Be

Terms e is an out-edge of A and in-edge of B

17

Data

Data Data

Data

DataDataData

Data

Data

Data

Vertex-centric Programming

bull ldquoThink like a vertexrdquobull Popularized by the Pregel and GraphLab

projectsndash Historically systolic computation and the Connection Machine

MyFunc(vertex) modify neighborhood Data

Data Data

Data

Data

function Pagerank(vertex) insum = sum(edgevalue for edge in vertexinedges) vertexvalue = 085 + 015 insum foreach edge in vertexoutedges edgevalue = vertexvalue vertexnum_outedges

18

Computational Setting

ConstraintsA Not enough memory to store the whole

graph in memory nor all the vertex values

B Enough memory to store one vertex and its edges w associated values

19

The Main Challenge of Disk-based Graph Computation

Random Access

~ 100K reads sec (commodity)~ 1M reads sec (high-end arrays)

ltlt 5-10 M random edges sec to achieve ldquoreasonable performancerdquo

20

Random Access Problem

File edge-values

A in-edges A out-edges

A B

B in-edges B out-edges

x

Moral You can either access in- or out-edges sequentially but not both

Random write

Random readProcessing sequentially

21

Our Solution

Parallel Sliding Windows (PSW)

22

Parallel Sliding Windows Phases

bull PSW processes the graph one sub-graph a time

bull In one iteration the whole graph is processedndash And typically next iteration is started

1 Load

2 Compute

3 Write

23

bull Vertices are numbered from 1 to nndash P intervalsndash sub-graph = interval of vertices

PSW Shards and Intervals

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 10000100 700

1 Load

2 Compute

3 Write

ltltpartition-bygtgt

source destinationedge

In shards edges sorted by source

24

Example Layout

Shard 1

Shards small enough to fit in memory balance size of shards

Shard in-edges for interval of vertices sorted by source-id

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Shard 2 Shard 3 Shard 4Shard 1

1 Load

2 Compute

3 Write

25

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Load all in-edgesin memory

Load subgraph for vertices 1100

What about out-edges Arranged in sequence in other shards

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Shard 1

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

bull Bulk-Synchronous Parallel (Jacobi iterations)ndash Updates see neighborsrsquo values from

previous iteration [Most systems are synchronous]

bull Asynchronous (Gauss-Seidel iterations)ndash Updates see most recent valuesndash GraphLab is asynchronous

30

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW runs Gauss-Seidel

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Fresh values in this ldquowindowrdquo from previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph diameter ndashmany iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N - 1) (e ndash 1) asymp 60 of synchronous

Each vertex chooses minimum label of neighbor

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 7: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

7

Why use a cluster

Two reasons1 One computer cannot handle my graph problem

in a reasonable time

2 I need to solve the problem very fast

Our work expands the space of feasible problems on one machine (PC)- Our experiments use the same graphs or bigger than previous papers on distributed graph computation (+ we can do Twitter graph on a laptop)

Our work raises the bar on required performance for a ldquocomplicatedrdquo system

8

Benefits of single machine systems

Assuming it can handle your big problemshellip1 Programmer productivityndash Global state debuggershellip

2 Inexpensive to install administer less power

3 Scalabilityndash Use cluster of single-machine systems

to solve many tasks in parallel Idea Trade latency

for throughputlt 32K bitssec

9

Computing on Big Graphs

Large-Scale Graph Computation on Just a PC

10

Big Graphs = Big Data

Data size 140 billion connections

asymp 1 TB

Not a problem

Computation

Hard to scale

Twitter network visualization by Akshay Java 2009

11

Research Goal

Compute on graphs with billions of edges in a reasonable time on a single PC

ndash Reasonable = close to numbers previously reported for distributed systems in the literature

Experiment PC Mac Mini (2012)

12

Terminology

bull (Analytical) Graph ComputationndashWhole graph is processed typically for

several iterations vertex-centric computation

ndash Examples Belief Propagation Pagerank Community detection Triangle Counting Matrix Factorization Machine Learninghellip

bull Graph Queries (database)ndash Selective graph queries (compare to SQL

queries)ndash Traversals shortest-path friends-of-

friendshellip

13

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Thesis statement

The Parallel Sliding Windows algorithm and the Partitioned Adjacency Lists data structure enable

computation on very large graphs in external memory on just a personal computer

14

DISK-BASED GRAPH COMPUTATION

15

GraphChi (Parallel Sliding Windows)

Batchcomp

Evolving graph

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Matrix Factorization

Loopy Belief PropagationCo-EM

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

DrunkardMob Parallel Random walk simulation

GraphChi^2

Incremental comp

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Graph sampling

Graph Queries

Graph traversal

16

Computational Model

bull Graph G = (V E) ndash directed edges e =

(source destination)ndash each edge and vertex

associated with a value (user-defined type)

ndash vertex and edge values can be modifiedbull (structure modification also

supported) Data

Data Data

Data

DataDataData

DataData

Data

A Be

Terms e is an out-edge of A and in-edge of B

17

Data

Data Data

Data

DataDataData

Data

Data

Data

Vertex-centric Programming

bull ldquoThink like a vertexrdquobull Popularized by the Pregel and GraphLab

projectsndash Historically systolic computation and the Connection Machine

MyFunc(vertex) modify neighborhood Data

Data Data

Data

Data

function Pagerank(vertex) insum = sum(edgevalue for edge in vertexinedges) vertexvalue = 085 + 015 insum foreach edge in vertexoutedges edgevalue = vertexvalue vertexnum_outedges

18

Computational Setting

ConstraintsA Not enough memory to store the whole

graph in memory nor all the vertex values

B Enough memory to store one vertex and its edges w associated values

19

The Main Challenge of Disk-based Graph Computation

Random Access

~ 100K reads sec (commodity)~ 1M reads sec (high-end arrays)

ltlt 5-10 M random edges sec to achieve ldquoreasonable performancerdquo

20

Random Access Problem

File edge-values

A in-edges A out-edges

A B

B in-edges B out-edges

x

Moral You can either access in- or out-edges sequentially but not both

Random write

Random readProcessing sequentially

21

Our Solution

Parallel Sliding Windows (PSW)

22

Parallel Sliding Windows Phases

bull PSW processes the graph one sub-graph a time

bull In one iteration the whole graph is processedndash And typically next iteration is started

1 Load

2 Compute

3 Write

23

bull Vertices are numbered from 1 to nndash P intervalsndash sub-graph = interval of vertices

PSW Shards and Intervals

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 10000100 700

1 Load

2 Compute

3 Write

ltltpartition-bygtgt

source destinationedge

In shards edges sorted by source

24

Example Layout

Shard 1

Shards small enough to fit in memory balance size of shards

Shard in-edges for interval of vertices sorted by source-id

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Shard 2 Shard 3 Shard 4Shard 1

1 Load

2 Compute

3 Write

25

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Load all in-edgesin memory

Load subgraph for vertices 1100

What about out-edges Arranged in sequence in other shards

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Shard 1

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

bull Bulk-Synchronous Parallel (Jacobi iterations)ndash Updates see neighborsrsquo values from

previous iteration [Most systems are synchronous]

bull Asynchronous (Gauss-Seidel iterations)ndash Updates see most recent valuesndash GraphLab is asynchronous

30

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW runs Gauss-Seidel

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Fresh values in this ldquowindowrdquo from previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph diameter ndashmany iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N - 1) (e ndash 1) asymp 60 of synchronous

Each vertex chooses minimum label of neighbor

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL: In-edge Linkage

Pointer-array (Source, File offset):
1  0
3  3
7  5
…  …

Edge-array (Destination, Link):
8      3339
193    3
76420  1092
12     289
872    40
193    2002
212    12
89139  22
…

+ Index to the first in-edge for each vertex in the interval.
Augmented linked list for in-edges.

Problem 2: How to find out-edges quickly?
Note: Sorted inside a shard, but partitioned across all shards.

53
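To make the linkage concrete, here is a minimal Python sketch of one shard; it is illustrative only (class and field names are invented, not GraphChi-DB's code or file format): a CSR-style pointer-array gives each source vertex its out-edge range inside the shard, and a parallel 'link' column chains together the in-edges of each destination, starting from a per-vertex index of the first in-edge.

# Minimal sketch of one PAL shard (illustrative; not the real on-disk format).
# Edges are sorted by source; 'link' holds the position of the next edge that
# shares the same destination (-1 terminates the chain).
class ShardSketch:
    def __init__(self, edges):
        # edges: list of (src, dst) sorted by src
        self.dst = [d for (_, d) in edges]
        self.link = [-1] * len(edges)
        # CSR-style pointer-array: src -> offset of its first edge in this shard
        self.first_out = {}
        self.num_out = {}
        for i, (s, _) in enumerate(edges):
            self.first_out.setdefault(s, i)
            self.num_out[s] = self.num_out.get(s, 0) + 1
        # Index to the first in-edge of each destination, plus the chain links
        self.first_in = {}
        last_pos = {}
        for i, d in enumerate(self.dst):
            if d in last_pos:
                self.link[last_pos[d]] = i
            else:
                self.first_in[d] = i
            last_pos[d] = i

    def out_edges(self, src):
        # Out-edges of 'src' stored in this shard: a contiguous range.
        start = self.first_out.get(src)
        if start is None:
            return []
        return self.dst[start:start + self.num_out[src]]

    def in_edge_positions(self, dst):
        # Follow the in-edge chain of 'dst' inside this shard.
        pos = self.first_in.get(dst, -1)
        while pos != -1:
            yield pos
            pos = self.link[pos]

# Example with the edges from the slide:
edges = [(1, 8), (1, 193), (1, 76420), (3, 12), (3, 872),
         (7, 193), (7, 212), (7, 89139)]
shard = ShardSketch(edges)
print(shard.out_edges(7))                  # [193, 212, 89139]
print(list(shard.in_edge_positions(193)))  # [1, 5] -> edges (1,193) and (7,193)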

PAL: Out-edge Queries

Pointer-array (Source, File offset):
1  0
3  3
7  5
…  …
Problem: can be big, O(V). Binary search on disk is slow.

Edge-array (Destination, Next-in-offset):
8      3339
193    3
76420  1092
12     289
872    40
193    2002
212    12
89139  22
…

Option 1: Sparse index (in-mem): 256, 512, 1024, …
Option 2: Delta-coding with unary code (Elias-Gamma). Completely in-memory.

54
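Option 1 can be sketched as follows; this is a hedged illustration with invented names, not the benchmarked implementation: keep only every k-th (source, file offset) entry of the pointer-array in RAM, binary-search that sparse index, then scan at most k on-disk entries to locate the queried source.

import bisect

# Illustrative sparse index over the pointer-array (Option 1).
# Only every K-th entry is kept in RAM; the full pointer-array stays on disk.
pointer_array = [(1, 0), (3, 3), (7, 5), (9, 9), (12, 14), (15, 20), (21, 27)]
K = 2

sparse_index = pointer_array[::K]            # in-memory sample
sparse_keys = [src for (src, _) in sparse_index]

def read_pointer_entries(start, count):
    # Stand-in for a disk read of 'count' pointer-array entries.
    return pointer_array[start:start + count]

def find_out_offset(src):
    # Return the file offset of src's first out-edge in this shard, or None.
    i = bisect.bisect_right(sparse_keys, src) - 1   # last sparse entry <= src
    if i < 0:
        return None
    for s, offset in read_pointer_entries(i * K, K):  # scan at most K entries
        if s == src:
            return offset
    return None

print(find_out_offset(7))    # 5
print(find_out_offset(12))   # 14
print(find_out_offset(2))    # None (vertex 2 has no out-edges in this shard)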

Experiment: Indices

Median latency, Twitter graph, 1.5B edges.

55

Queries: I/O costs
In-edge query: only one shard.

Out-edge query: each shard that has edges.

Trade-off: more shards gives better locality for in-edge queries, worse for out-edge queries.

Edge Data & Searches

Shard X: adjacency; 'weight' [float]; 'timestamp' [long]; 'belief' (factor)

Edge (i)

No foreign key required to find edge data (fixed size).

Note: vertex values stored similarly.

57
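The "no foreign key" point can be illustrated with a small sketch (an invented layout, assuming fixed-size records): each edge property is a separate column, and edge i's data is addressed directly by its position i in the shard, so the adjacency structure never stores pointers into the data columns.

import array

# Illustrative columnar edge-data sketch: one fixed-size column per property.
# Edge i in a shard owns slot i of every column -- no foreign keys needed.
num_edges = 8
weights = array.array('f', [0.0] * num_edges)     # 'weight'    [float]
timestamps = array.array('q', [0] * num_edges)    # 'timestamp' [long]

def set_edge_data(i, weight, ts):
    weights[i] = weight
    timestamps[i] = ts

def get_edge_weight(i):
    # Load only the column you need (columnar model).
    return weights[i]

set_edge_data(5, 0.75, 1396000000)
print(get_edge_weight(5))    # 0.75
print(timestamps[5])         # 1396000000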

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading the existing shard from disk. Each edge will always be rewritten on a merge.

Does not scale: number of rewrites ~ data size / buffer size = O(E).

60

Log-Structured Merge-tree (LSM)
Ref: O'Neil, Cheng et al. (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1, interval 2, interval 3, interval 4, ..., interval P-1, interval P

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query: one shard on each level.

Out-edge query: all shards.

61
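As a rough sketch of the log-structured merge idea (simplified to a single partition interval, with invented names; not GraphChi-DB's actual buffer or shard code): new edges go to an in-memory buffer, a full buffer is sorted and written as a new run on the youngest level, and when a level collects too many runs they are merged downstream into the next level, so each edge is rewritten roughly once per level instead of once per buffer flush.

import random

# Simplified log-structured merge sketch (one interval, invented names).
# Runs are sorted lists of (src, dst); a level holds at most FANOUT runs.
FANOUT = 4
BUFFER_CAPACITY = 1000

buffer = []            # in-memory buffer
levels = [[], [], []]  # levels[0] = youngest on-disk runs

def add_edge(src, dst):
    buffer.append((src, dst))
    if len(buffer) >= BUFFER_CAPACITY:
        flush_buffer()

def flush_buffer():
    global buffer
    levels[0].append(sorted(buffer))
    buffer = []
    compact(0)

def compact(level):
    # Downstream merge: too many runs -> merge them into one run a level down.
    if len(levels[level]) < FANOUT:
        return
    merged = sorted(edge for run in levels[level] for edge in run)
    levels[level] = []
    if level + 1 < len(levels):
        levels[level + 1].append(merged)
        compact(level + 1)
    else:
        levels[level].append(merged)   # oldest level keeps one big run

def in_edges(dst):
    # A query must consult the buffer and the runs on every level.
    hits = [e for e in buffer if e[1] == dst]
    for level in levels:
        for run in level:
            hits.extend(e for e in run if e[1] == dst)
    return hits

for _ in range(10000):
    add_edge(random.randrange(100000), random.randrange(100000))
print(sum(len(run) for level in levels for run in level) + len(buffer))  # 10000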

Experiment: Ingest

[Chart: edges ingested (up to 1.5E+09) vs. time (seconds), GraphChi-DB with LSM vs. GraphChi-No-LSM]

62

Advantages of PAL

• Only sparse and implicit indices
  – Pointer-array usually fits in RAM with Elias-Gamma encoding
  – Small database size
• Columnar data model
  – Load only the data you need
  – Graph structure is separate from data
  – Property graph model
• Great insertion throughput with LSM
  – Tree can be adjusted to match the workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

• Written in Scala
• Queries & Computation
• Online database

All experiments shown in this talk done on a Mac Mini (8 GB, SSD).

Source code and examples: http://github.com/graphchi

65

Comparison: Database Size

[Bar chart: database file size for the twitter-2010 graph (1.5B edges) – baseline (4 + 4 bytes / edge), GraphChi-DB, Neo4j, MySQL (data + indices)]

66

Comparison: Ingest

System: Time to ingest 1.5B edges

GraphChi-DB (ONLINE): 1 hour 45 mins

Neo4j (batch): 45 hours

MySQL (batch): 3 hours 30 minutes (including index creation)

If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes.

Comparison: Friends-of-Friends Query

See thesis for shortest-path comparison.

[Bar charts: latency percentiles over 100K random queries, in milliseconds.
Small graph (68M edges): GraphChi-DB vs. Neo4j – 50th percentile 0.379 vs. 0.127 ms; 99th percentile 6.653 vs. 8.078 ms.
Big graph (1.5B edges): GraphChi-DB vs. Neo4j vs. MySQL – 99th percentile 1264, 1631, 4776 ms; a 50th-percentile panel compares the same three systems.]

68
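For reference, a friends-of-friends query is just a two-hop out-edge expansion; a minimal sketch follows (invented query function, not the benchmarked implementation):

# Two-hop "friends-of-friends" sketch (illustrative only).
out_edges = {1: [2, 3], 2: [4], 3: [4, 5], 4: [6]}

def query_out_edges(v):
    return out_edges.get(v, [])

def friends_of_friends(v):
    friends = set(query_out_edges(v))
    fof = set()
    for f in friends:                 # one out-edge query per friend
        fof.update(query_out_edges(f))
    return fof - friends - {v}

print(sorted(friends_of_friends(1)))   # [4, 5]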

LinkBench: Online Graph DB Benchmark by Facebook

• Concurrent read/write workload
  – But only single-hop queries ("friends")
  – 8 different operations, mixed workload
  – Best performance with 64 parallel threads
• Each edge and vertex has:
  – Version, timestamp, type, random string payload

                          GraphChi-DB (Mac Mini)   MySQL + FB patch, server, SSD-array, 144 GB RAM
Edge-update (95p):        22 ms                    25 ms
Edge-get-neighbors (95p): 18 ms                    9 ms
Avg throughput:           2,487 req/s              11,029 req/s
Database size:            350 GB                   1.4 TB

See full results in the thesis

69

Summary of Experiments

• Efficient for mixed read/write workload
  – See Facebook LinkBench experiments in thesis
  – LSM-tree: trade-off in read performance (but can adjust)
• State-of-the-art performance for graphs that are much larger than RAM
  – Neo4J's linked-list data structure good for RAM
  – DEX performs poorly in practice

More experiments in the thesis.

70

Discussion

Greater Impact / Hindsight

Future Research Questions

71

GREATER IMPACT

72

Impact: "Big Data" Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)
  – Two major direct descendant papers in top conferences:
    • X-Stream: SOSP 2013
    • TurboGraph: KDD 2013
• Challenging the mainstream
  – You can do a lot on just a PC: focus on the right data structures and computational models

73

Impact: Users

• GraphChi's implementations (C++, Java, -DB) have gained a lot of users
  – Currently ~50 unique visitors / day
• Enables 'everyone' to tackle big graph problems
  – Especially the recommender toolkit (by Danny Bickson) has been very popular
  – Typical users: students, non-systems researchers, small companies…

74

Impact: Users (cont.)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research – it is too expensive.

Using GraphChi I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research."

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for?

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models

1. Changing the value of an edge (both in- and out-edges)
2. Computation processes the whole, or most of, the graph on each iteration
3. Random access to all of a vertex's edges
   – Vertex-centric vs. edge-centric
4. Async / Gauss-Seidel execution

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
  – Or dynamic scheduling (such as Splash BP)
• High-diameter graphs, such as planar graphs
  – Unless the computation itself has short-range interactions
• Very large number of iterations
  – Neural networks
  – LDA with Collapsed Gibbs sampling
• No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

• Distributed setting
  1. Distributed PSW (one shard / node)?
     1. PSW is inherently sequential
     2. Low bandwidth in the Cloud
  2. Co-operating GraphChi(-DB)'s connected with a Parameter Server
• New graph programming models and tools
  – Vertex-centric programming sometimes too local: for example, two-hop interactions and many traversals are cumbersome
  – Abstractions for learning graph structure? Implicit graphs?
  – Hard to debug, especially async; better tools needed
• Graph-aware optimizations to GraphChi-DB
  – Buffer management
  – Smart caching
  – Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM
  – Good: works with very little memory
  – Bad: does not benefit from more memory

• Simple trick: cache some data

• Many graph algorithms have O(V) state
  – Update function accesses neighbor vertex state
    • Standard PSW: 'broadcast' vertex value via edges
    • Semi-external: store vertex values in memory

82

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
  – But not enough to store the whole graph

[Diagram: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state, ...) in RAM]

83
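A minimal sketch of the semi-external idea (illustrative only, standard 0.15/0.85 PageRank damping, not GraphChi's shard-by-shard PSW execution): keep one value per vertex in RAM and stream the edges from disk, so the O(V) state is resident while the O(E) structure is not.

# Semi-external PageRank sketch: O(V) vertex values in RAM, edges streamed.
def stream_edges():
    # Stand-in for streaming (src, dst) pairs from disk, one pass per call.
    yield from [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]

num_vertices = 4
out_degree = [0] * num_vertices
for src, _ in stream_edges():
    out_degree[src] += 1

rank = [1.0] * num_vertices          # in-memory O(V) state
for _ in range(20):
    incoming = [0.0] * num_vertices
    for src, dst in stream_edges():  # one sequential pass over the edges
        incoming[dst] += rank[src] / out_degree[src]
    rank = [0.15 + 0.85 * s for s in incoming]

print([round(r, 3) for r in rank])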

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
  – Store billions of random walk states in RAM
• Multiple Breadth-First Searches
  – Analyze neighborhood sizes by starting hundreds of random BFSes
• Compute in parallel many different recommender algorithms (or with different parameterizations)
  – See Mayank Mohta and Shu-Hao Yu's Master's project

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin) – UAI 2010

Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors) – VLDB 2012

GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch) – OSDI 2012

DrunkardMob: Billions of Random Walks on Just a PC – ACM RecSys 2013

Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch) – SEA 2014

GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin) – submitted, arXiv

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin) – ICML 2011

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation: GraphChi
• Extended PSW to design Partitioned Adjacency Lists, to build a scalable graph database: GraphChi-DB
• Proposed DrunkardMob for simulating billions of random walks in parallel
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms: a new approach for external-memory graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                     GraphChi (40 Mac Minis)   PowerGraph (64 EC2 cc1.4xlarge)
Investments:         67,320 $                  -
Operating costs:
  Per node / hour:   0.03 $                    1.30 $
  Cluster / hour:    1.19 $                    52.00 $
  Daily:             28.56 $                   1,248.00 $

It takes about 56 days to recoup the Mac Mini investments.

Assumptions:
- Mac Mini: 85 W (typical servers: 500-1000 W)
- Most expensive US energy: 35c / kWh

Equal-throughput configurations (based on OSDI '12).

89

PSW for In-memory Computation

• External memory setting:
  – Slow memory = hard disk / SSD
  – Fast memory = RAM
• In-memory:
  – Slow = RAM
  – Fast = CPU caches

[Diagram: CPU – small (fast) memory of size M – block transfers of size B – large (slow) memory of 'unlimited' size]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

(Ligra: Julian Shun, Guy Blelloch, PPoPP 2013)

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
  – But needs twice the amount of space
• Async / Gauss-Seidel helps with high-diameter graphs
• Some algorithms converge much better asynchronously
  – Loopy BP: see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)
• Asynchronous is sometimes difficult to reason about and debug
• Asynchronous can be used to implement BSP

IO Complexity

• See the paper for theoretical analysis in the Aggarwal-Vitter I/O model
  – Worst case only 2x best case
• Intuition:

[Diagram: vertices 1 ... |V| split into interval(1), interval(2), ..., interval(P), with shard(1), shard(2), ..., shard(P). An edge (v1) whose endpoints are in the same interval is loaded from disk only once / iteration; an edge (v2) spanning two intervals is loaded twice / iteration.]

93
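A back-of-the-envelope version of this intuition, written as a hedged sketch that follows the slide rather than the paper's exact theorem (B = block size, E = number of edges, P = number of intervals):

\[ \frac{E}{B} \;\le\; \text{blocks read per iteration} \;\lesssim\; \frac{2E}{B} \;+\; \Theta(P^2) \]

An edge with both endpoints in one interval is read once (from its own shard), an edge spanning two intervals is read a second time through a sliding window, and opening one window per shard per interval accounts for the Θ(P²) non-sequential reads; writes behave symmetrically, which is why the worst case is only about 2x the best case.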

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise they would require too many iterations
• Per-iteration cost is not much affected
• Can we optimize partitioning?
  – Could help thanks to Gauss-Seidel (faster convergence inside "groups"), topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression Would it help

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
  – But they require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)
• Compression is problematic for evolving graphs and associated data
• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases
  – GOOD, GraphDB, HyperGraphDB…
  – Focus was on modeling; graph storage on top of a relational DB or key-value store
• RDF databases
  – Most do not use graph storage but store triples as relations + use indexing
• Modern solutions have proposed graph-specific storage
  – Neo4j: doubly linked list
  – TurboGraph: adjacency list chopped into pages
  – DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time: 9 hours; FB's: 12 hours
• GraphChi database about 250 GB; FB > 1.4 terabytes
  – However, about 100 GB is explained by different variable data (payload) size
• Facebook: MySQL via JDBC; GraphChi: embedded
  – But MySQL is native code, GraphChi-DB is Scala (JVM)
• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance / size of DB

[Chart: requests / sec vs. number of nodes (1.0E+07 to 1.0E+09)]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory-extension [SSDAlloc, NSDI'11]
   – Too many small objects; need millions / sec
2. Compress the graph structure to fit into RAM [WebGraph framework]
   – Associated values do not compress well and are mutated
3. Cluster the graph and handle each cluster separately in RAM
   – Expensive! The number of inter-cluster edges is big
4. Caching of hot nodes
   – Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance

Multiple hard drives (RAID-ish)

• GraphChi supports striping shards to multiple disks: parallel I/O

[Chart: seconds / iteration for Pagerank and Connected components with 1, 2 and 3 hard drives]

Experiment on a 16-core AMD server (from year 2007).

Bottlenecks

[Chart: Connected Components on Mac Mini (SSD) – runtime with 1, 2 and 4 threads, broken down into disk I/O, graph construction and executing updates]

• The cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
  – Graph construction requires a lot of random access in RAM: memory bandwidth becomes a bottleneck

Bottlenecks / Multicore

Experiment on a MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime (seconds) split into loading and computation vs. number of threads (1, 2, 4), for Matrix Factorization (ALS) and Connected Components]

104

In-memory vs Disk

105

Experiment: Query latency

See thesis for I/O cost analysis of in- and out-queries.

106

Example Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S
• Very fast in GraphChi-DB
  – Sufficient to query for out-edges (see the sketch below)
  – Parallelizes well: multi-out-edge-query
• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods from the graph
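A small sketch of why out-edge queries suffice (illustrative Python with an invented query function, not GraphChi-DB's API): issue one out-edge query per vertex in S and keep only the targets that are also in S.

# Induced subgraph via out-edge queries only (illustrative sketch).
out_edges = {
    1: [2, 3, 9],
    2: [3],
    3: [1, 7],
    7: [2],
}

def query_out_edges(v):
    return out_edges.get(v, [])

def induced_subgraph(S):
    S = set(S)
    # One out-edge query per vertex in S (these parallelize well);
    # an edge (u, v) belongs to the induced subgraph iff both u and v are in S.
    return [(u, v) for u in sorted(S) for v in query_out_edges(u) if v in S]

print(induced_subgraph([1, 2, 3]))   # [(1, 2), (1, 3), (2, 3), (3, 1)]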

Vertices / Nodes

• Vertices are partitioned similarly to edges
  – Similar "data shards" for columns
• Lookup/update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
  – Sparse structure could be supported

[Diagram: vertex id ranges 1 ... a, a+1 ... 2a, ..., Max-id]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance shards
  – Interval length constant a

[Table: original IDs 0, 1, 2, ..., 255 map to internal IDs 0, 1, 2, ...; original IDs 256, 257, 258, ..., 511 map to a, a+1, a+2, ...; the next block maps to 2a, 2a+1, ...]

109
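The exact mapping is not spelled out on the slide; the sketch below assumes a block-striped scheme (block size 256, interval length a) purely for illustration, so that consecutive original IDs are spread across intervals and the shards stay balanced.

# Illustrative ID-mapping sketch (assumed scheme; block size 256, interval length a).
BLOCK = 256
a = 1 << 20          # interval length constant (example value)

def to_internal(orig_id):
    # Block b of 256 consecutive original IDs lands at the start of interval b.
    block, offset = divmod(orig_id, BLOCK)
    return block * a + offset

def to_original(internal_id):
    block, offset = divmod(internal_id, a)
    return block * BLOCK + offset

assert to_internal(0) == 0 and to_internal(255) == 255
assert to_internal(256) == a and to_internal(257) == a + 1
assert to_original(to_internal(123456)) == 123456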

What if we have a Cluster

[Diagram: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state, ...) in RAM]

Trade latency for throughput.

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications
2. … which is caused by a lack of good data available to academics: big graphs with metadata
   – Industry co-operation? But problems with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just the structure)
3. Too much focus on performance. More important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java, 2009.

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems: "paging" from disk is costly. (This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm – reverse thinking:

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk.
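A runnable toy version of the same idea (illustrative, invented names; the real DrunkardMob keeps billions of walk positions in compact arrays and loads sub-graphs with PSW): group the current walk positions by vertex interval, visit one interval at a time, and advance every walk that currently sits in that interval by one hop.

import random
from collections import defaultdict

# Toy DrunkardMob-style sketch: advance walks interval-by-interval so that
# only one sub-graph needs to be "loaded" at a time (illustrative only).
out_edges = {v: [(v + 1) % 100, (v + 7) % 100] for v in range(100)}
INTERVAL_SIZE = 25   # vertex intervals 0-24, 25-49, 50-74, 75-99

walks = {w: random.randrange(100) for w in range(1000)}   # walk id -> position

def run_iteration():
    by_interval = defaultdict(list)
    for w, v in walks.items():
        by_interval[v // INTERVAL_SIZE].append(w)
    for p in sorted(by_interval):
        # "Load" interval p's sub-graph, then hop every walk sitting in it.
        for w in by_interval[p]:
            v = walks[w]
            walks[w] = random.choice(out_edges[v])

for _ in range(10):    # 10 hops per walk
    run_iteration()
print(len(walks), "walks advanced; only each walk's current position is stored")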

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • "Gauss-Seidel" / Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW & External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM), Ref: O'Neil, Cheng et al. (1996)
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact: "Big Data" Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 9: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

9

Computing on Big Graphs

Large-Scale Graph Computation on Just a PC

10

Big Graphs = Big Data

Data size 140 billion connections

asymp 1 TB

Not a problem

Computation

Hard to scale

Twitter network visualization by Akshay Java 2009

11

Research Goal

Compute on graphs with billions of edges in a reasonable time on a single PC

ndash Reasonable = close to numbers previously reported for distributed systems in the literature

Experiment PC Mac Mini (2012)

12

Terminology

bull (Analytical) Graph ComputationndashWhole graph is processed typically for

several iterations vertex-centric computation

ndash Examples Belief Propagation Pagerank Community detection Triangle Counting Matrix Factorization Machine Learninghellip

bull Graph Queries (database)ndash Selective graph queries (compare to SQL

queries)ndash Traversals shortest-path friends-of-

friendshellip

13

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Thesis statement

The Parallel Sliding Windows algorithm and the Partitioned Adjacency Lists data structure enable

computation on very large graphs in external memory on just a personal computer

14

DISK-BASED GRAPH COMPUTATION

15

GraphChi (Parallel Sliding Windows)

Batchcomp

Evolving graph

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Matrix Factorization

Loopy Belief PropagationCo-EM

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

DrunkardMob Parallel Random walk simulation

GraphChi^2

Incremental comp

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Graph sampling

Graph Queries

Graph traversal

16

Computational Model

• Graph G = (V, E)
  – directed edges: e = (source, destination)
  – each edge and vertex is associated with a value (user-defined type)
  – vertex and edge values can be modified
    • (structure modification also supported)

[Figure: vertices and edges, each carrying a Data value]

A --e--> B

Terms: e is an out-edge of A and an in-edge of B

17


Vertex-centric Programming

• "Think like a vertex"
• Popularized by the Pregel and GraphLab projects
  – Historically: systolic computation and the Connection Machine

MyFunc(vertex): { modify neighborhood }

function Pagerank(vertex):
    insum = sum(edge.value for edge in vertex.inedges)
    vertex.value = 0.85 + 0.15 * insum
    foreach edge in vertex.outedges:
        edge.value = vertex.value / vertex.num_outedges

18

Computational Setting

Constraints:

A. Not enough memory to store the whole graph in memory, nor all the vertex values

B. Enough memory to store one vertex and its edges, with associated values

19

The Main Challenge of Disk-based Graph Computation

Random Access

~100K reads/sec (commodity); ~1M reads/sec (high-end arrays)

<< the 5-10 M random edge accesses/sec needed to achieve "reasonable performance"

20

Random Access Problem

File: edge-values

A in-edges | A out-edges | … | B in-edges | B out-edges    (x: value of the edge from A to B)

Moral: You can either access in- or out-edges sequentially, but not both

Processing sequentially therefore causes a random read or a random write for the other direction

21

Our Solution

Parallel Sliding Windows (PSW)

22

Parallel Sliding Windows Phases

• PSW processes the graph one sub-graph at a time

• In one iteration, the whole graph is processed
  – And typically the next iteration is started

1 Load

2 Compute

3 Write

23
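A minimal Python sketch of the three phases (an illustrative toy version: in-memory lists stand in for the on-disk shards, and update is the user's vertex function; this is not GraphChi's actual code):

    def psw_iteration(intervals, shards, update):
        # intervals[p] = set of vertex ids in interval p; shards[p] = in-edges of interval p,
        # each edge a dict {"src": ..., "dst": ..., "value": ...}, kept sorted by src on disk.
        P = len(intervals)
        for p in range(P):
            # 1. Load: memory shard p (all in-edges of interval p) plus a sliding window
            #    of every other shard, containing the out-edges of interval p.
            in_edges = shards[p]
            out_edges = [e for q in range(P) if q != p
                           for e in shards[q] if e["src"] in intervals[p]]
            # 2. Compute: run the update function on each vertex of the interval.
            for v in intervals[p]:
                update(v,
                       [e for e in in_edges if e["dst"] == v],
                       [e for e in out_edges if e["src"] == v])
            # 3. Write: modified blocks are written back to disk (a no-op for lists).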

• Vertices are numbered from 1 to n
  – P intervals
  – sub-graph = interval of vertices

PSW Shards and Intervals

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 … 100 | 101 … 700 | … | … 10000

1 Load

2 Compute

3 Write

<<partition-by>>

edge: source → destination

In shards edges sorted by source

24
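To make the partitioning rule concrete, a small sketch (the interval boundaries are those of the example layout; the function name is ours, not GraphChi's):

    import bisect

    INTERVAL_UPPER_BOUNDS = [100, 700, 1000, 10000]   # P = 4 intervals

    def shard_for_edge(src, dst):
        # An edge goes to the shard of the interval that contains its destination;
        # within the shard, edges are stored sorted by source.
        return bisect.bisect_left(INTERVAL_UPPER_BOUNDS, dst)

    print(shard_for_edge(5, 650))   # 1: destination 650 falls in interval 101..700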

Example Layout

Shard 1

Shards small enough to fit in memory; balance the size of the shards

Shard: in-edges for an interval of vertices, sorted by source-id

in-edges for vertices 1..100, sorted by source_id

Vertices 1..100 | Vertices 101..700 | Vertices 701..1000 | Vertices 1001..10000

Shard 1 | Shard 2 | Shard 3 | Shard 4

1 Load

2 Compute

3 Write

25

Vertices 1..100 | Vertices 101..700 | Vertices 701..1000 | Vertices 1001..10000

Load all in-edges in memory

Load subgraph for vertices 1..100

What about out-edges? Arranged in sequence in the other shards

Shard 1 | Shard 2 | Shard 3 | Shard 4

PSW Loading Sub-graph

Shard 1: in-edges for vertices 1..100, sorted by source_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edges in memory

Load subgraph for vertices 101..700

Shard 1 | Shard 2 | Shard 3 | Shard 4

PSW Loading Sub-graph

Vertices 1..100 | Vertices 101..700 | Vertices 701..1000 | Vertices 1001..10000

Out-edge blocks in memory

in-edges for vertices 1..100, sorted by source_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P² random accesses on one full pass

Works well on both SSD and magnetic hard disks


28
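Stated as a formula (this just restates the count above in external-memory style; nothing new is claimed):

\[
\underbrace{P}_{\text{intervals}} \times \underbrace{P}_{\text{large block reads/writes per interval}} \;=\; P^{2}\ \text{non-sequential accesses per full pass},
\qquad \text{total data moved per pass} = O(|E|).
\]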

"GAUSS-SEIDEL" ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

• Bulk-Synchronous Parallel (Jacobi iterations)
  – Updates see neighbors' values from the previous iteration [Most systems are synchronous]

• Asynchronous (Gauss-Seidel iterations)
  – Updates see the most recent values
  – GraphLab is asynchronous

30

Shard 1

Load all in-edges in memory

Load subgraph for vertices 101..700

Shard 1 | Shard 2 | Shard 3 | Shard 4

PSW runs Gauss-Seidel

Vertices 1..100 | Vertices 101..700 | Vertices 701..1000 | Vertices 1001..10000

Out-edge blocks in memory

in-edges for vertices 1..100, sorted by source_id

Fresh values in this "window" from the previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires diameter-many iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N − 1) / (e − 1) ≈ 60% of synchronous

Each vertex chooses minimum label of neighbor

33
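A tiny self-contained simulation of the minimum-label example above, contrasting a synchronous (Jacobi) sweep with an asynchronous (Gauss-Seidel) sweep on a chain (illustrative only, not GraphChi code):

    import random

    def neighbors(i, n):
        return [j for j in (i - 1, i + 1) if 0 <= j < n]

    def jacobi_sweeps(labels):
        # Synchronous: every vertex sees the previous iteration's values.
        labels, sweeps = list(labels), 0
        while len(set(labels)) > 1:
            labels = [min([labels[i]] + [labels[j] for j in neighbors(i, len(labels))])
                      for i in range(len(labels))]
            sweeps += 1
        return sweeps

    def gauss_seidel_sweeps(labels, seed=0):
        # Asynchronous: vertices are visited in random order and see the freshest values.
        random.seed(seed)
        labels, sweeps = list(labels), 0
        while len(set(labels)) > 1:
            for i in random.sample(range(len(labels)), len(labels)):
                labels[i] = min([labels[i]] + [labels[j] for j in neighbors(i, len(labels))])
            sweeps += 1
        return sweeps

    print(jacobi_sweeps([1, 2, 5, 3, 4]))        # 4 sweeps, as in the synchronous figure
    print(gauss_seidel_sweeps([1, 2, 4, 3, 5]))  # typically fewer sweeps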

Open theoretical question

Label Propagation (side length = 100)

Synchronous vs. PSW Gauss-Seidel (average, random schedule), in iterations:
199 vs ~57
100 vs ~29
298 vs ~52
Natural graphs (web, social): graph diameter − 1  vs  ~0.6 × diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

• PSW is a new technique for implementing many fundamental graph algorithms
  – Especially simple (compared to previous work) for directed graph problems: PSW handles both in- and out-edges

• We propose a new graph contraction algorithm based on PSW
  – Minimum-Spanning Forest & Connected Components

• … utilizing the Gauss-Seidel "acceleration"

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluation:
• HD vs SSD
• Striping data across multiple hard drives
• Comparison to an in-memory version
• Bottlenecks analysis
• Effect of the number of shards
• Block size and performance

37

GraphChi

• C++ implementation: 8,000 lines of code
  – Java implementation also available

• Several optimizations to PSW (see paper)

Source code and examples: http://github.com/graphchi

38

Experiment Setting

• Mac Mini (Apple Inc.)
  – 8 GB RAM
  – 256 GB SSD, 1 TB hard drive
  – Intel Core i5, 2.5 GHz

bull Experiment graphs

Graph | Vertices | Edges | P (shards) | Preprocessing

live-journal | 4.8M | 69M | 3 | 0.5 min

netflix | 0.5M | 99M | 20 | 1 min

twitter-2010 | 42M | 1.5B | 20 | 2 min

uk-2007-05 | 106M | 3.7B | 40 | 31 min

uk-union | 133M | 5.4B | 50 | 33 min

yahoo-web | 1.4B | 6.6B | 50 | 37 min

39

Comparison to Existing Systems

Notes: comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all systems but GraphLab compute synchronously.

See the paper for more comparisons.

[Bar charts: runtime in minutes, GraphChi on a Mac Mini vs. distributed systems]
• PageRank, Twitter-2010 (1.5B edges): Spark (50 machines) vs GraphChi (Mac Mini)
• WebGraph Belief Propagation (U. Kang et al.), Yahoo-web (6.7B edges): Pegasus / Hadoop (100 machines) vs GraphChi (Mac Mini)
• Matrix Factorization (Alt. Least Sqr.), Netflix (99M edges): GraphLab v1 (8 cores) vs GraphChi (Mac Mini)
• Triangle Counting, twitter-2010 (1.5B edges): Hadoop (1636 machines) vs GraphChi (Mac Mini)

On a Mac Mini, GraphChi can solve as big problems as existing large-scale systems, with comparable performance.

40

PowerGraph Comparison

• PowerGraph / GraphLab 2 outperforms previous systems by a wide margin on natural graphs

• With 64x more machines, 512x more CPUs:
  – Pagerank: 40x faster than GraphChi
  – Triangle counting: 30x faster than GraphChi

OSDI '12

GraphChi has good performance / CPU

[Chart: PowerGraph (GraphLab 2) vs GraphChi]

41

In-memory Comparison

• Comparison to state-of-the-art: 5 iterations of Pagerank on Twitter (1.5B edges)

GraphChi, Mac Mini (SSD): 790 secs
Ligra (J. Shun, Blelloch), 40-core Intel E7-8870: 15 secs

However, sometimes a better algorithm is available for in-memory than for external-memory / distributed settings.

• Total runtime comparison to 1-shard GraphChi, with initial load + output write taken into account:

Ligra (J. Shun, Blelloch), 8-core Xeon 5550: 230 s + preproc 144 s
PSW, in-mem version, 700 shards (see Appendix), 8-core Xeon 5550: 100 s + preproc 210 s

42

Scalability Input Size [SSD]

• Throughput: number of edges processed / second

Conclusion: the throughput remains roughly constant when the graph size is increased.

GraphChi with a hard drive is ~2x slower than with an SSD (if the computational cost is low).

[Plot: PageRank throughput (edges/sec) vs. graph size, Mac Mini SSD; graphs: domain, twitter-2010, uk-2007-05, uk-union, yahoo-web]

Paper: scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

• What if there is a lot of metadata associated with edges and vertices?

• How to do graph queries efficiently, while retaining computational capabilities?

• How to add edges efficiently to the graph?

Can we design a graph database based on GraphChi?

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problems:
• Poor performance with data >> memory
• No/weak support for analytical computation

2) Relational / key-value databases as graph storage

Problems:
• Large indices
• In-edge / out-edge dilemma
• No/weak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1: sorted by source_id

Shard 1 | Shard 2 | Shard 3 | … | Shard P

0 … 100 | 101 … 700 | … | N

Edge = (src, dst). Partition by dst.

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find the in-edges of a vertex quickly?

Note: We know the shard, but the edges are in random order within it.

Pointer-array | Edge-array

51
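For intuition, a toy version of the pointer-array / edge-array lookup (plain Python lists instead of on-disk files; the numbers are those of the example shard above):

    import bisect

    pointers = [(1, 0), (3, 3), (7, 5)]    # (source_id, offset of its first edge), sorted by source
    edges = [8, 193, 76420, 12, 872, 193, 212, 89139]   # destination column of the edge-array

    def out_edges_in_shard(src):
        ids = [s for s, _ in pointers]
        i = bisect.bisect_left(ids, src)
        if i == len(ids) or ids[i] != src:
            return []                       # src has no edges stored in this shard
        start = pointers[i][1]
        end = pointers[i + 1][1] if i + 1 < len(pointers) else len(edges)
        return edges[start:end]

    print(out_edges_in_shard(3))            # [12, 872]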

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly?

Note: Sorted inside a shard, but partitioned across all shards

Augmented linked list for in-edges

53
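A toy illustration of the in-edge chain (array indices stand in for file offsets; the real shard packs the link next to each edge record):

    # Each edge-array entry: (destination, link to the next in-edge of the same destination, or -1).
    edge_array = [(8, -1), (193, 5), (76420, -1), (12, -1), (872, -1), (193, -1), (212, -1), (89139, -1)]
    first_in_edge = {8: 0, 12: 3, 193: 1}   # per-interval index to the first in-edge of each vertex

    def in_edges(vertex):
        result, i = [], first_in_edge.get(vertex, -1)
        while i != -1:
            result.append(i)        # the offset also locates the edge's column data
            i = edge_array[i][1]    # follow the chain to the next in-edge
        return result

    print(in_edges(193))            # [1, 5]: both edges pointing to vertex 193 in this shard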

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem: the pointer-array can be big, O(V). Binary search on disk is slow.

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1: Sparse index (in-memory)

Option 2: Delta-coding with a unary code (Elias-Gamma); completely in-memory

54
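For intuition, the textbook Elias-gamma code applied to the gaps between successive source-ids (a sketch of Option 2; GraphChi-DB's actual encoder may differ in details):

    def elias_gamma_encode(n):           # n >= 1
        b = bin(n)[2:]                   # binary representation without leading zeros
        return "0" * (len(b) - 1) + b    # unary length prefix + the value itself

    def elias_gamma_decode(bits):
        zeros = 0
        while bits[zeros] == "0":
            zeros += 1
        value = int(bits[zeros:2 * zeros + 1], 2)
        return value, bits[2 * zeros + 1:]

    deltas = [1, 2, 4]                   # gaps between sorted source-ids 1, 3, 7
    stream = "".join(elias_gamma_encode(d) for d in deltas)
    print(stream)                        # "101000100"
    out, rest = [], stream
    while rest:
        v, rest = elias_gamma_decode(rest)
        out.append(v)
    print(out)                           # [1, 2, 4]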

Experiment Indices

Median latency, Twitter graph, 1.5B edges

55

Queries: I/O costs

In-edge query: only one shard
Out-edge query: each shard that has edges

Trade-off: more shards gives better locality for in-edge queries, but worse for out-edge queries

Edge Data & Searches

Shard X: adjacency | 'weight' [float] | 'timestamp' [long] | 'belief' (factor)

Edge (i)

No foreign key is required to find the edge data (fixed-size records)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading the existing shard from disk: every edge is rewritten on each merge.

Does not scale: number of rewrites = data size / buffer size = O(E)

60

Log-Structured Merge-tree (LSM)

Ref: O'Neil, Cheng et al. (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 | interval 2 | interval 3 | interval 4 | … | interval P-1 | interval P

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query: one shard on each level

Out-edge query: all shards

61
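A simplified sketch of the log-structured merge idea behind ingest (illustrative only; real GraphChi-DB merges per-interval shard files on disk, here plain Python lists stand in for them):

    # New edges accumulate in an in-memory buffer; a full buffer is flushed as a sorted
    # run and merged downstream, so levels hold exponentially more edges and an edge is
    # rewritten only O(log E) times instead of O(E / buffer_size) times.
    BUFFER_CAPACITY = 4

    def flush(buffer, levels):
        run = sorted(buffer)              # edges sorted as in a shard
        buffer.clear()
        for i, level in enumerate(levels):
            if not level:                 # free slot on this level: place the run here
                levels[i] = run
                return
            run = sorted(level + run)     # downstream merge with the existing run
            levels[i] = []
        levels.append(run)                # everything was full: grow a new, oldest level

    buffer, levels = [], []
    for edge in [(3, 12), (1, 8), (7, 193), (1, 193), (3, 872), (7, 212), (1, 76420), (7, 89139)]:
        buffer.append(edge)
        if len(buffer) == BUFFER_CAPACITY:
            flush(buffer, levels)
    print([len(l) for l in levels])       # e.g. [0, 8]: level sizes grow geometrically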

Experiment Ingest

[Plot: edges inserted (up to 1.5e9) vs. time in seconds, GraphChi-DB with LSM vs. GraphChi-No-LSM]

62

Advantages of PAL

• Only sparse and implicit indices
  – The pointer-array usually fits in RAM with Elias-Gamma coding. Small database size.

• Columnar data model
  – Load only the data you need
  – Graph structure is separate from the data
  – Property graph model

• Great insertion throughput with LSM
  – The tree can be adjusted to match the workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

• Written in Scala
• Queries & Computation
• Online database

All experiments shown in this talk were done on a Mac Mini (8 GB, SSD)

Source code and examples: http://github.com/graphchi

65

Comparison Database Size

Baseline: 4 + 4 bytes / edge

Compared: MySQL (data + indices), Neo4j, GraphChi-DB, BASELINE

[Bar chart: database file size for the twitter-2010 graph (1.5B edges), scale 0-70]

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch): 3 hours 30 minutes (including index creation)

If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

[Bar charts: friends-of-friends query latency, in milliseconds]
Small graph, 50th percentile: GraphChi-DB 0.379 ms, Neo4j 0.127 ms
Small graph, 99th percentile: GraphChi-DB 6.653 ms, Neo4j 8.078 ms
Big graph, 50th percentile (scale 0-200 ms): GraphChi-DB, Neo4j, MySQL
Big graph, 99th percentile: GraphChi-DB 1264 ms, Neo4j 1631 ms, MySQL 4776 ms

Latency percentiles over 100K random queries

68M edges

1.5B edges

68

LinkBench Online Graph DB Benchmark by Facebook

• Concurrent read/write workload
  – But only single-hop queries ("friends")
  – 8 different operations, mixed workload
  – Best performance with 64 parallel threads

• Each edge and vertex has:
  – Version, timestamp, type, random string payload

                          GraphChi-DB (Mac Mini)   MySQL + FB patch (server: SSD-array, 144 GB RAM)
Edge-update (95p)         22 ms                    25 ms
Edge-get-neighbors (95p)  18 ms                    9 ms
Avg throughput            2,487 req/s              11,029 req/s
Database size             350 GB                   1.4 TB

See full results in the thesis

69

Summary of Experiments

• Efficient for a mixed read/write workload
  – See the Facebook LinkBench experiments in the thesis
  – LSM-tree trades off read performance (but can be adjusted)

• State-of-the-art performance for graphs that are much larger than RAM
  – Neo4j's linked-list data structure is good for RAM
  – DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)
  – Two major direct descendant papers in top conferences:
    • X-Stream (SOSP 2013)
    • TurboGraph (KDD 2013)

• Challenging the mainstream
  – You can do a lot on just a PC: focus on the right data structures and computational models

73

Impact Users

• GraphChi's implementations (C++, Java, -DB) have gained a lot of users
  – Currently ~50 unique visitors / day

• Enables 'everyone' to tackle big graph problems
  – Especially the recommender toolkit (by Danny Bickson) has been very popular
  – Typical users: students, non-systems researchers, small companies…

74

Impact Users (cont)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research - it is too expensive.

Using GraphChi I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research."

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1. Changing the value of an edge (both in- and out-edges)

2. Computation processes the whole graph, or most of it, on each iteration

3. Random access to all of a vertex's edges
  – Vertex-centric vs. edge-centric

4. Async / Gauss-Seidel execution

u v

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
  – Or dynamic scheduling (such as Splash BP)

• High-diameter graphs, such as planar graphs
  – Unless the computation itself has short-range interactions

• Very large number of iterations
  – Neural networks
  – LDA with Collapsed Gibbs sampling

• No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

• Distributed setting
  1. Distributed PSW (one shard / node)
     1. PSW is inherently sequential
     2. Low bandwidth in the Cloud
  2. Co-operating GraphChi(-DB)'s connected with a Parameter Server

• New graph programming models and tools
  – Vertex-centric programming is sometimes too local: for example, two-hop interactions and many traversals are cumbersome
  – Abstractions for learning graph structure; implicit graphs
  – Hard to debug, especially async; better tools needed

• Graph-aware optimizations to GraphChi-DB
  – Buffer management
  – Smart caching
  – Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM
  – Good: works with very little memory
  – Bad: does not benefit from more memory

• Simple trick: cache some data

• Many graph algorithms have O(V) state
  – The update function accesses neighbor vertex state
    • Standard PSW: 'broadcast' the vertex value via edges
    • Semi-external: store vertex values in memory

82

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
  – But not enough to store the whole graph

[Diagram: graph and PSW working memory on disk; computational state (Computation 1 state, Computation 2 state) in RAM]

83
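A minimal sketch of the semi-external trick (illustrative; the attribute names on the vertex object are placeholders, not GraphChi's API): the O(V) vertex values live in an in-memory array indexed by vertex id, while the edges are still streamed from disk by PSW.

    import numpy as np

    NUM_VERTICES = 4_200_000                                  # e.g. twitter-2010 scale
    values = np.full(NUM_VERTICES, 1.0 / NUM_VERTICES,
                     dtype=np.float32)                        # O(V) state in RAM (~17 MB)

    def update(vertex):
        # Called by the PSW loop while edges are streamed from disk; neighbor state is
        # read directly from the in-memory array instead of being broadcast via edges.
        insum = sum(values[src] for src in vertex.in_neighbor_ids)   # placeholder attribute
        values[vertex.id] = 0.15 + 0.85 * insum                      # Pagerank-style combine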

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
  – Store billions of random-walk states in RAM

• Multiple Breadth-First Searches
  – Analyze neighborhood sizes by starting hundreds of random BFSes

• Compute many different recommender algorithms in parallel (or with different parameterizations)
  – See Mayank Mohta and Shu-Hao Yu's Master's project

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                 GraphChi (40 Mac Minis)   PowerGraph (64 EC2 cc1.4xlarge)

Investments      67,320 $                  -

Operating costs:

Per node, hour   0.03 $                    1.30 $

Cluster, hour    1.19 $                    52.00 $

Daily            28.56 $                   1,248.00 $

It takes about 56 days to recoup the Mac Mini investments.

Assumptions:
- Mac Mini: 85 W (typical servers: 500-1000 W)
- Most expensive US energy: 35c / kWh

Equal-throughput configurations (based on OSDI '12)

89
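As a sanity check on the payback figure (using the decimal readings restored above, which are an interpretation of the transcript):

\[
\frac{\$67{,}320}{\$1{,}248.00/\text{day} - \$28.56/\text{day}}
 \;=\; \frac{67{,}320}{1{,}219.44} \;\approx\; 55.2 \ \text{days} \;\approx\; 56 \ \text{days}.
\]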

PSW for In-memory Computation

• External memory setting
  – Slow memory = hard disk / SSD
  – Fast memory = RAM

• In-memory
  – Slow = RAM
  – Fast = CPU caches

[Diagram: CPU, small (fast) memory of size M, large (slow) memory of 'unlimited' size, block transfers of size B]

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
  – But needs twice the amount of space

• Async / G-S helps with high-diameter graphs
• Some algorithms converge much better asynchronously
  – Loopy BP: see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

• See the paper for a theoretical analysis in Aggarwal-Vitter's I/O model
  – Worst case only 2x the best case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 … |V|   (v1, v2)

An edge inside one interval is loaded from disk only once per iteration.

An edge spanning two intervals is loaded twice per iteration.

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise they would require too many iterations

• Per-iteration cost is not much affected

• Can we optimize partitioning?
  – Could help thanks to Gauss-Seidel (faster convergence inside "groups"); topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression Would it help

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3-4 bits/edge (web), ~10 bits/edge (social)
  – But they require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)

• Compression is problematic for evolving graphs and associated data

• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases
  – GOOD, GraphDB, HyperGraphDB…
  – Focus was on modeling; graph storage on top of a relational DB or key-value store

• RDF databases
  – Most do not use graph storage, but store triples as relations + use indexing

• Modern solutions have proposed graph-specific storage
  – Neo4j: doubly linked list
  – TurboGraph: adjacency list chopped into pages
  – DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

[Plot: requests/sec (0-10,000) vs. number of nodes (1e7 to 1e9)]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory-extension [SSDAlloc, NSDI '11]
   → Too many small objects; need millions / sec

2. Compress the graph structure to fit into RAM [WebGraph framework]
   → Associated values do not compress well, and are mutated

3. Cluster the graph and handle each cluster separately in RAM
   → Expensive: the number of inter-cluster edges is big

4. Caching of hot nodes
   → Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks: parallel I/O

[Bar chart: seconds / iteration (0-3000) for Pagerank and Connected components, with 1, 2, and 3 hard drives]

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

[Bar chart: runtime breakdown (Disk IO, Graph construction, Exec. updates) with 1, 2, and 4 threads; Connected Components on Mac Mini, SSD]

• The cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
  – Graph construction requires a lot of random access in RAM: memory bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

• Computationally intensive applications benefit substantially from parallel execution

• GraphChi saturates SSD I/O with 2 threads

[Bar charts: runtime in seconds (Loading vs. Computation) vs. number of threads (1, 2, 4), for Matrix Factorization (ALS) and Connected Components]

104

In-memory vs Disk

105

Experiment Query latency

See the thesis for an I/O cost analysis of in/out queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

• Very fast in GraphChi-DB
  – Sufficient to query for out-edges
  – Parallelizes well: multi-out-edge-query

• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods from the graph

Vertices / Nodes

• Vertices are partitioned similarly as edges
  – Similar "data shards" for columns

• Lookup / update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
  – Sparse structure could be supported

[Diagram: internal vertex-id space 1 … a, a+1 … 2a, …, Max-id]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance the shards
  – Interval length: a constant a

[Table: original IDs 0, 1, 2, …, 255 map to internal IDs 0, 1, 2, …; original IDs 256, 257, 258, …, 511 map to a, a+1, a+2, …; the next block to 2a, 2a+1, …]

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2. … which is caused by the lack of good data available to academics: big graphs with metadata
  – Industry co-operation, but a problem with reproducibility
  – Also, it is hard to ask good questions about graphs (especially with just the structure)

3. Too much focus on performance; more important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems: each hop across a partition boundary is costly

Disk-based "single-machine" graph systems: "paging" from disk is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
  – Reverse thinking

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk

Page 10: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

10

Big Graphs = Big Data

Data size 140 billion connections

asymp 1 TB

Not a problem

Computation

Hard to scale

Twitter network visualization by Akshay Java 2009

11

Research Goal

Compute on graphs with billions of edges in a reasonable time on a single PC

ndash Reasonable = close to numbers previously reported for distributed systems in the literature

Experiment PC Mac Mini (2012)

12

Terminology

bull (Analytical) Graph ComputationndashWhole graph is processed typically for

several iterations vertex-centric computation

ndash Examples Belief Propagation Pagerank Community detection Triangle Counting Matrix Factorization Machine Learninghellip

bull Graph Queries (database)ndash Selective graph queries (compare to SQL

queries)ndash Traversals shortest-path friends-of-

friendshellip

13

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Thesis statement

The Parallel Sliding Windows algorithm and the Partitioned Adjacency Lists data structure enable

computation on very large graphs in external memory on just a personal computer

14

DISK-BASED GRAPH COMPUTATION

15

GraphChi (Parallel Sliding Windows)

Batchcomp

Evolving graph

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Matrix Factorization

Loopy Belief PropagationCo-EM

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

DrunkardMob Parallel Random walk simulation

GraphChi^2

Incremental comp

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Graph sampling

Graph Queries

Graph traversal

16

Computational Model

bull Graph G = (V E) ndash directed edges e =

(source destination)ndash each edge and vertex

associated with a value (user-defined type)

ndash vertex and edge values can be modifiedbull (structure modification also

supported) Data

Data Data

Data

DataDataData

DataData

Data

A Be

Terms e is an out-edge of A and in-edge of B

17

Data

Data Data

Data

DataDataData

Data

Data

Data

Vertex-centric Programming

bull ldquoThink like a vertexrdquobull Popularized by the Pregel and GraphLab

projectsndash Historically systolic computation and the Connection Machine

MyFunc(vertex) modify neighborhood Data

Data Data

Data

Data

function Pagerank(vertex) insum = sum(edgevalue for edge in vertexinedges) vertexvalue = 085 + 015 insum foreach edge in vertexoutedges edgevalue = vertexvalue vertexnum_outedges

18

Computational Setting

ConstraintsA Not enough memory to store the whole

graph in memory nor all the vertex values

B Enough memory to store one vertex and its edges w associated values

19

The Main Challenge of Disk-based Graph Computation

Random Access

~ 100K reads sec (commodity)~ 1M reads sec (high-end arrays)

ltlt 5-10 M random edges sec to achieve ldquoreasonable performancerdquo

20

Random Access Problem

File edge-values

A in-edges A out-edges

A B

B in-edges B out-edges

x

Moral You can either access in- or out-edges sequentially but not both

Random write

Random readProcessing sequentially

21

Our Solution

Parallel Sliding Windows (PSW)

22

Parallel Sliding Windows Phases

bull PSW processes the graph one sub-graph a time

bull In one iteration the whole graph is processedndash And typically next iteration is started

1 Load

2 Compute

3 Write

23

bull Vertices are numbered from 1 to nndash P intervalsndash sub-graph = interval of vertices

PSW Shards and Intervals

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 10000100 700

1 Load

2 Compute

3 Write

ltltpartition-bygtgt

source destinationedge

In shards edges sorted by source

24

Example Layout

Shard 1

Shards small enough to fit in memory balance size of shards

Shard in-edges for interval of vertices sorted by source-id

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Shard 2 Shard 3 Shard 4Shard 1

1 Load

2 Compute

3 Write

25

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Load all in-edgesin memory

Load subgraph for vertices 1100

What about out-edges Arranged in sequence in other shards

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Shard 1

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

bull Bulk-Synchronous Parallel (Jacobi iterations)ndash Updates see neighborsrsquo values from

previous iteration [Most systems are synchronous]

bull Asynchronous (Gauss-Seidel iterations)ndash Updates see most recent valuesndash GraphLab is asynchronous

30

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW runs Gauss-Seidel

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Fresh values in this ldquowindowrdquo from previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph diameter ndashmany iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N - 1) (e ndash 1) asymp 60 of synchronous

Each vertex chooses minimum label of neighbor

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

• External-memory setting
  – Slow memory = hard disk / SSD
  – Fast memory = RAM

• In-memory
  – Slow = RAM
  – Fast = CPU caches

[Figure: the classic external-memory model: a small (fast) memory of size M, a
large (slow) memory of 'unlimited' size, and the CPU moving data between them in
block transfers of size B]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

Julian Shun, Guy Blelloch, PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
  – But needs twice the amount of space (see the sketch below)

• Async / Gauss-Seidel helps with high-diameter graphs

• Some algorithms converge much better asynchronously
  – Loopy BP: see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)

• Asynchronous is sometimes difficult to reason about and debug

• Asynchronous can be used to implement BSP

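To make the space remark above concrete, a toy Python contrast of the two schedules on minimum-label propagation over a 5-vertex chain (an illustration only; the names and the single-sweep driver are made up for this example). The synchronous sweep must keep a second O(V) array of old values, while the Gauss-Seidel sweep updates one array in place and immediately reuses fresh values; note that how much the in-place sweep gains depends on the sweep order.

    edges = [(i, i + 1) for i in range(4)]        # chain 0-1-2-3-4
    labels = [0, 1, 2, 3, 4]

    def jacobi_sweep(lab):
        new = list(lab)                           # second O(V) array needed
        for u, v in edges:
            new[u] = min(new[u], lab[v])          # reads only old values
            new[v] = min(new[v], lab[u])
        return new

    def gauss_seidel_sweep(lab):
        for u, v in edges:                        # in place: one O(V) array,
            m = min(lab[u], lab[v])               # fresh values visible at once
            lab[u] = lab[v] = m
        return lab

    print(jacobi_sweep(labels))        # [0, 0, 1, 2, 3]: min advances one hop
    print(gauss_seidel_sweep(labels))  # [0, 0, 0, 0, 0]: done in one sweep here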
IO Complexity

• See the paper for a theoretical analysis in Aggarwal and Vitter's I/O model
  – Worst case is only 2x the best case

• Intuition:

[Figure: the vertex range 1..|V| split into interval(1), interval(2), ..., interval(P),
each with its shard(1), shard(2), ..., shard(P); example vertices v1 and v2]

An edge within a single interval is loaded from disk only once per iteration;
an edge spanning two intervals is loaded twice per iteration.

93
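Stated as a rough formula (a restatement of the intuition above, not the exact theorem; see the thesis for the precise Aggarwal-Vitter analysis): counting one read and one write-back of every edge value per iteration, and charging edges that span two intervals twice, one full PSW iteration over edge data stored in blocks of size B costs

    2|E| / B  ≤  block transfers per iteration  ≤  4|E| / B,
    plus Θ(P²) non-sequential seeks,

so the worst case is only a factor of two above the best case, matching the bullet above.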

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise they would require too many iterations

• Per-iteration cost is not much affected

• Can we optimize partitioning?
  – Could help thanks to Gauss-Seidel (faster convergence inside "groups"), topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., WebGraph framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
  – But they require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)

• Compression is problematic for evolving graphs and associated data

• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases
  – GOOD, GraphDB, HyperGraphDB, ...
  – Focus was on modeling; graph storage on top of a relational DB or key-value store

• RDF databases
  – Most do not use graph storage but store triples as relations + use indexing

• Modern solutions have proposed graph-specific storage
  – Neo4j: doubly linked list
  – TurboGraph: adjacency list chopped into pages
  – DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time: 9 hours; FB's: 12 hours

• GraphChi database: about 250 GB; FB: > 14 terabytes
  – However, about 100 GB is explained by different variable data (payload) size

• Facebook: MySQL via JDBC; GraphChi: embedded
  – But MySQL is native code, GraphChi-DB is Scala (JVM)

• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance vs. size of DB

[Figure: requests / sec (0 to 10,000) as a function of the number of nodes
(1.00E+07 to 1.00E+09)]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

Each option is paired below with its main drawback:

1. Use SSD as a memory-extension [SSDAlloc, NSDI'11]
   – Too many small objects; need millions / sec

2. Compress the graph structure to fit into RAM [WebGraph framework]
   – Associated values do not compress well, and are mutated

3. Cluster the graph and handle each cluster separately in RAM
   – Expensive: the number of inter-cluster edges is big

4. Caching of hot nodes
   – Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks: parallel I/O

[Figure: seconds / iteration (0 to 3000) for Pagerank and Connected components,
with 1, 2, and 3 hard drives]

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

[Figure: runtime (0 to 2500 s) broken into Disk IO, Graph construction, and Exec
updates, for 1, 2, and 4 threads; Connected Components on Mac Mini SSD]

• Cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
  – Graph construction requires a lot of random access in RAM: memory bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

• Computationally intensive applications benefit substantially from parallel execution

• GraphChi saturates SSD I/O with 2 threads

[Figure: runtime in seconds, split into Loading and Computation, versus the number
of threads (1, 2, 4); left: Matrix Factorization (ALS), right: Connected Components]

104

In-memory vs Disk

105

Experiment Query latency

See thesis for I/O cost analysis of in/out queries

106

Example Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S

• Very fast in GraphChi-DB
  – Sufficient to query for out-edges (see the sketch below)
  – Parallelizes well: multi-out-edge-query

• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods from the graph

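A sketch of the out-edge-only strategy (hypothetical Python; out_edges(v) stands in for a GraphChi-DB out-edge query and is an assumption for illustration): since every edge of the induced subgraph leaves a vertex of S, it is enough to query the out-edges of each vertex in S and keep the endpoints that are also in S.

    def induced_subgraph(S, out_edges):
        members = set(S)                      # the vertex set S
        sub_edges = []
        for v in sorted(members):             # one out-edge query per vertex;
            for dst in out_edges(v):          # these queries parallelize well
                if dst in members:            # keep edges with both ends in S
                    sub_edges.append((v, dst))
        return sub_edges

    # toy usage with an in-memory adjacency map standing in for the database
    adj = {0: [1, 5], 1: [2], 2: [0], 5: [2]}
    print(induced_subgraph({0, 1, 2}, lambda v: adj.get(v, [])))
    # -> [(0, 1), (1, 2), (2, 0)]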
Vertices / Nodes

• Vertices are partitioned similarly as edges
  – Similar "data shards" for columns

• Lookup / update of vertex data is O(1)

• No merge tree here: vertex files are "dense"
  – Sparse structure could be supported

[Figure: vertex-id axis 1, ..., a, a+1, ..., 2a, ..., Max-id]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance shards (one possible reconstruction of the mapping is sketched below)
  – Interval length: constant a

[Figure: original IDs 0, 1, 2, ..., 255 map to internal IDs 0, 1, 2, ...;
original IDs 256, 257, 258, ..., 511 map to internal IDs a, a+1, a+2, ...;
the next block to 2a, 2a+1, ..., and so on]

109
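One possible reconstruction of the mapping pictured above (an assumption for illustration, not necessarily the exact scheme used in GraphChi-DB): deal consecutive blocks of 256 original IDs round-robin over the P intervals, so that a dense range of original IDs is spread evenly across shards; with interval length a, block b of original IDs lands at offset (b div P) * 256 inside interval (b mod P).

    # Hypothetical reconstruction of the balancing ID map (assumed scheme:
    # blocks of 256 consecutive original ids dealt round-robin over P
    # intervals of length a).
    BLOCK = 256

    def to_internal(orig, P, a):
        b, off = divmod(orig, BLOCK)          # block index, offset inside block
        return (b % P) * a + (b // P) * BLOCK + off

    def to_original(internal, P, a):
        interval, pos = divmod(internal, a)   # which interval, position in it
        rnd, off = divmod(pos, BLOCK)
        return (rnd * P + interval) * BLOCK + off

    # round-trip check on a toy configuration: 4 intervals of length 1024
    P, a = 4, 1024
    assert all(to_original(to_internal(i, P, a), P, a) == i for i in range(4096))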

What if we have a Cluster

[Figure: as in the semi-external setting, each machine keeps the graph and the PSW
working memory on disk and the computational state (Computation 1 state,
Computation 2 state, ...) in RAM]

Trade latency for throughput

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications

2. ... which is caused by lack of good data available to academics: big graphs with metadata
   – Industry co-operation? But problem with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just the structure)

3. Too much focus on performance. More important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk a time (multiple in parallel, of course):

  parfor walk in walks:
      for i = 1 to numsteps:
          vertex = walk.atVertex()
          walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java 2009

Distributed graph systems:
- Each hop across a partition boundary is costly

Disk-based "single-machine" graph systems:
- "Paging" from disk is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
  – Reverse thinking:

  ForEach interval p:
      walkSnapshot = getWalksForInterval(p)
      ForEach vertex in interval(p):
          mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
          ForEach walk in mywalks:
              walkManager.addHop(walk, vertex.randomNeighbor())

Note: need to store only the current position of each walk.

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • "Gauss-Seidel" Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW & External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact: "Big Data" Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 11: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

11

Research Goal

Compute on graphs with billions of edges in a reasonable time on a single PC

ndash Reasonable = close to numbers previously reported for distributed systems in the literature

Experiment PC Mac Mini (2012)

12

Terminology

bull (Analytical) Graph ComputationndashWhole graph is processed typically for

several iterations vertex-centric computation

ndash Examples Belief Propagation Pagerank Community detection Triangle Counting Matrix Factorization Machine Learninghellip

bull Graph Queries (database)ndash Selective graph queries (compare to SQL

queries)ndash Traversals shortest-path friends-of-

friendshellip

13

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Thesis statement

The Parallel Sliding Windows algorithm and the Partitioned Adjacency Lists data structure enable

computation on very large graphs in external memory on just a personal computer

14

DISK-BASED GRAPH COMPUTATION

15

GraphChi (Parallel Sliding Windows)

Batchcomp

Evolving graph

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Matrix Factorization

Loopy Belief PropagationCo-EM

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

DrunkardMob Parallel Random walk simulation

GraphChi^2

Incremental comp

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Graph sampling

Graph Queries

Graph traversal

16

Computational Model

bull Graph G = (V E) ndash directed edges e =

(source destination)ndash each edge and vertex

associated with a value (user-defined type)

ndash vertex and edge values can be modifiedbull (structure modification also

supported) Data

Data Data

Data

DataDataData

DataData

Data

A Be

Terms e is an out-edge of A and in-edge of B

17

Data

Data Data

Data

DataDataData

Data

Data

Data

Vertex-centric Programming

bull ldquoThink like a vertexrdquobull Popularized by the Pregel and GraphLab

projectsndash Historically systolic computation and the Connection Machine

MyFunc(vertex) modify neighborhood Data

Data Data

Data

Data

function Pagerank(vertex) insum = sum(edgevalue for edge in vertexinedges) vertexvalue = 085 + 015 insum foreach edge in vertexoutedges edgevalue = vertexvalue vertexnum_outedges

18

Computational Setting

ConstraintsA Not enough memory to store the whole

graph in memory nor all the vertex values

B Enough memory to store one vertex and its edges w associated values

19

The Main Challenge of Disk-based Graph Computation

Random Access

~ 100K reads sec (commodity)~ 1M reads sec (high-end arrays)

ltlt 5-10 M random edges sec to achieve ldquoreasonable performancerdquo

20

Random Access Problem

File edge-values

A in-edges A out-edges

A B

B in-edges B out-edges

x

Moral You can either access in- or out-edges sequentially but not both

Random write

Random readProcessing sequentially

21

Our Solution

Parallel Sliding Windows (PSW)

22

Parallel Sliding Windows Phases

bull PSW processes the graph one sub-graph a time

bull In one iteration the whole graph is processedndash And typically next iteration is started

1 Load

2 Compute

3 Write

23

bull Vertices are numbered from 1 to nndash P intervalsndash sub-graph = interval of vertices

PSW Shards and Intervals

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 10000100 700

1 Load

2 Compute

3 Write

ltltpartition-bygtgt

source destinationedge

In shards edges sorted by source

24

Example Layout

Shard 1

Shards small enough to fit in memory balance size of shards

Shard in-edges for interval of vertices sorted by source-id

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Shard 2 Shard 3 Shard 4Shard 1

1 Load

2 Compute

3 Write

25

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Load all in-edgesin memory

Load subgraph for vertices 1100

What about out-edges Arranged in sequence in other shards

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Shard 1

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

bull Bulk-Synchronous Parallel (Jacobi iterations)ndash Updates see neighborsrsquo values from

previous iteration [Most systems are synchronous]

bull Asynchronous (Gauss-Seidel iterations)ndash Updates see most recent valuesndash GraphLab is asynchronous

30

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW runs Gauss-Seidel

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Fresh values in this ldquowindowrdquo from previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph diameter ndashmany iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N - 1) (e ndash 1) asymp 60 of synchronous

Each vertex chooses minimum label of neighbor

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 12: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

12

Terminology

bull (Analytical) Graph ComputationndashWhole graph is processed typically for

several iterations vertex-centric computation

ndash Examples Belief Propagation Pagerank Community detection Triangle Counting Matrix Factorization Machine Learninghellip

bull Graph Queries (database)ndash Selective graph queries (compare to SQL

queries)ndash Traversals shortest-path friends-of-

friendshellip

13

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Thesis statement

The Parallel Sliding Windows algorithm and the Partitioned Adjacency Lists data structure enable

computation on very large graphs in external memory on just a personal computer

14

DISK-BASED GRAPH COMPUTATION

15

GraphChi (Parallel Sliding Windows)

Batchcomp

Evolving graph

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Matrix Factorization

Loopy Belief PropagationCo-EM

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

DrunkardMob Parallel Random walk simulation

GraphChi^2

Incremental comp

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Graph sampling

Graph Queries

Graph traversal

16

Computational Model

bull Graph G = (V E) ndash directed edges e =

(source destination)ndash each edge and vertex

associated with a value (user-defined type)

ndash vertex and edge values can be modifiedbull (structure modification also

supported) Data

Data Data

Data

DataDataData

DataData

Data

A Be

Terms e is an out-edge of A and in-edge of B

17

Data

Data Data

Data

DataDataData

Data

Data

Data

Vertex-centric Programming

bull ldquoThink like a vertexrdquobull Popularized by the Pregel and GraphLab

projectsndash Historically systolic computation and the Connection Machine

MyFunc(vertex) modify neighborhood Data

Data Data

Data

Data

function Pagerank(vertex) insum = sum(edgevalue for edge in vertexinedges) vertexvalue = 085 + 015 insum foreach edge in vertexoutedges edgevalue = vertexvalue vertexnum_outedges

18

Computational Setting

ConstraintsA Not enough memory to store the whole

graph in memory nor all the vertex values

B Enough memory to store one vertex and its edges w associated values

19

The Main Challenge of Disk-based Graph Computation

Random Access

~ 100K reads sec (commodity)~ 1M reads sec (high-end arrays)

ltlt 5-10 M random edges sec to achieve ldquoreasonable performancerdquo

20

Random Access Problem

File edge-values

A in-edges A out-edges

A B

B in-edges B out-edges

x

Moral You can either access in- or out-edges sequentially but not both

Random write

Random readProcessing sequentially

21

Our Solution

Parallel Sliding Windows (PSW)

22

Parallel Sliding Windows Phases

bull PSW processes the graph one sub-graph a time

bull In one iteration the whole graph is processedndash And typically next iteration is started

1 Load

2 Compute

3 Write

23

bull Vertices are numbered from 1 to nndash P intervalsndash sub-graph = interval of vertices

PSW Shards and Intervals

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 10000100 700

1 Load

2 Compute

3 Write

ltltpartition-bygtgt

source destinationedge

In shards edges sorted by source

24

Example Layout

Shard 1

Shards small enough to fit in memory balance size of shards

Shard in-edges for interval of vertices sorted by source-id

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Shard 2 Shard 3 Shard 4Shard 1

1 Load

2 Compute

3 Write

25

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Load all in-edgesin memory

Load subgraph for vertices 1100

What about out-edges Arranged in sequence in other shards

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Shard 1

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

bull Bulk-Synchronous Parallel (Jacobi iterations)ndash Updates see neighborsrsquo values from

previous iteration [Most systems are synchronous]

bull Asynchronous (Gauss-Seidel iterations)ndash Updates see most recent valuesndash GraphLab is asynchronous

30

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW runs Gauss-Seidel

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Fresh values in this ldquowindowrdquo from previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph diameter ndashmany iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N - 1) (e ndash 1) asymp 60 of synchronous

Each vertex chooses minimum label of neighbor

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Compressed Sparse Row (CSR):

Pointer-array (Source | File offset): 1 | 0, 3 | 3, 7 | 5, ...
Edge-array (Destination): 8, 193, 76420, 12, 872, 193, 212, 89139, ...

Problem 1: How to find the in-edges of a vertex quickly?
Note: We know the shard, but within it the destinations appear in random order (edges are sorted by source, not destination).

51
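A minimal Python sketch of this basic layout, using in-memory lists in place of the on-disk files; the (source, offset) pairs and destination list mirror the pointer-array and edge-array above.

# In-memory sketch of the basic CSR shard layout: out-edges inside one shard are a contiguous slice.
def build_csr(shard_edges):
    # shard_edges: (src, dst) pairs already sorted by src, as inside a shard
    pointer_array, edge_array, prev_src = [], [], None
    for src, dst in shard_edges:
        if src != prev_src:
            pointer_array.append((src, len(edge_array)))
            prev_src = src
        edge_array.append(dst)
    return pointer_array, edge_array

def out_edges_in_shard(pointer_array, edge_array, v):
    for i, (src, offset) in enumerate(pointer_array):
        if src == v:
            end = pointer_array[i + 1][1] if i + 1 < len(pointer_array) else len(edge_array)
            return edge_array[offset:end]
    return []

ptrs, dsts = build_csr([(1, 8), (1, 193), (1, 76420), (3, 12), (3, 872), (7, 193), (7, 212), (7, 89139)])
# ptrs == [(1, 0), (3, 3), (7, 5)] as on the slide; out_edges_in_shard(ptrs, dsts, 3) == [12, 872]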

PAL In-edge Linkage

Pointer-array (Source | File offset): 1 | 0, 3 | 3, 7 | 5, ...
Edge-array (Destination): 8, 193, 76420, 12, 872, 193, 212, 89139, ...

52

PAL In-edge Linkage

Pointer-array (Source | File offset): 1 | 0, 3 | 3, 7 | 5, ...
Edge-array (Destination | Link): 8 | 3339, 193 | 3, 76420 | 1092, 12 | 289, 872 | 40, 193 | 2002, 212 | 12, 89139 | 22, ...

+ Index to the first in-edge for each vertex in the interval.

Augmented linked list for in-edges.

Problem 2: How to find out-edges quickly?
Note: Sorted inside a shard, but partitioned across all shards.

53
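The linkage can be sketched in Python as follows; this is an in-memory stand-in for the on-disk arrays (the real format stores file offsets rather than list positions).

NO_MORE = -1

def build_in_edge_links(shard_edges):
    # shard_edges: (src, dst) pairs sorted by src, so destinations appear in "random" order
    edge_array = [[dst, NO_MORE] for _, dst in shard_edges]   # (destination, link-to-next-in-edge)
    first_in_edge, last_seen = {}, {}
    for pos, (_, dst) in enumerate(shard_edges):
        if dst in last_seen:
            edge_array[last_seen[dst]][1] = pos               # chain the previous in-edge to this one
        else:
            first_in_edge[dst] = pos                          # index to the first in-edge
        last_seen[dst] = pos
    return edge_array, first_in_edge

def source_of(pointer_array, edge_pos):
    # recover an edge's source from the pointer-array: the largest source whose offset <= edge_pos
    src = None
    for s, offset in pointer_array:
        if offset <= edge_pos:
            src = s
    return src

def in_edges_in_shard(edge_array, first_in_edge, pointer_array, v):
    sources, pos = [], first_in_edge.get(v, NO_MORE)
    while pos != NO_MORE:
        sources.append(source_of(pointer_array, pos))
        pos = edge_array[pos][1]
    return sources

edges = [(1, 8), (1, 193), (1, 76420), (3, 12), (3, 872), (7, 193), (7, 212), (7, 89139)]
ptrs = [(1, 0), (3, 3), (7, 5)]
links, first_in = build_in_edge_links(edges)
# in_edges_in_shard(links, first_in, ptrs, 193) == [1, 7]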

PAL Out-edge Queries

Pointer-array (Source | File offset): 1 | 0, 3 | 3, 7 | 5, ...
Edge-array (Destination | Next-in-offset): 8 | 3339, 193 | 3, 76420 | 1092, 12 | 289, 872 | 40, 193 | 2002, 212 | 12, 89139 | 22, ...

Problem: the pointer-array can be big -- O(V). Binary search on disk is slow.

Option 1: Sparse index (in-memory): 256, 512, 1024, ...
Option 2: Delta-coding with unary code (Elias-Gamma). Completely in-memory.

54
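Option 2 works because the sources in the pointer-array are strictly increasing, so only the gaps need to be stored. A small Python sketch of Elias-Gamma delta-coding (illustrative; not GraphChi-DB's actual encoder):

def elias_gamma_encode(values):
    bits = []
    for v in values:                          # every v must be >= 1
        b = bin(v)[2:]                        # binary, most significant bit first
        bits.append("0" * (len(b) - 1) + b)   # unary length prefix followed by the binary digits
    return "".join(bits)

def elias_gamma_decode(bits):
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros, i = zeros + 1, i + 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

def compress_sources(sources):
    # delta-code a strictly increasing sequence; +1 on the first value keeps every delta >= 1
    deltas = [sources[0] + 1] + [b - a for a, b in zip(sources, sources[1:])]
    return elias_gamma_encode(deltas)

def decompress_sources(bits):
    out, acc = [], -1
    for d in elias_gamma_decode(bits):
        acc += d
        out.append(acc)
    return out

packed = compress_sources([1, 3, 7, 12, 4000])
assert decompress_sources(packed) == [1, 3, 7, 12, 4000]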

Experiment Indices

Median latency, Twitter graph (1.5B edges).

55

Queries: I/O costs

In-edge query: only one shard.
Out-edge query: each shard that has edges.

Trade-off: more shards means better locality for in-edge queries, but worse for out-edge queries.

Edge Data & Searches

Shard X: adjacency + edge-data columns, e.g. 'weight' [float], 'timestamp' [long], 'belief' (factor).

Edge (i): no foreign key is required to find the edge data (fixed size per column entry).

Note: vertex values are stored similarly.

57
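A Python sketch of the idea, with assumed file names and formats (not the real GraphChi-DB file layout): each edge property is a column file of fixed-size entries, so edge i's value is located by offset arithmetic alone.

import struct

class EdgeColumn:
    # One file per edge property; entries are fixed-size, so edge i sits at offset i * size.
    def __init__(self, path, fmt):          # e.g. fmt="f" for 'weight', "q" for 'timestamp'
        self.path, self.fmt = path, fmt
        self.size = struct.calcsize(fmt)

    def get(self, edge_idx):
        with open(self.path, "rb") as f:
            f.seek(edge_idx * self.size)
            return struct.unpack(self.fmt, f.read(self.size))[0]

    def set(self, edge_idx, value):
        with open(self.path, "r+b") as f:
            f.seek(edge_idx * self.size)
            f.write(struct.pack(self.fmt, value))

# Usage: a 'weight' column for three edges; read edge 1's weight back without any index lookup.
with open("weight.col", "wb") as f:
    f.write(struct.pack("fff", 0.5, 1.25, 3.0))
assert EdgeColumn("weight.col", "f").get(1) == 1.25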

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading the existing shard from disk; each edge is rewritten on every merge.

Does not scale: number of rewrites = data size / buffer size = O(E).

60

Log-Structured Merge-tree (LSM)
Ref: O'Neil, Cheng et al. (1996)

LEVEL 1 (youngest): top shards with in-memory buffers; new edges enter here. Intervals 1, 2, 3, ..., P-1, P.
LEVEL 2: on-disk shards covering intervals 1--4, ..., (P-4)--P.
LEVEL 3 (oldest): on-disk shards covering intervals 1--16, ...

Downstream merge moves edges to the older levels.

In-edge query: one shard on each level.
Out-edge query: all shards.

61
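A toy Python rendition of the log-structured merge idea; the buffer size, fan-out and the flat sorted runs are illustrative simplifications (the real structure merges interval-partitioned shards).

BUFFER_CAP = 4      # edges per in-memory buffer before a flush (tiny, for illustration)
FANOUT = 4          # how many level-k runs merge into one level-(k+1) run

class LSMEdges:
    def __init__(self):
        self.buffer = []              # youngest level: in-memory, unsorted
        self.levels = [[], [], []]    # older levels: lists of sorted runs ("shards")

    def add_edge(self, src, dst):
        self.buffer.append((src, dst))
        if len(self.buffer) >= BUFFER_CAP:
            self._flush()

    def _flush(self):
        self.levels[0].append(sorted(self.buffer))
        self.buffer = []
        for k in range(len(self.levels) - 1):         # downstream merge of any full level
            if len(self.levels[k]) >= FANOUT:
                merged = sorted(e for run in self.levels[k] for e in run)
                self.levels[k + 1].append(merged)
                self.levels[k] = []

    def out_edges(self, v):
        # out-edge queries must look at the buffer and at every run on every level
        hits = [d for s, d in self.buffer if s == v]
        for level in self.levels:
            for run in level:
                hits += [d for s, d in run if s == v]
        return hits

db = LSMEdges()
for e in [(1, 2), (3, 4), (1, 5), (2, 3), (1, 9)]:
    db.add_edge(*e)
print(db.out_edges(1))   # [9, 2, 5]: one hit still in the buffer, two in a flushed run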

Experiment Ingest

[Chart: edges ingested (up to 1.5B) vs. time in seconds (up to ~16,000), GraphChi-DB with LSM vs. GraphChi-No-LSM]

62

Advantages of PAL

• Only sparse and implicit indices
– Pointer-array usually fits in RAM with Elias-Gamma coding
– Small database size

• Columnar data model
– Load only the data you need
– Graph structure is separate from data
– Property graph model

• Great insertion throughput with LSM
– Tree can be adjusted to match the workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

• Written in Scala
• Queries & Computation
• Online database

All experiments shown in this talk were done on a Mac Mini (8 GB RAM, SSD).

Source code and examples: http://github.com/graphchi

65

Comparison Database Size

[Chart: database file size (twitter-2010 graph, 1.5B edges) for MySQL (data + indices), Neo4j, GraphChi-DB, and a baseline of 4 + 4 bytes / edge]

66

Comparison Ingest

System | Time to ingest 1.5B edges
GraphChi-DB (ONLINE) | 1 hour 45 mins
Neo4j (batch) | 45 hours
MySQL (batch) | 3 hours 30 minutes (including index creation)

If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes.

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

[Bar charts: latency percentiles (milliseconds) over 100K random queries.
Small graph (68M edges), 50th percentile: GraphChi-DB 0.379 ms, Neo4j 0.127 ms.
Small graph (68M edges), 99th percentile: GraphChi-DB 6.653 ms, Neo4j 8.078 ms.
Big graph (1.5B edges), 50th percentile: GraphChi-DB vs. Neo4j vs. MySQL.
Big graph (1.5B edges), 99th percentile: GraphChi-DB 1264 ms, Neo4j 1631 ms, MySQL 4776 ms.]

68
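For reference, the friends-of-friends query itself is just two rounds of out-neighbor queries; a Python sketch against a hypothetical out_neighbors accessor (not the GraphChi-DB API):

# Friends-of-friends = vertices exactly two hops away, excluding v and its direct friends.
def friends_of_friends(out_neighbors, v):
    friends = set(out_neighbors(v))
    fof = set()
    for f in friends:                       # one out-edge query per friend
        fof.update(out_neighbors(f))
    return fof - friends - {v}

adj = {1: [2, 3], 2: [4], 3: [4, 5]}
print(friends_of_friends(lambda u: adj.get(u, []), 1))   # {4, 5}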

LinkBench: Online Graph DB Benchmark by Facebook

• Concurrent read/write workload
– But only single-hop queries ("friends")
– 8 different operations, mixed workload
– Best performance with 64 parallel threads

• Each edge and vertex has:
– Version, timestamp, type, random string payload

Metric | GraphChi-DB (Mac Mini) | MySQL + FB patch (server, SSD-array, 144 GB RAM)
Edge-update (95p) | 22 ms | 25 ms
Edge-get-neighbors (95p) | 18 ms | 9 ms
Avg throughput | 2,487 req/s | 11,029 req/s
Database size | 350 GB | 1.4 TB

See full results in the thesis

69

Summary of Experiments

• Efficient for mixed read/write workload
– See Facebook LinkBench experiments in the thesis
– LSM-tree trades off read performance (but can adjust)

• State-of-the-art performance for graphs that are much larger than RAM
– Neo4j's linked-list data structure is good for RAM
– DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater Impact / Hindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)
– Two major direct descendant papers in top conferences:
• X-Stream: SOSP 2013
• TurboGraph: KDD 2013

• Challenging the mainstream
– You can do a lot on just a PC: focus on the right data structures and computational models

73

Impact Users

• GraphChi's implementations (C++, Java, -DB) have gained a lot of users
– Currently ~50 unique visitors / day

• Enables 'everyone' to tackle big graph problems
– Especially the recommender toolkit (by Danny Bickson) has been very popular
– Typical users: students, non-systems researchers, small companies...

74

Impact Users (cont)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research: it is too expensive.

Using GraphChi, I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research."

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models

1. Changing the value of an edge (both in- and out-edges)
2. Computation processes the whole graph, or most of it, on each iteration
3. Random access to all of a vertex's edges
– Vertex-centric vs. edge-centric
4. Async / Gauss-Seidel execution

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
– Or dynamic scheduling (such as Splash BP)

• High-diameter graphs, such as planar graphs
– Unless the computation itself has short-range interactions

• Very large number of iterations
– Neural networks
– LDA with Collapsed Gibbs sampling

• No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows) and GraphChi-DB (Partitioned Adjacency Lists)

Batch comp. | Evolving graph | Online graph updates | Incremental comp.
DrunkardMob: Parallel Random walk simulation | GraphChi^2

Graph Computation: Triangle Counting, Item-Item Similarity, PageRank, SALSA, HITS, Graph Contraction, Minimum Spanning Forest, k-Core, Weakly Connected Components, Strongly Connected Components, Community Detection, Label Propagation, Multi-BFS, Matrix Factorization, Loopy Belief Propagation, Co-EM

Graph Queries: Shortest Path, Friends-of-Friends, Induced Subgraphs, Neighborhood query, Edge and vertex properties, Link prediction, Graph sampling, Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

• Distributed setting
1. Distributed PSW (one shard / node)
- Problem 1: PSW is inherently sequential
- Problem 2: low bandwidth in the Cloud
2. Co-operating GraphChi(-DB)'s connected with a Parameter Server

• New graph programming models and tools
– Vertex-centric programming is sometimes too local: for example, two-hop interactions and many traversals are cumbersome
– Abstractions for learning graph structure; implicit graphs
– Hard to debug, especially async; better tools needed

• Graph-aware optimizations to GraphChi-DB
– Buffer management
– Smart caching
– Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM
– Good: works with very little memory
– Bad: does not benefit from more memory

• Simple trick: cache some data

• Many graph algorithms have O(V) state
– The update function accesses neighbor vertex state
• Standard PSW: 'broadcast' the vertex value via edges
• Semi-external: store vertex values in memory

82
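A Python sketch of the semi-external idea for PageRank (illustrative, not the GraphChi API; shards are passed in as plain edge lists, and the damping constants are the usual ones rather than necessarily those used in the thesis):

import numpy as np

def pagerank_semi_external(num_vertices, shards, iterations=5, d=0.85):
    # shards: iterable of edge lists standing in for the on-disk shard files
    out_degree = np.zeros(num_vertices)
    for shard in shards:                    # first streaming pass: out-degrees
        for src, dst in shard:
            out_degree[src] += 1
    ranks = np.full(num_vertices, 1.0)      # the O(V) state that stays in RAM
    for _ in range(iterations):
        sums = np.zeros(num_vertices)
        for shard in shards:                # stream the edges; no in-memory graph needed
            for src, dst in shard:
                sums[dst] += ranks[src] / out_degree[src]
        ranks = (1 - d) + d * sums
    return ranks

print(pagerank_semi_external(4, [[(0, 1), (1, 2)], [(2, 3), (3, 0)]]))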

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
– But not enough to store the whole graph

[Diagram: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state, ...) in RAM]

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
– Store billions of random walk states in RAM

• Multiple Breadth-First Searches
– Analyze neighborhood sizes by starting hundreds of random BFSes

• Compute in parallel many different recommender algorithms (or with different parameterizations)
– See Mayank Mohta and Shu-Hao Yu's Master's project

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin) - UAI 2010
Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors) - VLDB 2012
GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch) - OSDI 2012
DrunkardMob: Billions of Random Walks on Just a PC - ACM RecSys 2013
Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch) - SEA 2014
GraphChi-DB: Simple Design for a Scalable Graph Database - on Just a PC (with C. Guestrin) - submitted (arXiv)
Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin) - ICML 2011

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external memory graph computation: GraphChi

• Extended PSW to design Partitioned Adjacency Lists, to build a scalable graph database: GraphChi-DB

• Proposed DrunkardMob for simulating billions of random walks in parallel

• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms: a new approach for external-memory graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

| GraphChi (40 Mac Minis) | PowerGraph (64 EC2 cc1.4xlarge)
Investments | $67,320 | -
Operating costs:
Per node / hour | $0.03 | $1.30
Cluster / hour | $1.19 | $52.00
Daily | $28.56 | $1,248.00

It takes about 56 days to recoup the Mac Mini investments.

Assumptions:
- Mac Mini: 85 W (typical servers: 500-1000 W)
- Most expensive US energy: 35c / kWh
- Equal-throughput configurations (based on OSDI'12)

89

PSW for In-memory Computation

• External memory setting
– Slow memory = hard disk / SSD
– Fast memory = RAM

• In-memory
– Slow = RAM
– Fast = CPU caches

[Diagram: small (fast) memory of size M; large (slow) memory of 'unlimited' size; block transfers of size B between them and the CPU]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW slightly outperforms Ligra, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
– But needs twice the amount of space

• Async / Gauss-Seidel helps with high-diameter graphs
• Some algorithms converge much better asynchronously
– Loopy BP: see Gonzalez et al. (2009)
– Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)

• Asynchronous is sometimes difficult to reason about and debug
• Asynchronous can be used to implement BSP

IO Complexity

• See the paper for theoretical analysis in Aggarwal and Vitter's I/O model
– Worst case is only 2x the best case

• Intuition:

[Diagram: shard(1), shard(2), ..., shard(P) over interval(1), interval(2), ..., interval(P) of vertices 1..|V|; an edge (v1, v2)]

An edge within one interval is loaded from disk only once per iteration.
An edge spanning two intervals is loaded twice per iteration.

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; a high diameter would require too many iterations

• Per-iteration cost is not much affected by graph structure

• Can we optimize partitioning?
– Could help, thanks to Gauss-Seidel (faster convergence inside "groups"); topological sort
– Likely too expensive to do on a single PC

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
– But they require graph partitioning, which needs a lot of memory
– Compression of large graphs can take days (personal communication)

• Compression is problematic for evolving graphs and associated data

• GraphChi can be used to compress graphs
– Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases
– GOOD, GraphDB, HyperGraphDB, ...
– Focus was on modeling; graph storage on top of a relational DB or key-value store

• RDF databases
– Most do not use graph storage, but store triples as relations + use indexing

• Modern solutions have proposed graph-specific storage
– Neo4j: doubly linked list
– TurboGraph: adjacency list chopped into pages
– DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time: 9 hours; FB's: 12 hours
• GraphChi database about 250 GB; FB's > 1.4 terabytes
– However, about 100 GB is explained by different variable-data (payload) size

• Facebook: MySQL via JDBC; GraphChi: embedded
– But MySQL is native code, GraphChi-DB is Scala (JVM)

• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance vs. size of DB

[Chart: LinkBench requests / sec as a function of the number of nodes (1.0E+07 to 1.0E+09)]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory-extension [SSDAlloc, NSDI'11] -> Too many small objects; need millions / sec.
2. Compress the graph structure to fit into RAM [WebGraph framework] -> Associated values do not compress well and are mutated.
3. Cluster the graph and handle each cluster separately in RAM -> Expensive; the number of inter-cluster edges is big.
4. Caching of hot nodes -> Unpredictable performance.

Number of Shards

• If P is in the "dozens", there is not much effect on performance.

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks: parallel I/O.

[Chart: seconds / iteration for Pagerank and Connected components with 1, 2, and 3 hard drives. Experiment on a 16-core AMD server (from year 2007).]

Bottlenecks

[Chart: runtime breakdown (Disk IO, Graph construction, Exec. updates) for 1, 2, and 4 threads; Connected Components on Mac Mini SSD]

• The cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
– Graph construction requires a lot of random access in RAM: memory bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution.
• GraphChi saturates SSD I/O with 2 threads.

[Charts: runtime in seconds, split into loading and computation, for 1, 2, and 4 threads; Matrix Factorization (ALS) and Connected Components]

104

In-memory vs Disk

105

Experiment Query latency

See thesis for I/O cost analysis of in/out queries.

106

Example Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S.

• Very fast in GraphChi-DB
– Sufficient to query for out-edges
– Parallelizes well: multi-out-edge-query

• Can be used for statistical graph analysis
– Sample induced neighborhoods, induced FoF neighborhoods from the graph
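A Python sketch of why out-edge queries suffice (the out_neighbors accessor is a stand-in, not the GraphChi-DB API): every edge of the induced subgraph is an out-edge of some vertex in S.

def induced_subgraph(out_neighbors, S):
    # one out-edge query per vertex of S; edges whose other endpoint is also in S are kept
    S = set(S)
    return [(v, w) for v in S for w in out_neighbors(v) if w in S]

adj = {1: [2, 3], 2: [3], 3: [1, 4]}
print(induced_subgraph(lambda u: adj.get(u, []), {1, 2, 3}))   # [(1, 2), (1, 3), (2, 3), (3, 1)]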

Vertices / Nodes
• Vertices are partitioned similarly to edges
– Similar "data shards" for columns

• Lookup / update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
– Sparse structure could be supported

[Diagram: vertex ID space 1 ... a, a+1 ... 2a, ..., Max-id]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance the shards
– Interval length is a constant a

[Diagram: original IDs 0, 1, 2, ..., 255 map to internal IDs 0, 1, 2, ...; original IDs 256, 257, 258, ..., 511 map to internal IDs a, a+1, a+2, ...; the next block maps to 2a, 2a+1, ...]

109
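One block-interleaved mapping consistent with the picture above can be sketched as follows; this is an assumption about the scheme, not the exact GraphChi-DB formula.

BLOCK = 256   # consecutive original IDs are assigned in blocks of 256

def to_internal(original_id, a, num_intervals):
    # a: interval length (assumed a multiple of BLOCK); blocks are dealt round-robin to intervals
    block, offset = divmod(original_id, BLOCK)
    interval = block % num_intervals          # which interval this block lands in
    slot = block // num_intervals             # how many blocks that interval already holds
    return interval * a + slot * BLOCK + offset

def to_original(internal_id, a, num_intervals):
    interval, within = divmod(internal_id, a)
    slot, offset = divmod(within, BLOCK)
    return (slot * num_intervals + interval) * BLOCK + offset

# With interval length a = 1024 and 4 intervals: original ID 256 maps to the start of the
# second interval (internal ID a), matching the picture, and the mapping is invertible.
assert to_internal(256, 1024, 4) == 1024
assert to_original(to_internal(511, 1024, 4), 1024, 4) == 511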

What if we have a Cluster

[Diagram: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state, ...) in RAM]

Trade latency for throughput.

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications

2. ... which is caused by the lack of good data available to academics: big graphs with metadata
– Industry co-operation? But problem with reproducibility
– Also, it is hard to ask good questions about graphs (especially with just the structure)

3. Too much focus on performance. More important to enable "extracting value."

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems: each hop across a partition boundary is costly.
Disk-based "single-machine" graph systems: "paging" from disk is costly.

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm: reverse thinking

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk!
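A toy, fully in-memory Python rendition of this bucketing idea (the real system streams each interval's sub-graph from disk and keeps only the walks' current positions in RAM; the names and parameters here are illustrative):

import random

def drunkardmob(out_neighbors, num_vertices, num_walks, num_steps, num_intervals=4):
    interval_of = lambda v: v * num_intervals // num_vertices
    walks = [random.randrange(num_vertices) for _ in range(num_walks)]   # state = current vertex only
    for _ in range(num_steps):
        for p in range(num_intervals):
            # snapshot of the walks currently parked in interval p (its sub-graph would be loaded now)
            snapshot = [i for i, v in enumerate(walks) if interval_of(v) == p]
            for i in snapshot:
                neighbors = out_neighbors(walks[i])
                if neighbors:
                    walks[i] = random.choice(neighbors)   # hop; the walk is re-bucketed automatically
    return walks

adj = {0: [1], 1: [2], 2: [3], 3: [0]}
print(drunkardmob(lambda v: adj.get(v, []), 4, num_walks=10, num_steps=5))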

Page 14: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

14

DISK-BASED GRAPH COMPUTATION

15

GraphChi (Parallel Sliding Windows)

Batchcomp

Evolving graph

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Matrix Factorization

Loopy Belief PropagationCo-EM

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

DrunkardMob Parallel Random walk simulation

GraphChi^2

Incremental comp

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Graph sampling

Graph Queries

Graph traversal

16

Computational Model

bull Graph G = (V E) ndash directed edges e =

(source destination)ndash each edge and vertex

associated with a value (user-defined type)

ndash vertex and edge values can be modifiedbull (structure modification also

supported) Data

Data Data

Data

DataDataData

DataData

Data

A Be

Terms e is an out-edge of A and in-edge of B

17

Data

Data Data

Data

DataDataData

Data

Data

Data

Vertex-centric Programming

bull ldquoThink like a vertexrdquobull Popularized by the Pregel and GraphLab

projectsndash Historically systolic computation and the Connection Machine

MyFunc(vertex) modify neighborhood Data

Data Data

Data

Data

function Pagerank(vertex) insum = sum(edgevalue for edge in vertexinedges) vertexvalue = 085 + 015 insum foreach edge in vertexoutedges edgevalue = vertexvalue vertexnum_outedges

18

Computational Setting

ConstraintsA Not enough memory to store the whole

graph in memory nor all the vertex values

B Enough memory to store one vertex and its edges w associated values

19

The Main Challenge of Disk-based Graph Computation

Random Access

~ 100K reads sec (commodity)~ 1M reads sec (high-end arrays)

ltlt 5-10 M random edges sec to achieve ldquoreasonable performancerdquo

20

Random Access Problem

File: edge-values. A's in-edges, A's out-edges, B's in-edges and B's out-edges lie in different parts of the file, so processing one side sequentially forces random reads and random writes on the other.

Moral: You can either access in-edges or out-edges sequentially, but not both.

21

Our Solution

Parallel Sliding Windows (PSW)

22

Parallel Sliding Windows Phases

• PSW processes the graph one sub-graph at a time
• In one iteration, the whole graph is processed
  – And typically, the next iteration is started

1 Load

2 Compute

3 Write

23

PSW: Shards and Intervals

• Vertices are numbered from 1 to n
  – P intervals
  – sub-graph = interval of vertices

interval(1)   interval(2)   ...   interval(P)     (example boundaries: 1 ... 100 ... 700 ... 10000)
shard(1)      shard(2)      ...   shard(P)

1. Load
2. Compute
3. Write

Each edge (source, destination) is <<partitioned by destination>> into a shard; in shards, edges are sorted by source.

24

Example Layout

Shard 1: in-edges for vertices 1..100, sorted by source_id.

Shards are small enough to fit in memory; the sizes of the shards are balanced.
A shard stores the in-edges for an interval of vertices, sorted by source-id.

Intervals: vertices 1..100 (Shard 1), 101..700 (Shard 2), 701..1000 (Shard 3), 1001..10000 (Shard 4).

1 Load

2 Compute

3 Write

25

PSW: Loading Sub-graph

Load the subgraph for vertices 1..100: all of their in-edges are loaded into memory from Shard 1 (in-edges for vertices 1..100, sorted by source_id).

What about out-edges? They are arranged in sequence in the other shards (Shards 2, 3, 4: intervals 101..700, 701..1000, 1001..10000).

1 Load

2 Compute

3 Write

26

PSW: Loading Sub-graph (2)

Load the subgraph for vertices 101..700: all of their in-edges are loaded into memory from Shard 2, while the corresponding out-edge blocks of Shards 1, 3 and 4 are kept in memory through the sliding windows.

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and P large writes for each interval
=> P^2 random accesses on one full pass

Works well on both SSD and magnetic hard disks!
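As an illustration of the shard layout and of what one Load/Compute/Write pass touches, here is a purely in-memory toy sketch (Python lists stand in for shard files; none of the names are GraphChi's API). It shows the key property: because each shard is sorted by source, the out-edges of an interval form one contiguous block per shard, which real PSW reads and writes as a single window.

```python
def build_shards(edges, intervals):
    """Shard p stores the in-edges of interval p (edges whose destination falls in it), sorted by source."""
    return [sorted(e for e in edges if lo <= e[1] <= hi) for lo, hi in intervals]

def psw_pass(shards, intervals):
    """One full pass: for interval p, in-edges come from shard p, out-edges from a window of every shard."""
    for p, (lo, hi) in enumerate(intervals):
        in_edges = shards[p]                                            # memory-shard, loaded fully
        out_edges = [e for s in shards for e in s if lo <= e[0] <= hi]  # contiguous block in each shard
        yield p, in_edges, out_edges

edges = [(1, 3), (2, 1), (3, 4), (4, 2), (1, 4)]
intervals = [(1, 2), (3, 4)]
for p, ins, outs in psw_pass(build_shards(edges, intervals), intervals):
    print("interval", p, "in-edges:", ins, "out-edges:", outs)
```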

28

"GAUSS-SEIDEL" ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

• Bulk-Synchronous Parallel (Jacobi iterations)
  – Updates see neighbors' values from the previous iteration [Most systems are synchronous]
• Asynchronous (Gauss-Seidel iterations)
  – Updates see the most recent values
  – GraphLab is asynchronous

30

PSW runs Gauss-Seidel

Load the subgraph for vertices 101..700: all of their in-edges come from Shard 2 (loaded fully into memory), while the out-edge blocks of Shards 1, 3 and 4 are in memory through the sliding windows. Fresh values appear in this "window" from the previous phase, so the updates already see them.

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires as many iterations as the graph diameter to propagate the minimum label.

Each vertex chooses the minimum label of its neighbors.

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel: expected number of iterations on a random schedule, on a chain graph:

= (N - 1) / (e - 1)  ≈  60% of synchronous

Each vertex chooses the minimum label of its neighbors.
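To see the difference concretely, here is a small runnable comparison of the two schedules for min-label propagation on a chain (a toy sketch: the asynchronous sweep uses a fixed left-to-right order, whereas the ~60% figure on the slide is for a random schedule):

```python
def propagate(labels, asynchronous):
    """Repeated min-label sweeps over a chain; returns the number of sweeps until no change."""
    n, sweeps = len(labels), 0
    while True:
        src = labels if asynchronous else list(labels)   # Gauss-Seidel reads fresh values, Jacobi a snapshot
        changed = False
        for i in range(n):
            neighbors = [src[j] for j in (i - 1, i + 1) if 0 <= j < n]
            new = min([src[i]] + neighbors)
            if new != labels[i]:
                labels[i], changed = new, True
        sweeps += 1
        if not changed:
            return sweeps

N = 100
print("synchronous (Jacobi):       ", propagate(list(range(1, N + 1)), asynchronous=False))  # ~ graph diameter
print("asynchronous (Gauss-Seidel):", propagate(list(range(1, N + 1)), asynchronous=True))   # far fewer sweeps
```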

33

Open theoretical question

Label Propagation (side length = 100), iterations to converge:

  Synchronous: 199      PSW Gauss-Seidel (average, random schedule): ~57
  Synchronous: 100      PSW Gauss-Seidel: ~29
  Synchronous: 298      PSW Gauss-Seidel: ~52

Natural graphs (web, social): synchronous needs about (graph diameter - 1) iterations; PSW Gauss-Seidel about 0.6 x diameter.

Joint work Julian Shun

Chapter 6

34

PSW & External Memory Algorithms Research

• PSW is a new technique for implementing many fundamental graph algorithms
  – Especially simple (compared to previous work) for directed graph problems: PSW handles both in- and out-edges
• We propose a new graph contraction algorithm based on PSW
  – Minimum Spanning Forest & Connected Components
• ... utilizing the Gauss-Seidel "acceleration"

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluation:
• HD vs. SSD
• Striping data across multiple hard drives
• Comparison to an in-memory version
• Bottlenecks analysis
• Effect of the number of shards
• Block size and performance

37

GraphChi

• C++ implementation: 8,000 lines of code
  – Java implementation also available
• Several optimizations to PSW (see paper)

Source code and examples: http://github.com/graphchi

38

Experiment Setting

• Mac Mini (Apple Inc.)
  – 8 GB RAM
  – 256 GB SSD, 1 TB hard drive
  – Intel Core i5, 2.5 GHz
• Experiment graphs:

Graph          Vertices   Edges   P (shards)   Preprocessing
live-journal   4.8M       69M     3            0.5 min
netflix        0.5M       99M     20           1 min
twitter-2010   42M        1.5B    20           2 min
uk-2007-05     106M       3.7B    40           31 min
uk-union       133M       5.4B    50           33 min
yahoo-web      1.4B       6.6B    50           37 min

39

Comparison to Existing Systems

Notes: comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all but GraphLab compute synchronously.

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

[Bar charts, runtime in minutes:
  Netflix, Matrix Factorization (ALS): GraphLab v1 (8 cores) vs. GraphChi (Mac Mini)
  Twitter-2010 (1.5B edges), PageRank: Spark (50 machines) vs. GraphChi (Mac Mini)
  Yahoo-web (6.7B edges), Belief Propagation: Pegasus / Hadoop (100 machines) vs. GraphChi (Mac Mini)
  Twitter-2010 (1.5B edges), Triangle Counting: Hadoop (1,636 machines) vs. GraphChi (Mac Mini)]

On a Mac Mini, GraphChi can solve as big problems as existing large-scale systems, with comparable performance.

40

PowerGraph Comparison

• PowerGraph / GraphLab 2 outperforms previous systems by a wide margin on natural graphs
• With 64x more machines, 512x more CPUs:
  – Pagerank: 40x faster than GraphChi
  – Triangle counting: 30x faster than GraphChi

OSDI '12

GraphChi has good performance per CPU.

41

In-memory Comparison

• Comparison to state-of-the-art: 5 iterations of PageRank on Twitter (1.5B edges)
• Total runtime comparison to 1-shard GraphChi, with initial load + output write taken into account

  GraphChi (Mac Mini, SSD):                            790 secs
  Ligra (J. Shun, Blelloch), 40-core Intel E7-8870:    15 secs
  Ligra (J. Shun, Blelloch), 8-core Xeon 5550:         230 s + preproc 144 s
  PSW in-mem version, 700 shards (see Appendix),
    8-core Xeon 5550:                                  100 s + preproc 210 s

However, sometimes a better algorithm is available for in-memory computation than for external-memory / distributed computation.

42

Scalability Input Size [SSD]

• Throughput: number of edges processed / second

Conclusion: the throughput remains roughly constant when the graph size is increased.
GraphChi with a hard drive is ~2x slower than with an SSD (if the computational cost is low).

[Plot: PageRank throughput (edges/sec) vs. graph size (0 to ~7x10^9 edges) on Mac Mini SSD, for domain, twitter-2010, uk-2007-05, uk-union and yahoo-web.]

Paper: scalability of other applications.

43

GRAPHCHI-DB (new work)

44

[Overview map, repeated:]

GraphChi (Parallel Sliding Windows): batch computation, incremental computation, evolving graphs.
GraphChi-DB (Partitioned Adjacency Lists): online graph updates.
DrunkardMob: parallel random walk simulation. GraphChi^2.

Graph Computation: Triangle Counting, Item-Item Similarity, PageRank, SALSA, HITS, Graph Contraction, Minimum Spanning Forest, k-Core, Weakly/Strongly Connected Components, Community Detection, Label Propagation, Multi-BFS, Matrix Factorization, Loopy Belief Propagation, Co-EM.

Graph Queries: Shortest Path, Friends-of-Friends, Induced Subgraphs, Neighborhood query, Edge and vertex properties, Link prediction, Graph sampling, Graph traversal.

45

Research Questions

• What if there is a lot of metadata associated with edges and vertices?
• How to do graph queries efficiently, while retaining the computational capabilities?
• How to add edges to the graph efficiently?

Can we design a graph database based on GraphChi?

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

   Problems:
   • Poor performance with data >> memory
   • No/weak support for analytical computation

2) Relational / key-value databases used as graph storage

   Problems:
   • Large indices
   • In-edge / out-edge dilemma
   • No/weak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1 (sorted by source_id), Shard 2, Shard 3, ..., Shard P
Interval boundaries: 0 ... 100, 101 ... 700, ..., N

Edge = (src, dst). Partition by dst.

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Pointer-array (src -> file offset): 1 -> 0, 3 -> 3, 7 -> 5, ...
Edge-array (destinations): 8, 193, 76420, 12, 872, 193, 212, 89139, ...

Compressed Sparse Row (CSR)

Problem 1: How to find the in-edges of a vertex quickly?
Note: we know the shard, but inside it the edges of a given destination are in random order.

51

PAL In-edge Linkage

Pointer-array (src -> file offset): 1 -> 0, 3 -> 3, 7 -> 5, ...
Edge-array (destinations): 8, 193, 76420, 12, 872, 193, 212, 89139, ...

52

PAL In-edge Linkage

Augmented linked list for in-edges:

Pointer-array (src -> file offset): 1 -> 0, 3 -> 3, 7 -> 5, ...
Edge-array as (Destination, Link) pairs: (8, 3339), (193, 3), (76420, 1092), (12, 289), (872, 40), (193, 2002), (212, 12), (89139, 22), ...
+ Index to the first in-edge for each vertex in the interval.

Problem 2: How to find out-edges quickly?
Note: they are sorted inside a shard, but partitioned across all shards.
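A minimal in-memory sketch of the two lookups on this layout, using the example numbers above (illustrative only; the helper names are not the GraphChi-DB API, and real shards live in files):

```python
# edge_array[i] = (destination, link), where link is the index of the next in-edge of the
# same destination inside this shard; first_in[dst] points to its first in-edge (-1 = none).

pointer_array = {1: 0, 3: 3, 7: 5}                        # source -> offset of its first out-edge
edge_array = [(8, -1), (193, 5), (76420, -1),             # out-edges of source 1
              (12, -1), (872, -1),                        # out-edges of source 3
              (193, -1), (212, -1), (89139, -1)]          # out-edges of source 7
first_in = {193: 1}                                        # per-interval index of first in-edges

def out_edges(src):
    """Out-edge query inside one shard: a contiguous range starting at pointer_array[src]."""
    sources = sorted(pointer_array)
    start = pointer_array[src]
    nxt = sources.index(src) + 1
    end = pointer_array[sources[nxt]] if nxt < len(sources) else len(edge_array)
    return [dst for dst, _ in edge_array[start:end]]

def in_edges(dst):
    """In-edge query: follow the augmented linked list from the first in-edge of dst."""
    result, i = [], first_in.get(dst, -1)
    while i != -1:
        _, link = edge_array[i]
        result.append(i)                                  # offset of one in-edge of dst in this shard
        i = link
    return result

print(out_edges(7))       # [193, 212, 89139]
print(in_edges(193))      # offsets [1, 5]
```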

53

PAL: Out-edge Queries

Pointer-array (Source -> file offset: 1 -> 0, 3 -> 3, 7 -> 5, ...) and Edge-array with (Destination, Next-in-offset) pairs, as above.

Problem: the pointer-array can be big -- O(V) -- and binary search on disk is slow.

Option 1: Sparse index (in-memory), e.g. one entry per block of edges (offsets 256, 512, 1024, ...).
Option 2: Delta-coding with unary code (Elias-Gamma); completely in-memory (see the sketch below).
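Option 2 relies on gamma-coding the gaps between consecutive entries so that the whole pointer-array fits in RAM. A small sketch of the standard Elias-Gamma code (textbook version; GraphChi-DB's exact bit layout may differ):

```python
def gamma_encode(n):                      # n >= 1: unary length prefix + binary value
    b = bin(n)[2:]                        # e.g. 9 -> '1001'
    return '0' * (len(b) - 1) + b         # '000' + '1001'

def gamma_decode(bits, pos=0):
    zeros = 0
    while bits[pos + zeros] == '0':
        zeros += 1
    value = int(bits[pos + zeros:pos + 2 * zeros + 1], 2)
    return value, pos + 2 * zeros + 1     # decoded value, next read position

sources = [1, 3, 7, 12, 30]
gaps = [sources[0]] + [b - a for a, b in zip(sources, sources[1:])]
encoded = ''.join(gamma_encode(g) for g in gaps)
print(encoded)                            # a few bits per entry

decoded, pos, acc = [], 0, 0
while pos < len(encoded):
    g, pos = gamma_decode(encoded, pos)
    acc += g
    decoded.append(acc)
print(decoded == sources)                 # True
```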

54

Experiment Indices

Median latency, Twitter graph, 1.5B edges.

55

Queries: I/O costs

In-edge query: only one shard.
Out-edge query: each shard that has edges.

Trade-off: more shards => better locality for in-edge queries, worse for out-edge queries.

Edge Data & Searches

Shard X: adjacency; 'weight' [float]; 'timestamp' [long]; 'belief' (factor).

Edge (i): no foreign key is required to find the edge data (fixed size).
Note: vertex values are stored similarly.

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

A merge requires loading the existing shard from disk, and every edge is rewritten on each merge.

Does not scale: number of rewrites = data size / buffer size = O(E).

60

Log-Structured Merge-tree (LSM)
Ref: O'Neil, Cheng et al. (1996)

LEVEL 1 (youngest): top shards with in-memory buffers, one per interval (interval 1, 2, ..., P-1, P); new edges arrive here.
LEVEL 2: on-disk shards covering groups of intervals (intervals 1--4, ..., (P-4)--P), filled by downstream merges.
LEVEL 3 (oldest): larger on-disk shards (e.g. intervals 1--16).

In-edge query: one shard on each level.
Out-edge query: all shards.
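A toy sketch of the downstream-merge idea (ignoring the per-interval partitioning and edge sorting details; lists stand in for shard files, and the capacities are made up for illustration):

```python
LEVEL_CAPACITY = [4, 16, 64]                  # made-up per-level capacities for the toy

def insert(levels, edge):
    levels[0].append(edge)                    # new edges always enter the youngest level
    for i in range(len(levels) - 1):
        if len(levels[i]) > LEVEL_CAPACITY[i]:
            levels[i + 1] = sorted(levels[i + 1] + levels[i])   # downstream merge
            levels[i] = []

levels = [[], [], []]
for e in [(s, d) for s in range(6) for d in range(3)]:
    insert(levels, e)
print([len(l) for l in levels])               # e.g. [3, 15, 0]: most edges have moved past the buffer
```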

61

Experiment Ingest

[Plot: edges ingested (0 to 1.5x10^9) vs. time (0 to 16,000 seconds), GraphChi-DB with LSM vs. GraphChi-No-LSM.]

62

Advantages of PAL

• Only sparse and implicit indices
  – The pointer-array usually fits in RAM with Elias-Gamma coding => small database size
• Columnar data model
  – Load only the data you need
  – Graph structure is separate from the data
  – Property graph model
• Great insertion throughput with LSM
  – The tree can be adjusted to match the workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

• Written in Scala
• Queries & Computation
• Online database

All experiments shown in this talk were done on a Mac Mini (8 GB RAM, SSD).

Source code and examples: http://github.com/graphchi

65

Comparison Database Size

Baseline: 4 + 4 bytes / edge.

[Bar chart: database file size for the twitter-2010 graph (1.5B edges), scale 0-70: MySQL (data + indices), Neo4j, GraphChi-DB, BASELINE.]

66

Comparison Ingest

System                  Time to ingest 1.5B edges
GraphChi-DB (ONLINE)    1 hour 45 mins
Neo4j (batch)           45 hours
MySQL (batch)           3 hours 30 minutes (including index creation)

If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes.

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

[Bar charts: friends-of-friends query latency in milliseconds, percentiles over 100K random queries.
  Small graph (68M edges): 50th and 99th percentile, GraphChi-DB vs. Neo4j.
  Big graph (1.5B edges): 50th and 99th percentile, GraphChi-DB vs. Neo4j vs. MySQL.]

68

LinkBench Online Graph DB Benchmark by Facebook

• Concurrent read/write workload
  – But only single-hop queries ("friends")
  – 8 different operations, mixed workload
  – Best performance with 64 parallel threads
• Each edge and vertex has:
  – Version, timestamp, type, random string payload

                           GraphChi-DB (Mac Mini)    MySQL + FB patch (server: SSD-array, 144 GB RAM)
Edge-update (95p)          22 ms                     25 ms
Edge-get-neighbors (95p)   18 ms                     9 ms
Avg throughput             2,487 req/s               11,029 req/s
Database size              350 GB                    1.4 TB

See full results in the thesis

69

Summary of Experiments

• Efficient for mixed read/write workloads
  – See the Facebook LinkBench experiments in the thesis
  – The LSM-tree trades off read performance (but can be adjusted)
• State-of-the-art performance for graphs that are much larger than RAM
  – Neo4J's linked-list data structure is good for RAM
  – DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact: "Big Data" Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)
  – Two major direct descendant papers in top conferences:
    • X-Stream: SOSP 2013
    • TurboGraph: KDD 2013
• Challenging the mainstream
  – You can do a lot on just a PC: focus on the right data structures and computational models

73

Impact Users

• GraphChi's implementations (C++, Java, -DB) have gained a lot of users
  – Currently ~50 unique visitors / day
• Enables 'everyone' to tackle big graph problems
  – Especially the recommender toolkit (by Danny Bickson) has been very popular
  – Typical users: students, non-systems researchers, small companies...

74

Impact Users (cont)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research, it is too expensive.

Using GraphChi, I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research."

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models

1. Changing the value of an edge (both in- and out-edges)
2. Computation processes the whole graph, or most of it, on each iteration
3. Random access to all of a vertex's edges
   – Vertex-centric vs. edge-centric
4. Async / Gauss-Seidel execution

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
  – Or dynamic scheduling (such as Splash BP)
• High-diameter graphs, such as planar graphs
  – Unless the computation itself has short-range interactions
• Very large number of iterations
  – Neural networks
  – LDA with Collapsed Gibbs sampling
• No support for implicit graph structure

+ Single-PC performance is limited.

78

[Overview map, repeated:]

GraphChi (Parallel Sliding Windows): batch computation, incremental computation, evolving graphs.
GraphChi-DB (Partitioned Adjacency Lists): online graph updates.
DrunkardMob: parallel random walk simulation. GraphChi^2.

Graph Computation: Triangle Counting, Item-Item Similarity, PageRank, SALSA, HITS, Graph Contraction, Minimum Spanning Forest, k-Core, Weakly/Strongly Connected Components, Community Detection, Label Propagation, Multi-BFS, Matrix Factorization, Loopy Belief Propagation, Co-EM.

Graph Queries: Shortest Path, Friends-of-Friends, Induced Subgraphs, Neighborhood query, Edge and vertex properties, Link prediction, Graph sampling, Graph traversal.

Versatility of PSW and PAL

79

Future Research Directions

• Distributed setting
  1. Distributed PSW (one shard / node)?
     1. PSW is inherently sequential
     2. Low bandwidth in the Cloud
  2. Co-operating GraphChi(-DB)'s connected with a Parameter Server
• New graph programming models and tools
  – Vertex-centric programming is sometimes too local: for example, two-hop interactions and many traversals are cumbersome
  – Abstractions for learning graph structure; implicit graphs
  – Hard to debug, especially async; better tools are needed
• Graph-aware optimizations to GraphChi-DB
  – Buffer management
  – Smart caching
  – Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM
  – Good: works with very little memory
  – Bad: does not benefit from more memory
• Simple trick: cache some data
• Many graph algorithms have O(V) state
  – The update function accesses neighbor vertex state
    • Standard PSW: 'broadcast' the vertex value via edges
    • Semi-external: store the vertex values in memory

82

Using RAM efficiently

• Assume that there is enough RAM to store many O(V) algorithm states in memory
  – But not enough to store the whole graph

Disk: graph + working memory (PSW). RAM: computational state (Computation 1 state, Computation 2 state, ...). A minimal sketch follows.
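A miniature of the semi-external trick, under the assumption that the O(V) state fits in RAM while edges are only streamed through (the edge generator below is a stand-in for reading shards from disk):

```python
import random

def gen_edges(n, m, seed=1):
    """Stand-in for streaming edges from shard files on disk."""
    rnd = random.Random(seed)
    for _ in range(m):
        yield rnd.randrange(n), rnd.randrange(n)

n = 100_000
in_degree = [0] * n                       # O(V) state kept fully in RAM
for src, dst in gen_edges(n, 500_000):    # edges are streamed, never stored
    in_degree[dst] += 1                   # the update only touches the in-memory array
print(max(in_degree))
```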

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
  – Store billions of random-walk states in RAM
• Multiple Breadth-First Searches
  – Analyze neighborhood sizes by starting hundreds of random BFSes
• Compute in parallel many different recommender algorithms (or the same one with different parameterizations)
  – See Mayank Mohta and Shu-Hao Yu's Master's project

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation => GraphChi
• Extended PSW to design Partitioned Adjacency Lists, to build a scalable graph database => GraphChi-DB
• Proposed DrunkardMob for simulating billions of random walks in parallel
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms => a new approach for external-memory graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                     GraphChi (40 Mac Minis)    PowerGraph (64 EC2 cc1.4xlarge)
Investment           67,320 $                   -
Operating costs:
  Per node, hour     0.03 $                     1.30 $
  Cluster, hour      1.19 $                     52.00 $
  Daily              28.56 $                    1,248.00 $

It takes about 56 days to recoup the Mac Mini investment.

Assumptions: Mac Mini 85 W (typical servers: 500-1000 W); most expensive US energy: 35c / kWh.
Equal-throughput configurations (based on OSDI '12).
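As a quick sanity check of the 56-day figure from the table above: the daily saving is 1,248.00 $ - 28.56 $ = 1,219.44 $, so the payback time is roughly 67,320 $ / 1,219.44 $ per day ≈ 55 days, i.e. about eight weeks of continuous operation.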

89

PSW for In-memory Computation

• External-memory setting
  – Slow memory = hard disk / SSD
  – Fast memory = RAM
• In-memory
  – Slow = RAM
  – Fast = CPU caches

[Diagram: CPU <-> small (fast) memory of size M <-> large (slow) memory of 'unlimited' size, with block transfers of size B.]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
  – But it needs twice the amount of space
• Async / Gauss-Seidel helps with high-diameter graphs
• Some algorithms converge much better asynchronously
  – Loopy BP, see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)
• Asynchronous execution is sometimes difficult to reason about and debug
• Asynchronous execution can be used to implement BSP

IO Complexity

• See the paper for theoretical analysis in the Aggarwal-Vitter I/O model
  – Worst case is only 2x the best case
• Intuition:

[Diagram: intervals interval(1) ... interval(P) over vertices 1 ... |V|, with shards shard(1) ... shard(P) and two example edges v1, v2.]

An edge inside one interval is loaded from disk only once per iteration.
An edge spanning two intervals is loaded twice per iteration.
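Spelling that intuition out (a rough paraphrase, not the paper's exact theorem): per full pass every edge is read once or twice and written once or twice, so the number of block transfers lies between roughly

  2|E| / B   and   4|E| / B,   plus Theta(P^2) non-sequential reads for the sliding windows,

which is why the worst case is only about 2x the best case.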

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise they would require too many iterations
• The per-iteration cost is not much affected
• Can we optimize partitioning?
  – Could help, thanks to Gauss-Seidel (faster convergence inside "groups"), topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
  – But they require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)
• Compression is problematic for evolving graphs and associated data
• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases
  – GOOD, GraphDB, HyperGraphDB...
  – Focus was on modeling; graph storage was built on top of a relational DB or key-value store
• RDF databases
  – Most do not use graph storage, but store triples as relations + use indexing
• Modern solutions have proposed graph-specific storage
  – Neo4j: doubly linked list
  – TurboGraph: adjacency list chopped into pages
  – DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time: 9 hours; FB's: 12 hours
• GraphChi database about 250 GB; FB > 1.4 terabytes
  – However, about 100 GB is explained by a different variable-data (payload) size
• Facebook: MySQL via JDBC; GraphChi: embedded
  – But MySQL is native code, GraphChi-DB is Scala (JVM)
• Important: a CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench, GraphChi-DB performance / size of DB

[Plot: requests/sec (0 to 10,000) vs. number of nodes (10^7 to 10^9).]

Why not even faster when everything is in RAM? The requests are computationally bound.

Possible Solutions

1. Use SSD as a memory-extension [SSDAlloc, NSDI '11]
   – Too many small objects; need millions / sec
2. Compress the graph structure to fit into RAM [WebGraph framework]
   – Associated values do not compress well, and are mutated
3. Cluster the graph and handle each cluster separately in RAM
   – Expensive: the number of inter-cluster edges is big
4. Caching of hot nodes
   – Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance.

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks => parallel I/O

[Bar chart: seconds / iteration (0 to 3,000) for Pagerank and Connected components with 1, 2 and 3 hard drives.]

Experiment on a 16-core AMD server (from year 2007).

Bottlenecks

[Bar chart: Connected Components on Mac Mini (SSD); runtime (0 to 2,500 s) split into Disk IO / Graph construction / Exec. updates, for 1, 2 and 4 threads.]

• The cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
  – Graph construction requires a lot of random access in RAM => memory bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on a MacBook Pro with 4 cores / SSD.

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD I/O with 2 threads

[Bar charts: runtime in seconds, split into Loading and Computation, vs. number of threads (1, 2, 4), for Matrix Factorization (ALS) and Connected Components.]

104

In-memory vs Disk

105

Experiment Query latency

See the thesis for an I/O cost analysis of in/out queries.

106

Example Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S
• Very fast in GraphChi-DB
  – Sufficient to query for out-edges (see the sketch below)
  – Parallelizes well: multi-out-edge-query
• Can be used for statistical graph analysis
  – Sample induced neighborhoods, or induced friends-of-friends neighborhoods, from the graph
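A sketch of how an induced-subgraph query reduces to out-edge queries (illustrative; out_edges below is a stand-in for GraphChi-DB's multi-out-edge query):

```python
def induced_subgraph(S, out_edges):
    """All edges (u, v) with both endpoints in S, obtained purely from out-edge queries."""
    S = set(S)
    return [(u, v) for u in S for v in out_edges(u) if v in S]

adj = {1: [2, 3], 2: [3], 3: [4], 4: [1]}                      # tiny in-memory stand-in for the DB
print(induced_subgraph({1, 2, 3}, lambda u: adj.get(u, [])))   # edges among vertices {1, 2, 3}
```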

Vertices / Nodes

• Vertices are partitioned similarly to edges
  – Similar "data shards" for columns
• Lookup/update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
  – A sparse structure could be supported

[Diagram: vertex files partitioned at 1, a, a+1, ..., 2a, ..., Max-id.]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance the shards
  – Interval length: constant a

[Table: internal IDs 0, 1, 2, ...; a, a+1, a+2, ...; 2a, 2a+1, ... against original IDs 0, 1, 2, ..., 255 and 256, 257, 258, ..., 511.]

109

What if we have a Cluster

[Diagram: graph + working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state, ...) in RAM.]

Trade latency for throughput!

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications
2. ... which is caused by the lack of good data available to academics: big graphs with metadata
   – Industry co-operation? But there is a problem with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just the structure)
3. Too much focus on performance. More important to enable "extracting value".

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i in 1..numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java, 2009.

Distributed graph systems: each hop across a partition boundary is costly.
Disk-based "single-machine" graph systems (this talk): "paging" from disk is costly.

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
  – Reverse thinking:

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: only the current position of each walk needs to be stored.
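A toy, fully in-memory rendering of that loop (in real DrunkardMob the graph does not fit in RAM and the walks are kept in compact per-interval buckets; here plain dictionaries stand in for both):

```python
import random

adj = {v: [(v + 1) % 8, (v + 3) % 8] for v in range(8)}     # tiny graph with 8 vertices
intervals = [range(0, 4), range(4, 8)]                       # two vertex intervals
walk_pos = {w: random.randrange(8) for w in range(10)}       # walk id -> current vertex

for sweep in range(5):                                       # PSW-style passes over the graph
    snapshot = dict(walk_pos)                                # walks observed at the start of the pass
    for interval in intervals:                               # "load" one interval at a time
        for w, v in snapshot.items():
            if v in interval:                                # every walk sitting in this interval
                walk_pos[w] = random.choice(adj[v])          # ... takes one hop
print(walk_pos)
```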

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 15: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

15

GraphChi (Parallel Sliding Windows)

Batchcomp

Evolving graph

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Matrix Factorization

Loopy Belief PropagationCo-EM

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

DrunkardMob Parallel Random walk simulation

GraphChi^2

Incremental comp

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Graph sampling

Graph Queries

Graph traversal

16

Computational Model

bull Graph G = (V E) ndash directed edges e =

(source destination)ndash each edge and vertex

associated with a value (user-defined type)

ndash vertex and edge values can be modifiedbull (structure modification also

supported) Data

Data Data

Data

DataDataData

DataData

Data

A Be

Terms e is an out-edge of A and in-edge of B

17

Data

Data Data

Data

DataDataData

Data

Data

Data

Vertex-centric Programming

bull ldquoThink like a vertexrdquobull Popularized by the Pregel and GraphLab

projectsndash Historically systolic computation and the Connection Machine

MyFunc(vertex) modify neighborhood Data

Data Data

Data

Data

function Pagerank(vertex) insum = sum(edgevalue for edge in vertexinedges) vertexvalue = 085 + 015 insum foreach edge in vertexoutedges edgevalue = vertexvalue vertexnum_outedges

18

Computational Setting

ConstraintsA Not enough memory to store the whole

graph in memory nor all the vertex values

B Enough memory to store one vertex and its edges w associated values

19

The Main Challenge of Disk-based Graph Computation

Random Access

~ 100K reads sec (commodity)~ 1M reads sec (high-end arrays)

ltlt 5-10 M random edges sec to achieve ldquoreasonable performancerdquo

20

Random Access Problem

File edge-values

A in-edges A out-edges

A B

B in-edges B out-edges

x

Moral You can either access in- or out-edges sequentially but not both

Random write

Random readProcessing sequentially

21

Our Solution

Parallel Sliding Windows (PSW)

22

Parallel Sliding Windows Phases

bull PSW processes the graph one sub-graph a time

bull In one iteration the whole graph is processedndash And typically next iteration is started

1 Load

2 Compute

3 Write

23

bull Vertices are numbered from 1 to nndash P intervalsndash sub-graph = interval of vertices

PSW Shards and Intervals

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 10000100 700

1 Load

2 Compute

3 Write

ltltpartition-bygtgt

source destinationedge

In shards edges sorted by source

24

Example Layout

Shard 1

Shards small enough to fit in memory balance size of shards

Shard in-edges for interval of vertices sorted by source-id

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Shard 2 Shard 3 Shard 4Shard 1

1 Load

2 Compute

3 Write

25

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Load all in-edgesin memory

Load subgraph for vertices 1100

What about out-edges Arranged in sequence in other shards

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Shard 1

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

bull Bulk-Synchronous Parallel (Jacobi iterations)ndash Updates see neighborsrsquo values from

previous iteration [Most systems are synchronous]

bull Asynchronous (Gauss-Seidel iterations)ndash Updates see most recent valuesndash GraphLab is asynchronous

30

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW runs Gauss-Seidel

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Fresh values in this ldquowindowrdquo from previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph diameter ndashmany iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N - 1) (e ndash 1) asymp 60 of synchronous

Each vertex chooses minimum label of neighbor

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks / Multicore

Experiment on MacBook Pro with 4 cores / SSD

• Computationally intensive applications benefit substantially from parallel execution

• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime in seconds vs. number of threads (1, 2, 4), split into loading and computation, for Matrix Factorization (ALS, 0–160 s) and Connected Components (0–1000 s)]

104

In-memory vs Disk

105

Experiment: Query Latency

See thesis for I/O cost analysis of in/out queries

106

Example: Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S

• Very fast in GraphChi-DB
– Sufficient to query for out-edges (see the sketch below)
– Parallelizes well: multi-out-edge-query

• Can be used for statistical graph analysis
– Sample induced neighborhoods, induced FoF neighborhoods from the graph
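
A minimal sketch of why out-edge queries suffice, written in illustrative Python rather than the actual GraphChi-DB API; out_edges below is an assumed helper that returns the destination IDs of one vertex.

def induced_subgraph(S, out_edges):
    """Return all edges (u, v) with both endpoints in the vertex set S.

    S         : iterable of vertex IDs
    out_edges : function vertex_id -> iterable of destination IDs
                (hypothetical stand-in for a multi-out-edge query)
    """
    S = set(S)
    edges = []
    for u in S:                      # one out-edge query per vertex; parallelizes well
        for v in out_edges(u):
            if v in S:               # keep only edges that stay inside S
                edges.append((u, v))
    return edges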

Vertices / Nodes

• Vertices are partitioned similarly to edges
– Similar "data shards" for columns
• Lookup/update of vertex data is O(1) (see the sketch below)
• No merge tree here: vertex files are "dense"
– Sparse structure could be supported

[Figure: vertex files split into intervals 1…a, a+1…2a, …, up to Max-id]
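
Because the vertex files are dense, a lookup is just an offset computation. The following is a hedged sketch with a made-up record size, not the actual GraphChi-DB file format.

import struct

RECORD_SIZE = 4          # assumption: one 4-byte float value per vertex

def read_vertex_value(f, interval_start, internal_id):
    """O(1) lookup in a dense vertex file: seek straight to the record."""
    f.seek((internal_id - interval_start) * RECORD_SIZE)
    return struct.unpack("<f", f.read(RECORD_SIZE))[0]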

ID-mapping

• Vertex IDs are mapped to internal IDs to balance the shards
– Interval length constant a (an illustrative mapping sketch follows the figure)

[Figure: mapping between original IDs (0, 1, 2, …, 255; 256, 257, 258, …, 511; …) and internal IDs (0, 1, 2, …; a, a+1, a+2, …; 2a, 2a+1, …)]
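
One way to realize such a balancing map is to interleave consecutive original IDs across the intervals. This is only a guess at the scheme suggested by the figure, not the exact GraphChi-DB mapping; P and a stand for the number of intervals and the interval length.

def to_internal(orig_id, a, P):
    """Consecutive original IDs land in different intervals of length a,
    which keeps the shards balanced."""
    return (orig_id % P) * a + orig_id // P

def to_original(internal_id, a, P):
    """Inverse mapping, needed when reporting results back to the user."""
    interval, slot = divmod(internal_id, a)
    return slot * P + interval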

109

What if we have a Cluster

[Figure: graph working memory (PSW) on disk; several computational states (Computation 1 state, Computation 2 state, …) held in RAM]

Trade latency for throughput

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications

2. … which is caused by the lack of good data available to academics: big graphs with metadata
– Industry co-operation? But there is a problem with reproducibility
– Also, it is hard to ask good questions about graphs (especially with just the structure)

3. Too much focus on performance? More important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course); a runnable toy version follows the pseudocode:

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())
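
A runnable toy version of this per-walk loop, for illustration only: the graph is a plain in-memory adjacency dict, not a GraphChi structure.

import random

def simulate_walks(adj, sources, numsteps):
    """adj: dict vertex -> list of neighbors (whole graph held in memory).
    Starts one walk per source and returns the final vertex of each walk."""
    positions = list(sources)
    for w in range(len(positions)):          # the slide's 'parfor' over walks
        for _ in range(numsteps):
            neighbors = adj.get(positions[w], [])
            if not neighbors:                # dead end: stop this walk
                break
            positions[w] = random.choice(neighbors)
    return positions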

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems: "paging" from disk is costly.

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm – reverse thinking (a runnable toy sketch follows below):

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk
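
The same idea as a runnable toy sketch. The names and the bucketing scheme are assumptions for illustration; only the current position of each walk is kept, and the graph is streamed one interval at a time rather than held in memory.

import random
from collections import defaultdict

def drunkardmob_step(walks_by_interval, load_interval_adj, interval_of, num_intervals):
    """Advance every walk by one hop, processing one vertex interval at a time.

    walks_by_interval : dict interval -> list of (walk_id, current_vertex)
    load_interval_adj : function interval -> dict vertex -> list of neighbors
                        (hypothetical stand-in for loading one PSW interval from disk)
    interval_of       : function vertex -> interval index
    """
    next_buckets = defaultdict(list)
    for p in range(num_intervals):
        adj = load_interval_adj(p)                 # only this interval is in memory
        for walk_id, v in walks_by_interval.get(p, []):
            neighbors = adj.get(v, [])
            nxt = random.choice(neighbors) if neighbors else v
            next_buckets[interval_of(nxt)].append((walk_id, nxt))
    return next_buckets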

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • "Gauss-Seidel" Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW & External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data & Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref: O'Neil, Cheng et al. (1996)
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact "Big Data" Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 17: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

17

Data

Data Data

Data

DataDataData

Data

Data

Data

Vertex-centric Programming

bull ldquoThink like a vertexrdquobull Popularized by the Pregel and GraphLab

projectsndash Historically systolic computation and the Connection Machine

MyFunc(vertex) modify neighborhood Data

Data Data

Data

Data

function Pagerank(vertex) insum = sum(edgevalue for edge in vertexinedges) vertexvalue = 085 + 015 insum foreach edge in vertexoutedges edgevalue = vertexvalue vertexnum_outedges

18

Computational Setting

ConstraintsA Not enough memory to store the whole

graph in memory nor all the vertex values

B Enough memory to store one vertex and its edges w associated values

19

The Main Challenge of Disk-based Graph Computation

Random Access

~ 100K reads sec (commodity)~ 1M reads sec (high-end arrays)

ltlt 5-10 M random edges sec to achieve ldquoreasonable performancerdquo

20

Random Access Problem

File edge-values

A in-edges A out-edges

A B

B in-edges B out-edges

x

Moral You can either access in- or out-edges sequentially but not both

Random write

Random readProcessing sequentially

21

Our Solution

Parallel Sliding Windows (PSW)

22

Parallel Sliding Windows Phases

bull PSW processes the graph one sub-graph a time

bull In one iteration the whole graph is processedndash And typically next iteration is started

1 Load

2 Compute

3 Write

23

bull Vertices are numbered from 1 to nndash P intervalsndash sub-graph = interval of vertices

PSW Shards and Intervals

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 10000100 700

1 Load

2 Compute

3 Write

ltltpartition-bygtgt

source destinationedge

In shards edges sorted by source

24

Example Layout

Shard 1

Shards small enough to fit in memory balance size of shards

Shard in-edges for interval of vertices sorted by source-id

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Shard 2 Shard 3 Shard 4Shard 1

1 Load

2 Compute

3 Write

25

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Load all in-edgesin memory

Load subgraph for vertices 1100

What about out-edges Arranged in sequence in other shards

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Shard 1

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

bull Bulk-Synchronous Parallel (Jacobi iterations)ndash Updates see neighborsrsquo values from

previous iteration [Most systems are synchronous]

bull Asynchronous (Gauss-Seidel iterations)ndash Updates see most recent valuesndash GraphLab is asynchronous

30

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW runs Gauss-Seidel

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Fresh values in this ldquowindowrdquo from previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph diameter ndashmany iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N - 1) (e ndash 1) asymp 60 of synchronous

Each vertex chooses minimum label of neighbor

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
– Store billions of random walk states in RAM
• Multiple Breadth-First Searches (see the sketch below)
– Analyze neighborhood sizes by starting hundreds of random BFSes
• Compute in parallel many different recommender algorithms (or with different parameterizations)
– See Mayank Mohta & Shu-Hao Yu's Master's project

84
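As an illustration of keeping many O(V) states in RAM, a minimal multi-BFS sketch in Java (illustrative code, not the actual GraphChi program): each vertex carries one bit per BFS source, so up to 64 searches advance together in each pass over the edges; in GraphChi the inner edge loop would be one PSW pass over the shards.

class MultiBFS {
    // outEdges[v] = out-neighbors of v; sources = start vertices (at most 64 here)
    static long[] run(int[][] outEdges, int[] sources, int maxRounds) {
        int n = outEdges.length;
        long[] reached = new long[n];                     // bit i set = vertex reached by BFS i
        for (int i = 0; i < sources.length; i++) reached[sources[i]] |= 1L << i;
        for (int round = 0; round < maxRounds; round++) {
            boolean changed = false;
            for (int v = 0; v < n; v++) {                 // in GraphChi: one pass over all shards
                for (int w : outEdges[v]) {
                    long add = reached[v] & ~reached[w];  // sources that reach v but not yet w
                    if (add != 0) { reached[w] |= add; changed = true; }
                }
            }
            if (!changed) break;
        }
        return reached;   // neighborhood size of BFS i = number of vertices with bit i set
    }
}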

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin). UAI 2010
Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors). VLDB 2012
GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch). OSDI 2012
DrunkardMob: Billions of Random Walks on Just a PC. ACM RecSys 2013
Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch). SEA 2014
GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin). (submitted, arXiv)
Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin). ICML 2011

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation → GraphChi
• Extended PSW to design Partitioned Adjacency Lists, used to build a scalable graph database → GraphChi-DB
• Proposed DrunkardMob for simulating billions of random walks in parallel
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms → new approach for external-memory graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                      GraphChi (40 Mac Minis)    PowerGraph (64 EC2 cc1.4xlarge)
Investments           67,320 $                   -
Operating costs:
  Per node / hour     0.03 $                     1.30 $
  Cluster / hour      1.19 $                     52.00 $
  Daily               28.56 $                    1,248.00 $

It takes about 56 days to recoup the Mac Mini investments (savings ≈ 1,248.00 $ - 28.56 $ ≈ 1,219 $ / day; 67,320 $ / 1,219 $ ≈ 55 days).

Assumptions:
- Mac Mini: 85 W (typical servers 500-1000 W)
- Most expensive US energy: 35c / kWh
- Equal-throughput configurations (based on OSDI'12)

89

PSW for In-memory Computation

• External memory setting
– Slow memory = hard disk / SSD
– Fast memory = RAM

• In-memory
– Slow = RAM
– Fast = CPU caches

[Diagram: small (fast) memory of size M; large (slow) memory of size 'unlimited'; block transfers of size B between them and the CPU]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra*, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

* Julian Shun, Guy Blelloch, PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
– But needs twice the amount of space
• Async / Gauss-Seidel helps with high-diameter graphs
• Some algorithms converge much better asynchronously
– Loopy BP, see Gonzalez et al. (2009)
– Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)
• Asynchronous is sometimes difficult to reason about and debug
• Asynchronous can be used to implement BSP

IO Complexity

• See the paper for theoretical analysis in the Aggarwal-Vitter I/O model
– Worst-case only 2x best-case

• Intuition:

[Diagram: vertex range 1 … |V| split into interval(1) … interval(P) with shard(1) … shard(P); two example edges, one inside a single interval and one spanning two intervals]

– An edge within one interval is loaded from disk only once per iteration
– An edge spanning two intervals is loaded twice per iteration

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter – otherwise they would require too many iterations
• Per-iteration cost is not much affected
• Can we optimize partitioning?
– Could help thanks to Gauss-Seidel (faster convergence inside "groups"), topological sort
– Likely too expensive to do on a single PC

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
– But they require graph partitioning, which requires a lot of memory
– Compression of large graphs can take days (personal communication)
• Compression is problematic for evolving graphs and associated data
• GraphChi can be used to compress graphs!
– Layered Label Propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• 1990s and 2000s saw interest in object-oriented and graph databases
– GOOD, GraphDB, HyperGraphDB…
– Focus was on modeling; graph storage on top of a relational DB or key-value store
• RDF databases
– Most do not use graph storage but store triples as relations + use indexing
• Modern solutions have proposed graph-specific storage
– Neo4j: doubly linked list
– TurboGraph: adjacency list chopped into pages
– DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time: 9 hours; FB's: 12 hours
• GraphChi database about 250 GB; FB > 1.4 terabytes
– However, about 100 GB is explained by different variable data (payload) size
• Facebook: MySQL via JDBC; GraphChi: embedded
– But MySQL is native code, GraphChi-DB Scala (JVM)
• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance / Size of DB

[Chart: requests / sec (0 to 10,000) as a function of the number of nodes (1.0E+07, 1.0E+08, 2.0E+08, 1.0E+09)]

Why not even faster when everything is in RAM? → computationally bound requests

Possible Solutions

1. Use SSD as a memory extension? [SSDAlloc, NSDI'11]
   → Too many small objects; need millions / sec
2. Compress the graph structure to fit into RAM? [WebGraph framework]
   → Associated values do not compress well, and are mutated
3. Cluster the graph and handle each cluster separately in RAM?
   → Expensive! The number of inter-cluster edges is big
4. Caching of hot nodes?
   → Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance

Multiple hard drives ("RAID-ish")

• GraphChi supports striping shards to multiple disks → parallel I/O

[Chart: seconds / iteration (0 to 3,000) for Pagerank and Connected components with 1, 2, and 3 hard drives]

Experiment on a 16-core AMD server (from year 2007).

Bottlenecks

[Chart: runtime (0 to 2,500 s) broken into Disk IO, Graph construction, and Exec. updates for 1, 2, and 4 threads; Connected Components on Mac Mini (SSD)]

• Cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
– Graph construction requires a lot of random access in RAM → memory bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime in seconds, split into Loading and Computation, vs. number of threads (1, 2, 4) for Matrix Factorization (ALS, 0-160 s) and Connected Components (0-1,000 s)]

104

In-memory vs Disk

105

Experiment Query latency

See the thesis for an I/O cost analysis of in/out queries.

106

Example Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S
• Very fast in GraphChi-DB
– Sufficient to query for out-edges (see the sketch below)
– Parallelizes well: multi-out-edge-query
• Can be used for statistical graph analysis
– Sample induced neighborhoods, induced FoF neighborhoods from the graph
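A sketch of why out-edge queries suffice (the Database interface below is hypothetical, not GraphChi-DB's real API): every edge of the induced subgraph leaves some vertex of S, so querying the out-edges of each vertex in S and filtering by membership recovers all of them.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class InducedSubgraphQuery {
    interface Database { long[] outNeighbors(long vertexId); }   // hypothetical out-edge query

    static List<long[]> induced(Database db, Set<Long> s) {
        List<long[]> edges = new ArrayList<>();
        for (long v : s) {        // in GraphChi-DB this could be issued as one multi-out-edge query
            for (long w : db.outNeighbors(v)) {
                if (s.contains(w)) edges.add(new long[]{v, w});   // keep edges with both endpoints in S
            }
        }
        return edges;
    }
}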

Vertices / Nodes

• Vertices are partitioned similarly to edges
– Similar "data shards" for columns
• Lookup / update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
– Sparse structure could be supported

[Diagram: vertex id space split into intervals 1 … a, a+1 … 2a, …, up to Max-id]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance shards
– Interval length constant a (one possible mapping is sketched below)

[Diagram: internal ID intervals 0,1,2,…; a,a+1,a+2,…; 2a,2a+1,…; original IDs 0 1 2 … 255 and 256 257 258 … 511 shown mapped round-robin across the intervals]

109
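One way to realize such a balanced mapping is to deal original IDs round-robin across the intervals. The formula below is an assumption made for illustration (the thesis defines the exact scheme), but it reproduces the pattern in the figure: with 256 intervals, originals 0, 1, 2, … map to internal IDs 0, a, 2a, …, and originals 256, 257, … to 1, a+1, ….

class IdMapping {
    final long numIntervals;    // number of intervals (256 in the figure) -- illustrative constant
    final long intervalLength;  // the constant a

    IdMapping(long numIntervals, long intervalLength) {
        this.numIntervals = numIntervals;
        this.intervalLength = intervalLength;
    }
    long toInternal(long originalId) {              // assumed round-robin scheme
        long interval = originalId % numIntervals;  // which interval the vertex lands in
        long offset   = originalId / numIntervals;  // its position inside that interval
        return interval * intervalLength + offset;
    }
    long toOriginal(long internalId) {              // inverse of the mapping above
        long interval = internalId / intervalLength;
        long offset   = internalId % intervalLength;
        return offset * numIntervals + interval;
    }
}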

What if we have a Cluster

[Diagram: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state) in RAM]

Trade latency for throughput!

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications
2. … which is caused by a lack of good data available to academics: big graphs with metadata
– Industry co-operation → but problems with reproducibility
– Also, it is hard to ask good questions about graphs (especially with just the structure)
3. Too much focus on performance → more important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())
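The same loop as a runnable Java sketch (plain arrays stand in for the walk and graph objects of the pseudocode; the names are illustrative):

import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

class InMemoryWalks {
    // outEdges[v] = out-neighbors of v; walkAt[w] = current vertex of walk w
    static void run(int[][] outEdges, int[] walkAt, int numSteps) {
        IntStream.range(0, walkAt.length).parallel().forEach(w -> {   // 'parfor walk in walks'
            for (int i = 0; i < numSteps; i++) {
                int[] nbrs = outEdges[walkAt[w]];
                if (nbrs.length == 0) break;                          // dead end: stop this walk
                walkAt[w] = nbrs[ThreadLocalRandom.current().nextInt(nbrs.length)];
            }
        });
    }
}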

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java, 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems: "paging" from disk is costly.

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm – reverse thinking:

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: need to store only the current position of each walk!
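A hedged sketch of the same reversed iteration in Java (illustrative data structures, not DrunkardMob's actual implementation): walks are bucketed by the interval that contains their current vertex, so while PSW has interval p's subgraph in memory, every walk currently sitting in that interval takes its next hop.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class DrunkardMobSketch {
    // walkAt[w] = current vertex of walk w; outEdges stands in for the loaded interval's subgraph
    static void iteration(int[][] outEdges, int[] walkAt, int intervalLength) {
        int numIntervals = (outEdges.length + intervalLength - 1) / intervalLength;
        // Snapshot: group walk ids by the interval of their current position.
        List<List<Integer>> byInterval = new ArrayList<>();
        for (int p = 0; p < numIntervals; p++) byInterval.add(new ArrayList<>());
        for (int w = 0; w < walkAt.length; w++)
            byInterval.get(walkAt[w] / intervalLength).add(w);

        for (int p = 0; p < numIntervals; p++) {           // PSW would load interval p from disk here
            for (int w : byInterval.get(p)) {
                int[] nbrs = outEdges[walkAt[w]];
                if (nbrs.length == 0) continue;             // dead end: the walk stays put
                walkAt[w] = nbrs[ThreadLocalRandom.current().nextInt(nbrs.length)];  // addHop
            }
        }
    }
}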

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • "Gauss-Seidel" Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW & External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM), Ref: O'Neil, Cheng et al. (1996)
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact "Big Data" Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 19: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

19

The Main Challenge of Disk-based Graph Computation

Random Access

~ 100K reads sec (commodity)~ 1M reads sec (high-end arrays)

ltlt 5-10 M random edges sec to achieve ldquoreasonable performancerdquo

20

Random Access Problem

File edge-values

A in-edges A out-edges

A B

B in-edges B out-edges

x

Moral You can either access in- or out-edges sequentially but not both

Random write

Random readProcessing sequentially

21

Our Solution

Parallel Sliding Windows (PSW)

22

Parallel Sliding Windows Phases

bull PSW processes the graph one sub-graph a time

bull In one iteration the whole graph is processedndash And typically next iteration is started

1 Load

2 Compute

3 Write

23

bull Vertices are numbered from 1 to nndash P intervalsndash sub-graph = interval of vertices

PSW Shards and Intervals

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 10000100 700

1 Load

2 Compute

3 Write

ltltpartition-bygtgt

source destinationedge

In shards edges sorted by source

24

Example Layout

Shard 1

Shards small enough to fit in memory balance size of shards

Shard in-edges for interval of vertices sorted by source-id

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Shard 2 Shard 3 Shard 4Shard 1

1 Load

2 Compute

3 Write

25

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Load all in-edgesin memory

Load subgraph for vertices 1100

What about out-edges Arranged in sequence in other shards

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Shard 1

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

bull Bulk-Synchronous Parallel (Jacobi iterations)ndash Updates see neighborsrsquo values from

previous iteration [Most systems are synchronous]

bull Asynchronous (Gauss-Seidel iterations)ndash Updates see most recent valuesndash GraphLab is asynchronous

30

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW runs Gauss-Seidel

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Fresh values in this ldquowindowrdquo from previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph diameter ndashmany iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N - 1) (e ndash 1) asymp 60 of synchronous

Each vertex chooses minimum label of neighbor

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench: Online Graph DB Benchmark by Facebook

• Concurrent read/write workload – But only single-hop queries ("friends") – 8 different operations, mixed workload – Best performance with 64 parallel threads

• Each edge and vertex has – Version, timestamp, type, random string payload

GraphChi-DB (Mac Mini) vs. MySQL + FB patch (server: SSD-array, 144 GB RAM)

Edge-update (95p): 22 ms vs. 25 ms

Edge-get-neighbors (95p): 18 ms vs. 9 ms

Avg throughput: 2,487 req/s vs. 11,029 req/s

Database size: 350 GB vs. 1.4 TB

See full results in the thesis

69

Summary of Experiments

• Efficient for mixed read/write workload – See Facebook LinkBench experiments in the thesis – LSM-tree trade-off: read performance (but can adjust)

• State-of-the-art performance for graphs that are much larger than RAM – Neo4j's linked-list data structure good for RAM – DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater Impact / Hindsight

Future Research Questions

71

GREATER IMPACT

72

Impact: "Big Data" Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

– Two major direct descendant papers in top conferences: • X-Stream: SOSP 2013 • TurboGraph: KDD 2013

• Challenging the mainstream – You can do a lot on just a PC: focus on the right data structures and computational models

73

Impact: Users

• GraphChi's (C++, Java, -DB) have gained a lot of users – Currently ~50 unique visitors / day

• Enables 'everyone' to tackle big graph problems – Especially the recommender toolkit (by Danny Bickson) has been very popular – Typical users: students, non-systems researchers, small companies…

74

Impact: Users (cont.)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research - it is too expensive.

Using GraphChi I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research ()"

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for?

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models

1. Changing the value of an edge (both in- and out-)

2. Computation processes the whole or most of the graph on each iteration

3. Random access to all of a vertex's edges – Vertex-centric vs. edge-centric

4. Async / Gauss-Seidel execution

u v

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies – Or dynamic scheduling (such as Splash BP)

• High-diameter graphs, such as planar graphs – Unless the computation itself has short-range interactions

• Very large number of iterations – Neural networks – LDA with Collapsed Gibbs sampling

• No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle Counting / Item-Item Similarity

PageRank / SALSA / HITS

Graph Contraction / Minimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected Components / Strongly Connected Components

Community Detection / Label Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief Propagation / Co-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

• Distributed Setting 1. Distributed PSW (one shard / node)

1. PSW is inherently sequential 2. Low bandwidth in the Cloud

2. Co-operating GraphChi(-DB)'s connected with a Parameter Server

• New graph programming models and tools – Vertex-centric programming sometimes too local: for example, two-hop interactions and many traversals are cumbersome – Abstractions for learning graph structure? Implicit graphs? – Hard to debug, especially async. Better tools needed

• Graph-aware optimizations to GraphChi-DB – Buffer management – Smart caching – Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The IO performance of PSW is only weakly affected by the amount of RAM – Good: works with very little memory – Bad: does not benefit from more memory

• Simple trick: cache some data

• Many graph algorithms have O(V) state – Update function accesses neighbor vertex state • Standard PSW: 'broadcast' vertex value via edges • Semi-external: store vertex values in memory (a minimal sketch follows below)

82
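A minimal sketch of the semi-external trick, assuming a plain-text edge list on disk with one "src dst" pair per line (not GraphChi's binary shard format): the O(V) per-vertex state lives in an in-memory array, while the O(E) edges are only ever streamed.

```python
import array

def weakly_connected_components(edge_file, num_vertices):
    label = array.array("l", range(num_vertices))   # O(V) state kept in RAM
    changed = True
    while changed:
        changed = False
        with open(edge_file) as f:                  # edges are streamed, never held in RAM
            for line in f:
                src, dst = map(int, line.split())
                lo = min(label[src], label[dst])
                if label[src] != lo or label[dst] != lo:
                    label[src] = lo                 # update visible immediately
                    label[dst] = lo                 # (Gauss-Seidel flavor)
                    changed = True
    return label
```

For example, 100M vertices need only a few hundred megabytes of labels, even when the edge file is far larger than RAM.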

Using RAM efficiently

• Assume that there is enough RAM to store many O(V) algorithm states in memory – But not enough to store the whole graph

Graph working memory (PSW): disk

computational state:

Computation 1 state / Computation 2 state

RAM

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5) – Store billions of random-walk states in RAM

• Multiple Breadth-First Searches – Analyze neighborhood sizes by starting hundreds of random BFSes

• Compute in parallel many different recommender algorithms (or with different parameterizations) – See Mayank Mohta & Shu-Hao Yu's Master's project

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin)

UAI 2010

Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch)

OSDI 2012

DrunkardMob: Billions of Random Walks on Just a PC. ACM RecSys 2013

Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch)

SEA 2014

GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin)

(submitted, arXiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external memory graph computation: GraphChi

• Extended PSW to design Partitioned Adjacency Lists, to build a scalable graph database: GraphChi-DB

• Proposed DrunkardMob for simulating billions of random walks in parallel

• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms: new approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) vs. PowerGraph (64 EC2 cc1.4xlarge)

Investments: 67,320 $ vs. -

Operating costs:

Per node, hour: 0.03 $ vs. 1.30 $

Cluster, hour: 1.19 $ vs. 52.00 $

Daily: 28.56 $ vs. 1,248.00 $

It takes about 56 days to recoup the Mac Mini investments

Assumptions: - Mac Mini: 85 W (typical servers 500-1000 W) - Most expensive US energy: 35c / kWh

Equal-throughput configurations (based on OSDI'12)

89

PSW for In-memory Computation

• External memory setting – Slow memory = hard disk / SSD – Fast memory = RAM

• In-memory – Slow = RAM – Fast = CPU caches

Small (fast) Memory, size M

Large (slow) Memory, size 'unlimited'

B

CPU

block transfers

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems

Julian Shun, Guy Blelloch. PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel – But needs twice the amount of space

• Async / G-S helps with high-diameter graphs
• Some algorithms converge much better asynchronously – Loopy BP, see Gonzalez et al. (2009) – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)

• Asynchronous sometimes difficult to reason about and debug

• Asynchronous can be used to implement BSP

IO Complexity

• See the paper for theoretical analysis in the Aggarwal-Vitter I/O model – Worst-case only 2x best-case

• Intuition:

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once / iteration

Edge spanning intervals is loaded twice / iteration

93
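A back-of-the-envelope count matching this intuition (a sketch in the block-transfer model, not the paper's exact statement): with block size B, every edge value is read once or twice and written back the same number of times per iteration, so

```latex
\[
  \frac{2|E|}{B} \;\le\; \text{block transfers per iteration} \;\le\; \frac{4|E|}{B},
  \qquad \text{plus } \Theta(P^2) \text{ non-sequential seeks,}
\]
```

which is why the worst case is only about 2x the best case.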

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise they would require too many iterations

• Per-iteration cost not much affected

• Can we optimize partitioning? – Could help thanks to Gauss-Seidel (faster convergence inside "groups"): topological sort – Likely too expensive to do on a single PC

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social) – But require graph partitioning: requires a lot of memory – Compression of large graphs can take days (personal communication)

• Compression problematic for evolving graphs, and associated data

• GraphChi can be used to compress graphs – Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• 1990s, 2000s saw interest in object-oriented and graph databases – GOOD, GraphDB, HyperGraphDB… – Focus was on modeling; graph storage on top of relational DB or key-value store

• RDF databases – Most do not use graph storage but store triples as relations + use indexing

• Modern solutions have proposed graph-specific storage – Neo4j: doubly linked list – TurboGraph: adjacency list chopped into pages – DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont.)

• GraphChi load time: 9 hours, FB's: 12 hours
• GraphChi database about 250 GB, FB > 1.4 terabytes – However, about 100 GB explained by different variable data (payload) size

• Facebook: MySQL via JDBC; GraphChi: embedded – But MySQL is native code, GraphChi-DB Scala (JVM)

• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance / Size of DB

[Chart: requests / sec (0 to 10,000) as a function of the number of nodes (1.0E+07 to 1.0E+09)]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory-extension [SSDAlloc, NSDI'11] – Too many small objects: need millions / sec

2. Compress the graph structure to fit into RAM [WebGraph framework] – Associated values do not compress well, and are mutated

3. Cluster the graph and handle each cluster separately in RAM – Expensive! The number of inter-cluster edges is big

4. Caching of hot nodes – Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks: parallel IO

[Chart: seconds / iteration (0 to 3,000) for Pagerank and Connected components, with 1, 2, and 3 hard drives]

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

[Chart: runtime in seconds (0 to 2,500) for 1, 2, and 4 threads, broken down into Disk IO, Graph construction, and Exec. updates]

Connected Components on Mac Mini SSD

• Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSD – Graph construction requires a lot of random access in RAM: memory bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores, SSD

• Computationally intensive applications benefit substantially from parallel execution

• GraphChi saturates SSD IO with 2 threads

[Charts: runtime in seconds vs. number of threads (1, 2, 4), split into Loading and Computation, for Matrix Factorization (ALS) (0 to 160 s) and Connected Components (0 to 1,000 s)]

104

In-memory vs Disk

105

Experiment: Query latency

See thesis for IO cost analysis of in/out queries

106

Example Induced Subgraph Queries

• Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

• Very fast in GraphChi-DB – Sufficient to query for out-edges – Parallelizes well: multi-out-edge-query (a small sketch follows below)

• Can be used for statistical graph analysis – Sample induced neighborhoods, induced FoF neighborhoods from graph
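A small sketch of that idea; the out_edges callback name is hypothetical and stands in for a (multi-)out-edge query against the database. We query the out-edges of every vertex in S and keep only those whose destination is also in S.

```python
def induced_subgraph(S, out_edges):
    """Return all edges with both endpoints in S, using out-edge queries only."""
    s = set(S)
    result = []
    for v in s:                           # one out-edge query per vertex; parallelizable
        for src, dst in out_edges(v):
            if dst in s:                  # both endpoints inside S -> keep the edge
                result.append((src, dst))
    return result

# Toy adjacency standing in for the database:
toy = {1: [(1, 2), (1, 5)], 2: [(2, 3)], 5: [(5, 1)]}
print(sorted(induced_subgraph({1, 2, 5}, lambda v: toy.get(v, []))))
# [(1, 2), (1, 5), (5, 1)] -- the edge (2, 3) is dropped because 3 is not in S
```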

Vertices / Nodes

• Vertices are partitioned similarly as edges – Similar "data shards" for columns

• Lookup/update of vertex data is O(1)
• No merge tree here: vertex files are "dense" – Sparse structure could be supported

[Diagram: vertex files laid out over internal IDs 1 … a, a+1 … 2a, …, Max-id]

ID-mapping

• Vertex IDs mapped to internal IDs to balance shards – Interval length constant a (an illustrative mapping is sketched below)

[Diagram: original IDs 0, 1, 2, …, 255 and 256, 257, 258, …, 511 mapped to internal IDs 0, 1, 2, …; a, a+1, a+2, …; 2a, 2a+1, …]

109
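Purely as an illustration of such a balancing map (the round-robin formula below is an assumption for the sketch, not necessarily the exact scheme of the figure): consecutive blocks of original IDs are spread across the P intervals of constant length a, and the map stays invertible.

```python
def to_internal(orig, a, P):
    """Interval = orig % P, slot inside that interval = orig // P (requires orig < a * P)."""
    return (orig % P) * a + orig // P

def to_original(internal, a, P):
    return (internal % a) * P + internal // a

a, P = 1 << 20, 16
for orig in (0, 1, 2, 255, 256, 511):
    assert to_original(to_internal(orig, a, P), a, P) == orig
```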

What if we have a Cluster?

Graph working memory (PSW) disk

computational state

Computation 1 state / Computation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications

2. … which is caused by lack of good data available for academics: big graphs with metadata – Industry co-operation? But problem with reproducibility – Also, it is hard to ask good questions about graphs (especially with just structure)

3. Too much focus on performance. More important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
  for i = 1 to numsteps:
    vertex = walk.atVertex()
    walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java 2009

Distributed graph systems: - Each hop across a partition boundary is costly

Disk-based "single-machine" graph systems: - "Paging" from disk is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm – Reverse thinking

ForEach interval p:
  walkSnapshot = getWalksForInterval(p)
  ForEach vertex in interval(p):
    mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
    ForEach walk in mywalks:
      walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk
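An expanded, runnable toy version of the loop above, purely illustrative and entirely in memory; the real system streams each interval's sub-graph from disk while only the walks' current positions stay in RAM.

```python
import random
from collections import defaultdict

def drunkardmob(out_edges, intervals, walks, num_steps):
    """out_edges: vertex -> list of neighbors; intervals: list of vertex lists;
    walks: walk_id -> current vertex. Advances every walk num_steps hops."""
    interval_of = {v: p for p, verts in enumerate(intervals) for v in verts}
    for _ in range(num_steps):
        # Reverse thinking: bucket walks by the interval of the vertex they sit on.
        buckets = defaultdict(list)
        for walk_id, v in walks.items():
            buckets[interval_of[v]].append(walk_id)
        for p in range(len(intervals)):              # "load" one interval at a time
            for walk_id in buckets[p]:
                v = walks[walk_id]
                neighbors = out_edges.get(v, [])
                if neighbors:
                    walks[walk_id] = random.choice(neighbors)   # store only the position
    return walks

graph = {1: [2, 3], 2: [3], 3: [1]}
print(drunkardmob(graph, intervals=[[1, 2], [3]],
                  walks={w: 1 for w in range(5)}, num_steps=10))
```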

Page 20: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

20

Random Access Problem

File edge-values

A in-edges A out-edges

A B

B in-edges B out-edges

x

Moral You can either access in- or out-edges sequentially but not both

Random write

Random readProcessing sequentially

21

Our Solution

Parallel Sliding Windows (PSW)

22

Parallel Sliding Windows Phases

bull PSW processes the graph one sub-graph a time

bull In one iteration the whole graph is processedndash And typically next iteration is started

1 Load

2 Compute

3 Write

23

bull Vertices are numbered from 1 to nndash P intervalsndash sub-graph = interval of vertices

PSW Shards and Intervals

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 10000100 700

1 Load

2 Compute

3 Write

ltltpartition-bygtgt

source destinationedge

In shards edges sorted by source

24

Example Layout

Shard 1

Shards small enough to fit in memory balance size of shards

Shard in-edges for interval of vertices sorted by source-id

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Shard 2 Shard 3 Shard 4Shard 1

1 Load

2 Compute

3 Write

25

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Load all in-edgesin memory

Load subgraph for vertices 1100

What about out-edges Arranged in sequence in other shards

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Shard 1

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

bull Bulk-Synchronous Parallel (Jacobi iterations)ndash Updates see neighborsrsquo values from

previous iteration [Most systems are synchronous]

bull Asynchronous (Gauss-Seidel iterations)ndash Updates see most recent valuesndash GraphLab is asynchronous

30

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW runs Gauss-Seidel

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Fresh values in this ldquowindowrdquo from previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph diameter ndashmany iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N - 1) (e ndash 1) asymp 60 of synchronous

Each vertex chooses minimum label of neighbor

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 21: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

21

Our Solution

Parallel Sliding Windows (PSW)

22

Parallel Sliding Windows Phases

bull PSW processes the graph one sub-graph a time

bull In one iteration the whole graph is processedndash And typically next iteration is started

1 Load

2 Compute

3 Write

23

bull Vertices are numbered from 1 to nndash P intervalsndash sub-graph = interval of vertices

PSW Shards and Intervals

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 10000100 700

1 Load

2 Compute

3 Write

ltltpartition-bygtgt

source destinationedge

In shards edges sorted by source

24

Example Layout

Shard 1

Shards small enough to fit in memory balance size of shards

Shard in-edges for interval of vertices sorted by source-id

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Shard 2 Shard 3 Shard 4Shard 1

1 Load

2 Compute

3 Write

25

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Load all in-edgesin memory

Load subgraph for vertices 1100

What about out-edges Arranged in sequence in other shards

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Shard 1

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

bull Bulk-Synchronous Parallel (Jacobi iterations)ndash Updates see neighborsrsquo values from

previous iteration [Most systems are synchronous]

bull Asynchronous (Gauss-Seidel iterations)ndash Updates see most recent valuesndash GraphLab is asynchronous

30

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW runs Gauss-Seidel

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Fresh values in this ldquowindowrdquo from previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph diameter ndashmany iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N - 1) (e ndash 1) asymp 60 of synchronous

Each vertex chooses minimum label of neighbor

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

(Same shard as before. Pointer-array of (source, file offset) pairs: (1, 0), (3, 3), (7, 5), …; edge-array of destinations: 8, 193, 76420, 12, 872, 193, 212, 89139, …)

52

PAL In-edge Linkage

Edge-array, augmented with a link per edge (offset of the next in-edge of the same destination vertex):

Destination   Link
8             3339
193           3
76420         1092
12            289
872           40
193           2002
212           12
89139         22
…             …

Pointer-array as before:

Source   File offset
1        0
3        3
7        5
…        …

+ Index to the first in-edge of each vertex in the interval.

Augmented linked list for in-edges (a small traversal sketch follows below).

Problem 2: How to find out-edges quickly?
Note: sorted inside a shard, but partitioned across all shards.
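A small Scala sketch of following the in-edge chain inside one shard; the array names and the sourceOf helper are illustrative, not GraphChi-DB's actual API:

  // destinations(k) : destination of edge k in this shard's edge-array
  // nextInOffset(k) : offset of the next in-edge of the same destination, or -1
  // firstInEdge     : offset of a vertex's first in-edge in this shard, -1 if none
  // sourceOf        : recovers an edge's source from the pointer-array (assumed helper)
  class InEdgeChains(destinations: Array[Long],
                     nextInOffset: Array[Int],
                     firstInEdge: Map[Long, Int],
                     sourceOf: Int => Long) {

    // Sources of all in-edges of `vertex` stored in this shard.
    def inNeighbors(vertex: Long): List[Long] = {
      var result = List.empty[Long]
      var k = firstInEdge.getOrElse(vertex, -1)
      while (k >= 0) {
        result ::= sourceOf(k)   // source id recovered via the pointer-array
        k = nextInOffset(k)      // hop to the next in-edge of the same vertex
      }
      result
    }
  }

The extra link costs one offset per edge, but it keeps an in-edge query confined to a single shard per level.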

53

PAL Out-edge Queries

Pointer-array:

Source   File offset
1        0
3        3
7        5
…        …

Problem: the pointer-array can be big, O(V), and binary search on disk is slow.

Edge-array as (Destination, Next-in-offset) pairs: (8, 3339), (193, 3), (76420, 1092), (12, 289), (872, 40), (193, 2002), (212, 12), (89139, 22), …

Option 1: Sparse index over the pointer-array, kept in memory (diagram shows sampled entries 256, 512, 1024, …).
Option 2: Delta-coding with unary code (Elias-Gamma); the whole array then fits in memory.
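A minimal sketch of Option 1, assuming a hypothetical SparsePointerIndex that keeps every Nth pointer-array entry in RAM (Option 2 instead gamma-codes the gaps between consecutive source ids so the full array stays in memory):

  // Sketch of a sparse in-memory index over the on-disk pointer-array (illustrative API).
  class SparsePointerIndex(samples: Array[Long],   // every Nth source id, sorted
                           samplePos: Array[Long], // file position of that pointer-array entry
                           stride: Int) {

    // File position from which at most `stride` pointer-array entries have to be
    // scanned sequentially to locate `src` (or learn it stores no edges here).
    def scanStart(src: Long): Long = {
      var i = java.util.Arrays.binarySearch(samples, src)
      if (i < 0) i = -i - 2            // insertion point - 1 = last sample <= src
      if (i < 0) samplePos(0) else samplePos(i)
    }
  }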

54

Experiment: Indices

Median latency, Twitter graph (1.5B edges).

55

Queries: I/O costs

In-edge query: only one shard.
Out-edge query: each shard that has edges.

Trade-off: more shards gives better locality for in-edge queries, worse for out-edge queries.

Edge Data & Searches

Shard X: adjacency plus separate columns per property, e.g. 'weight' [float], 'timestamp' [long], 'belief' (factor).

Edge (i): no foreign key is required to find the edge data (fixed size per column entry).

Note: vertex values are stored similarly.
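A sketch of the columnar idea, assuming one fixed-width file per edge column (names are illustrative): the i-th edge of a shard owns the i-th slot of every column file, so no foreign key or index is needed:

  import java.io.RandomAccessFile

  // Fixed-width edge column addressed directly by edge index (illustrative sketch).
  class FloatColumn(path: String) {
    private val file = new RandomAccessFile(path, "r")

    def read(edgeIndex: Long): Float = {
      file.seek(edgeIndex * 4L)   // 4 bytes per float value
      file.readFloat()
    }
  }

  // e.g. new FloatColumn("shard7.weight").read(i) would return the 'weight' of edge i of shard 7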

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading the existing shard from disk. Each edge is always rewritten on a merge.

Does not scale: number of rewrites = data size / buffer size = O(E).

60

Log-Structured Merge-tree (LSM). Ref: O'Neil, Cheng et al. (1996)

LEVEL 1 (youngest): top shards with in-memory buffers, one per interval (interval 1, interval 2, interval 3, interval 4, …, interval P-1, interval P). New edges enter here.
LEVEL 2: on-disk shards covering wider interval ranges (intervals 1..4, …, intervals (P-4)..P).
LEVEL 3 (oldest): still wider ranges (intervals 1..16, …).
Downstream merge: when a level fills, its shards are merged into the next, older level.

In-edge query: one shard on each level.
Out-edge query: all shards.
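A simplified Scala sketch of the LSM idea applied to edge buffers (not the real GraphChi-DB code: sorted in-memory vectors stand in for on-disk shards, and the in-edge query does a linear scan where the real structure searches one sorted run per level):

  case class Edge(src: Long, dst: Long)

  class EdgeLSM(bufferCapacity: Int, fanout: Int = 4) {
    private var buffer = Vector.empty[Edge]            // LEVEL 1: in-memory buffer
    private var levels = Vector.empty[Vector[Edge]]    // older levels, sorted by (dst, src)

    def insert(e: Edge): Unit = {
      buffer = buffer :+ e
      if (buffer.size >= bufferCapacity) flush()
    }

    private def flush(): Unit = {
      var run = buffer.sortBy(e => (e.dst, e.src)); buffer = Vector.empty
      var i = 0
      // Downstream merge: if level i would overflow its budget, merge it into the run
      // and continue into the next, larger level.
      while (i < levels.size &&
             levels(i).size + run.size > bufferCapacity * math.pow(fanout, i + 1)) {
        run = (levels(i) ++ run).sortBy(e => (e.dst, e.src))
        levels = levels.updated(i, Vector.empty[Edge])
        i += 1
      }
      if (i < levels.size) levels = levels.updated(i, (levels(i) ++ run).sortBy(e => (e.dst, e.src)))
      else levels = levels :+ run
    }

    // In-edge query touches one run per level (linear scan here for brevity).
    def inEdges(v: Long): Seq[Edge] = (buffer ++ levels.flatten).filter(_.dst == v)
  }

Each edge is rewritten only O(log E) times instead of O(E / buffer size) times, which is what makes online ingest scale.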

61

Experiment: Ingest

[Chart: edges ingested (0 to 1.5x10^9) vs. time in seconds (0 to 16,000): GraphChi-DB with LSM vs. GraphChi-No-LSM]

62

Advantages of PAL

• Only sparse and implicit indices
– Pointer-array usually fits in RAM with Elias-Gamma coding
– Small database size

• Columnar data model
– Load only the data you need
– Graph structure is separate from data
– Property graph model

• Great insertion throughput with LSM
– Tree can be adjusted to match the workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

• Written in Scala
• Queries & Computation
• Online database

All experiments shown in this talk were done on a Mac Mini (8 GB, SSD).

Source code and examples: http://github.com/graphchi

65

Comparison: Database Size

[Chart: database file size (twitter-2010 graph, 1.5B edges), 0–70 GB scale; systems: BASELINE (4 + 4 bytes / edge), GraphChi-DB, MySQL (data + indices), Neo4j]

66

Comparison: Ingest

System                   Time to ingest 1.5B edges
GraphChi-DB (ONLINE)     1 hour 45 mins
Neo4j (batch)            45 hours
MySQL (batch)            3 hours 30 minutes (including index creation)

If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes.

Comparison: Friends-of-Friends Query

See thesis for the shortest-path comparison.

[Charts: latency percentiles (milliseconds) over 100K random queries.
Small graph (68M edges), 50th and 99th percentile: GraphChi-DB vs. Neo4j.
Big graph (1.5B edges), 50th and 99th percentile: GraphChi-DB vs. Neo4j vs. MySQL.]

68

LinkBench: Online Graph DB Benchmark by Facebook

• Concurrent read/write workload
– But only single-hop queries ("friends")
– 8 different operations, mixed workload
– Best performance with 64 parallel threads

• Each edge and vertex has:
– Version, timestamp, type, random string payload

                          GraphChi-DB (Mac Mini)    MySQL + FB patch, server, SSD-array, 144 GB RAM
Edge-update (95p)         22 ms                     25 ms
Edge-get-neighbors (95p)  18 ms                     9 ms
Avg throughput            2,487 req/s               11,029 req/s
Database size             350 GB                    1.4 TB

See full results in the thesis.

69

Summary of Experiments

• Efficient for mixed read/write workload
– See Facebook LinkBench experiments in the thesis
– LSM-tree trades off read performance (but can adjust)

• State-of-the-art performance for graphs that are much larger than RAM
– Neo4j's linked-list data structure is good for RAM
– DEX performs poorly in practice

More experiments in the thesis.

70

Discussion

Greater Impact, Hindsight,

Future Research Questions

71

GREATER IMPACT

72

Impact: "Big Data" Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)
– Two major direct descendant papers in top conferences:
• X-Stream: SOSP 2013
• TurboGraph: KDD 2013

• Challenging the mainstream:
– You can do a lot on just a PC: focus on the right data structures and computational models

73

Impact: Users

• GraphChi's implementations (C++, Java, -DB) have gained a lot of users
– Currently ~50 unique visitors / day

• Enables 'everyone' to tackle big graph problems
– Especially the recommender toolkit (by Danny Bickson) has been very popular
– Typical users: students, non-systems researchers, small companies…

74

Impact: Users (cont.)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research, it is too expensive.

Using GraphChi I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research."

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for?

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models

1. Changing the value of an edge (both in- and out-edges)
2. Computation processes the whole graph, or most of it, on each iteration
3. Random access to all of a vertex's edges
– Vertex-centric vs. edge-centric
4. Async / Gauss-Seidel execution

[Diagram: edge (u, v)]

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
– Or dynamic scheduling (such as Splash BP)

• High-diameter graphs, such as planar graphs
– Unless the computation itself has short-range interactions

• Very large number of iterations
– Neural networks
– LDA with Collapsed Gibbs sampling

• No support for implicit graph structure

+ Single-PC performance is limited

78

[Diagram, repeated from the GraphChi-DB overview: graph computation applications supported by GraphChi (Parallel Sliding Windows) and graph query applications supported by GraphChi-DB (Partitioned Adjacency Lists)]

Versatility of PSW and PAL

79

Future Research Directions

• Distributed setting
1. Distributed PSW (one shard / node)
   1. PSW is inherently sequential
   2. Low bandwidth in the Cloud
2. Co-operating GraphChi(-DB)'s connected with a Parameter Server

• New graph programming models and tools
– Vertex-centric programming is sometimes too local: for example, two-hop interactions and many traversals are cumbersome
– Abstractions for learning graph structure? Implicit graphs?
– Hard to debug, especially async. Better tools needed.

• Graph-aware optimizations to GraphChi-DB
– Buffer management
– Smart caching
– Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY?

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM
– Good: works with very little memory
– Bad: does not benefit from more memory

• Simple trick: cache some data

• Many graph algorithms have O(V) state
– Update function accesses neighbor vertex state
• Standard PSW: 'broadcast' vertex value via edges
• Semi-external: store vertex values in memory

82

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
– But not enough to store the whole graph

[Diagram: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state, …) in RAM. A minimal sketch of this pattern follows below.]
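A minimal Scala sketch of the semi-external pattern, assuming a helper edgesOf(p) that streams interval p's edges from disk; the O(V) rank and degree arrays are the only state held in RAM:

  def semiExternalPageRank(numVertices: Int,
                           outDegree: Array[Int],
                           edgesOf: Int => Iterator[(Int, Int)],  // interval -> (src, dst) edges
                           numIntervals: Int,
                           iterations: Int): Array[Float] = {
    val rank = Array.fill(numVertices)(1.0f)        // O(V) state in memory
    for (_ <- 1 to iterations) {
      val next = Array.fill(numVertices)(0.15f)
      for (p <- 0 until numIntervals; (src, dst) <- edgesOf(p)) {
        // neighbor value is read directly from the in-memory array instead of
        // being broadcast along edges as in standard PSW
        next(dst) += 0.85f * rank(src) / math.max(outDegree(src), 1)
      }
      Array.copy(next, 0, rank, 0, numVertices)
    }
    rank
  }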

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
– Store billions of random walk states in RAM

• Multiple Breadth-First Searches
– Analyze neighborhood sizes by starting hundreds of random BFSes

• Compute in parallel many different recommender algorithms (or the same one with different parameterizations)
– See Mayank Mohta & Shu-Hao Yu's Master's project

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin). UAI 2010.

Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors). VLDB 2012.

GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch). OSDI 2012.

DrunkardMob: Billions of Random Walks on Just a PC. ACM RecSys 2013.

Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch). SEA 2014.

GraphChi-DB: Simple Design for a Scalable Graph Database, on Just a PC (with C. Guestrin). (submitted, arXiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin). ICML 2011.

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation: GraphChi

• Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database: GraphChi-DB

• Proposed DrunkardMob for simulating billions of random walks in parallel

• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms: a new approach for external-memory graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                    GraphChi (40 Mac Minis)    PowerGraph (64 EC2 cc1.4xlarge)
Investments         67,320 $                   -
Operating costs:
Per node / hour     0.03 $                     1.30 $
Cluster / hour      1.19 $                     52.00 $
Daily               28.56 $                    1,248.00 $

It takes about 56 days to recoup the Mac Mini investments.

Assumptions:
- Mac Mini: 85 W (typical servers: 500-1000 W)
- Most expensive US energy: 35c / kWh

Equal-throughput configurations (based on OSDI'12).

89

PSW for In-memory Computation

• External memory setting
– Slow memory = hard disk / SSD
– Fast memory = RAM

• In-memory
– Slow = RAM
– Fast = CPU caches

[Diagram: small (fast) memory of size M, large (slow) memory of 'unlimited' size, block transfers of size B between them, CPU attached to the fast memory]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra*, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

* Julian Shun, Guy Blelloch, PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
– But needs twice the amount of space

• Async / Gauss-Seidel helps with high-diameter graphs
• Some algorithms converge much better asynchronously
– Loopy BP, see Gonzalez et al. (2009)
– Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)

• Asynchronous is sometimes difficult to reason about and debug
• Asynchronous can be used to implement BSP

I/O Complexity

• See the paper for theoretical analysis in the Aggarwal-Vitter I/O model
– Worst-case only 2x best-case

• Intuition:

[Diagram: vertices 1..|V| split into interval(1)..interval(P) with shard(1)..shard(P); example vertices v1, v2]

An edge inside one interval is loaded from disk only once / iteration.
An edge spanning two intervals is loaded twice / iteration.

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise they would require too many iterations

• Per-iteration cost is not much affected

• Can we optimize partitioning?
– Could help thanks to Gauss-Seidel (faster convergence inside "groups"), topological sort
– Likely too expensive to do on a single PC

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
– But they require graph partitioning, which requires a lot of memory
– Compression of large graphs can take days (personal communication)

• Compression is problematic for evolving graphs and associated data

• GraphChi can be used to compress graphs
– Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• 1990s, 2000s saw interest in object-oriented and graph databases
– GOOD, GraphDB, HyperGraphDB…
– Focus was on modeling; graph storage on top of a relational DB or key-value store

• RDF databases
– Most do not use graph storage but store triples as relations + use indexing

• Modern solutions have proposed graph-specific storage:
– Neo4j: doubly linked list
– TurboGraph: adjacency list chopped into pages
– DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont.)

• GraphChi load time 9 hours, FB's 12 hours
• GraphChi database about 250 GB, FB > 1.4 terabytes
– However, about 100 GB is explained by different variable data (payload) size

• Facebook: MySQL via JDBC; GraphChi: embedded
– But MySQL is native code, GraphChi-DB is Scala (JVM)

• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance / Size of DB

[Chart: requests / sec (0–10,000) vs. number of nodes (1x10^7 to 1x10^9)]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory-extension [SSDAlloc, NSDI'11]
→ Too many small objects; need millions / sec

2. Compress the graph structure to fit into RAM [WebGraph framework]
→ Associated values do not compress well, and are mutated

3. Cluster the graph and handle each cluster separately in RAM
→ Expensive: the number of inter-cluster edges is big

4. Caching of hot nodes
→ Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance.

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks: parallel I/O

[Chart: seconds / iteration (0–3000) for Pagerank and Connected components with 1, 2, and 3 hard drives]

Experiment on a 16-core AMD server (from year 2007).

Bottlenecks

[Chart: Connected Components on Mac Mini SSD; runtime (0–2500 s) with 1, 2, and 4 threads, broken into Disk IO / Graph construction / Exec. updates]

• Cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
– Graph construction requires a lot of random access in RAM: memory bandwidth becomes a bottleneck

Bottlenecks: Multicore

Experiment on MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime (seconds), split into Loading and Computation, vs. number of threads (1, 2, 4) for Matrix Factorization (ALS) and Connected Components]

104

In-memory vs Disk

105

Experiment: Query latency

See thesis for I/O cost analysis of in/out queries.

106

Example: Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S

• Very fast in GraphChi-DB
– Sufficient to query for out-edges
– Parallelizes well: multi-out-edge-query

• Can be used for statistical graph analysis
– Sample induced neighborhoods, induced FoF neighborhoods from the graph
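A short Scala sketch of why out-edge queries suffice, assuming a hypothetical outNeighbors query function: every edge of the induced subgraph has its source in S, so querying out-edges of S and filtering the endpoints finds all of them:

  // Induced subgraph via out-edge queries only (illustrative API).
  def inducedSubgraph(s: Set[Long], outNeighbors: Long => Seq[Long]): Seq[(Long, Long)] =
    s.toSeq.flatMap { v =>
      outNeighbors(v).filter(s.contains).map(u => (v, u))  // keep only edges staying inside S
    }

The per-vertex queries are independent, which is why the multi-out-edge-query version parallelizes well.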

Vertices / Nodes

• Vertices are partitioned similarly as edges
– Similar "data shards" for columns

• Lookup / update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
– Sparse structure could be supported

[Diagram: vertex id space 1 … a, a+1 … 2a, …, Max-id split into intervals of length a]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance shards
– Interval length constant a

[Diagram: original IDs 0, 1, 2, …, 255 and 256, 257, 258, …, 511 mapped in blocks across the intervals (internal IDs 0, 1, 2, …; a, a+1, a+2, …; 2a, 2a+1, …), so the original ID space is spread evenly over the intervals]

109

What if we have a Cluster?

[Diagram: each machine runs PSW with the graph working memory on disk and several computations' state in RAM]

Trade latency for throughput.

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications

2. … which is caused by a lack of good data available to academics: big graphs with metadata
– Industry co-operation? But problem with reproducibility
– Also, it is hard to ask good questions about graphs (especially with just structure)

3. Too much focus on performance. More important to enable "extracting value".

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems: "paging" from disk is costly.

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
– Reverse thinking:

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk.
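An illustrative Scala sketch of how little state a walk needs; packing the walk id and current vertex into one Long is an assumption made for the example, not necessarily DrunkardMob's exact encoding:

  // Illustrative packing: high 32 bits = walk id, low 32 bits = current vertex.
  def pack(walkId: Int, vertex: Int): Long = (walkId.toLong << 32) | (vertex & 0xffffffffL)
  def walkIdOf(w: Long): Int = (w >>> 32).toInt
  def vertexOf(w: Long): Int = (w & 0xffffffffL).toInt

  // At 8 bytes per walk, one billion walks fit in about 8 GB of RAM.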

Page 22: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

22

Parallel Sliding Windows Phases

bull PSW processes the graph one sub-graph a time

bull In one iteration the whole graph is processedndash And typically next iteration is started

1 Load

2 Compute

3 Write

23

bull Vertices are numbered from 1 to nndash P intervalsndash sub-graph = interval of vertices

PSW Shards and Intervals

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 10000100 700

1 Load

2 Compute

3 Write

ltltpartition-bygtgt

source destinationedge

In shards edges sorted by source

24

Example Layout

Shard 1

Shards small enough to fit in memory balance size of shards

Shard in-edges for interval of vertices sorted by source-id

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Shard 2 Shard 3 Shard 4Shard 1

1 Load

2 Compute

3 Write

25

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Load all in-edgesin memory

Load subgraph for vertices 1100

What about out-edges Arranged in sequence in other shards

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Shard 1

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

bull Bulk-Synchronous Parallel (Jacobi iterations)ndash Updates see neighborsrsquo values from

previous iteration [Most systems are synchronous]

bull Asynchronous (Gauss-Seidel iterations)ndash Updates see most recent valuesndash GraphLab is asynchronous

30

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW runs Gauss-Seidel

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Fresh values in this ldquowindowrdquo from previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph diameter ndashmany iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N - 1) (e ndash 1) asymp 60 of synchronous

Each vertex chooses minimum label of neighbor

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 23: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

23

bull Vertices are numbered from 1 to nndash P intervalsndash sub-graph = interval of vertices

PSW Shards and Intervals

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 10000100 700

1 Load

2 Compute

3 Write

ltltpartition-bygtgt

source destinationedge

In shards edges sorted by source

24

Example Layout

Shard 1

Shards small enough to fit in memory balance size of shards

Shard in-edges for interval of vertices sorted by source-id

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Shard 2 Shard 3 Shard 4Shard 1

1 Load

2 Compute

3 Write

25

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Load all in-edgesin memory

Load subgraph for vertices 1100

What about out-edges Arranged in sequence in other shards

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Shard 1

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

bull Bulk-Synchronous Parallel (Jacobi iterations)ndash Updates see neighborsrsquo values from

previous iteration [Most systems are synchronous]

bull Asynchronous (Gauss-Seidel iterations)ndash Updates see most recent valuesndash GraphLab is asynchronous

30

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW runs Gauss-Seidel

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Fresh values in this ldquowindowrdquo from previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph diameter ndashmany iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N - 1) (e ndash 1) asymp 60 of synchronous

Each vertex chooses minimum label of neighbor

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

See the paper for more comparisons.

[Charts: runtime in minutes, GraphChi (Mac Mini) vs. a distributed / parallel system]
- PageRank, Twitter-2010 (1.5B edges): GraphChi (Mac Mini) vs. Spark (50 machines)
- WebGraph Belief Propagation (U. Kang et al.), Yahoo-web (6.7B edges): GraphChi (Mac Mini) vs. Pegasus / Hadoop (100 machines)
- Matrix Factorization (Alt. Least Sqr.), Netflix (0.99B edges): GraphChi (Mac Mini) vs. GraphLab v1 (8 cores)
- Triangle Counting, twitter-2010 (1.5B edges): GraphChi (Mac Mini) vs. Hadoop (1636 machines)

On a Mac Mini, GraphChi can solve as big problems as existing large-scale systems, with comparable performance.

40

PowerGraph Comparison
• PowerGraph / GraphLab 2 outperforms previous systems by a wide margin on natural graphs (OSDI '12)
• With 64x more machines, 512x more CPUs:
- Pagerank: 40x faster than GraphChi
- Triangle counting: 30x faster than GraphChi

GraphChi has good performance / CPU.
[Chart: PowerGraph (GraphLab 2) vs. GraphChi]

41

In-memory Comparison

• Total runtime comparison to 1-shard GraphChi, with initial load + output write taken into account

GraphChi, Mac Mini - SSD: 790 secs
Ligra (J. Shun, Blelloch), 40-core Intel E7-8870: 15 secs

• Comparison to state-of-the-art
• 5 iterations of Pagerank, Twitter (1.5B edges)

However, sometimes a better algorithm is available for in-memory than for external-memory / distributed settings.

Ligra (J. Shun, Blelloch), 8-core Xeon 5550: 230 s + preproc 144 s
PSW - in-mem version, 700 shards (see Appendix), 8-core Xeon 5550: 100 s + preproc 210 s

42

Scalability Input Size [SSD]

• Throughput: number of edges processed / second

Conclusion: the throughput remains roughly constant when graph size is increased.

GraphChi with a hard drive is ~2x slower than SSD (if the computational cost is low).

[Chart: PageRank throughput (performance, edges/sec) vs. graph size (0 to 7e9 edges) on Mac Mini SSD, for domain, twitter-2010, uk-2007-05, uk-union, yahoo-web]

See the paper for scalability of other applications.

43

GRAPHCHI-DB
New work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batch comp.

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle Counting, Item-Item Similarity

PageRank, SALSA, HITS

Graph Contraction, Minimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected Components, Strongly Connected Components

Community Detection, Label Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief Propagation, Co-EM

Graph traversal

45

Research Questions

• What if there is a lot of metadata associated with edges and vertices?
• How to do graph queries efficiently while retaining computational capabilities?
• How to add edges efficiently to the graph?

Can we design a graph database based on GraphChi?

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problems:
• Poor performance with data >> memory
• No/weak support for analytical computation

2) Relational / key-value databases as graph storage

Problems:
• Large indices
• In-edge / out-edge dilemma
• No/weak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1 (sorted by source_id), Shard 2, Shard 3, ..., Shard P
Vertex id ranges: 0..100, 101..700, ..., N

Edge = (src, dst). Partition by dst.
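As a hedged illustration of the partitioning rule above (the interval boundaries and names are made up, not GraphChi's actual preprocessing code), an edge is routed to the shard whose destination-vertex interval contains dst:

#include <cstdint>
#include <vector>

// Shard p stores every edge (src, dst) whose dst falls into interval p.
// 'interval_ends' holds the last vertex id of each interval, e.g. {100, 700, 1000, 10000}.
int shard_for_edge(uint64_t dst, const std::vector<uint64_t>& interval_ends) {
    // Linear scan for clarity; a binary search works equally well.
    for (size_t p = 0; p < interval_ends.size(); ++p)
        if (dst <= interval_ends[p]) return static_cast<int>(p);
    return -1;  // dst out of range
}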

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

... ...  (columns: src, dst)

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1:
How to find the in-edges of a vertex quickly?
Note: We know the shard, but the edges are in random order.

Pointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2:
How to find out-edges quickly?
Note: Sorted inside a shard, but partitioned across all shards.

Augmented linked list for in-edges

53
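A sketch of how a shard from the last two slides could be laid out, under assumed field names (the real GraphChi-DB code differs): a sparse pointer array in CSR style, plus an edge array where each in-edge also carries a link to the next in-edge of the same destination.

#include <cstdint>
#include <vector>

// One PAL shard (simplified). Edges are sorted by source id; the pointer array is
// sparse (one entry per source that has edges in this shard), CSR style.
struct PointerEntry { uint64_t src, first_edge; };

struct EdgeEntry {
    uint64_t dst;             // destination vertex id (in this shard's interval)
    uint64_t next_in_offset;  // edge-array index of the next in-edge of the same dst
};

struct Shard {
    static constexpr uint64_t NO_EDGE = ~0ull;
    std::vector<PointerEntry> pointers;   // "pointer-array"
    std::vector<EdgeEntry>    edges;      // "edge-array"
    std::vector<uint64_t>     first_in;   // per interval vertex: head of its in-edge chain

    // In-edge query for one vertex of this interval: walk the augmented linked list.
    std::vector<uint64_t> in_edge_indices(uint64_t local_vertex) const {
        std::vector<uint64_t> idx;
        for (uint64_t e = first_in[local_vertex]; e != NO_EDGE; e = edges[e].next_in_offset)
            idx.push_back(e);  // the source of edge e is recovered via the pointer-array
        return idx;
    }
};

Because the source id is implied by the pointer array rather than stored per edge, the in-edge chain answers "who points to me?" with one shard read and no per-edge source column.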

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem: Can be big -- O(V). Binary search on disk is slow.

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1: Sparse index (in-mem)

Option 2: Delta-coding with unary code (Elias-Gamma). Completely in-memory.
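A hedged sketch of Option 1 above: keep a small in-memory sparse index over the on-disk pointer array, so an out-edge query reads and binary-searches only one small region of the shard. The index granularity and names are assumptions, not GraphChi-DB's actual structures.

#include <algorithm>
#include <cstdint>
#include <vector>

// Sparse index: every k-th pointer-array entry kept in RAM, with its file position.
struct SparseIndexEntry { uint64_t src; uint64_t file_pos; };

// Locate the pointer-array region that may contain 'src'; the query then reads only
// that region from disk and binary-searches it for the exact entry.
uint64_t region_for_source(uint64_t src, const std::vector<SparseIndexEntry>& index) {
    if (index.empty()) return 0;
    auto it = std::upper_bound(index.begin(), index.end(), src,
                               [](uint64_t s, const SparseIndexEntry& e) { return s < e.src; });
    if (it == index.begin()) return index.front().file_pos;
    return std::prev(it)->file_pos;  // start of the region to scan on disk
}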

54

Experiment Indices

Median latency, Twitter graph, 1.5B edges

55

Queries: I/O costs
In-edge query: only one shard
Out-edge query: each shard that has edges

Trade-off: more shards = better locality for in-edge queries, worse for out-edge queries.

Edge Data amp Searches

Shard X - adjacency
'weight' [float]
'timestamp' [long], 'belief' (factor)
Edge (i)

No foreign key is required to find edge data (fixed size).

Note: vertex values are stored similarly.
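Because each edge-data column is a flat file of fixed-size values, the value of edge i sits at a fixed offset; a minimal sketch under assumed names (not the actual GraphChi-DB column format):

#include <cstdint>
#include <cstdio>

// Read the value of edge 'i' from a fixed-width column file such as 'weight' [float]
// or 'timestamp' [long]. No foreign key or per-edge index is needed: the edge's
// position in the shard *is* its key into every column.
template <typename T>
T read_edge_column(std::FILE* column_file, uint64_t edge_index) {
    T value{};
    std::fseek(column_file, static_cast<long>(edge_index * sizeof(T)), SEEK_SET);
    std::fread(&value, sizeof(T), 1, column_file);
    return value;
}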

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading the existing shard from disk. Each edge is always rewritten on a merge.

Does not scale: number of rewrites ~ data size / buffer size = O(E).

60

Log-Structured Merge-tree

(LSM). Ref: O'Neil, Cheng et al. (1996)

Top shards with in-memory buffers; on-disk shards below.

LEVEL 1 (youngest): interval 1, interval 2, interval 3, interval 4, ..., interval P-1, interval P  <- new edges
LEVEL 2: intervals 1--4, ..., intervals (P-4)--P  (downstream merge)
LEVEL 3 (oldest): intervals 1--16, ...

In-edge query: one shard on each level
Out-edge query: all shards
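A simplified sketch of the LSM idea on this slide, with invented names and an in-memory stand-in for the on-disk shards: new edges land in a buffer, and full levels are merged downstream into larger ones, so each edge is rewritten only once per level instead of on every buffer flush.

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

struct Edge { uint64_t src, dst; };
bool by_src(const Edge& a, const Edge& b) { return a.src < b.src; }

// One interval's shards, youngest level first. Level i holds up to base * fanout^(i+1) edges.
struct LsmShards {
    std::vector<Edge> buffer;               // level-0: in-memory edge buffer
    std::vector<std::vector<Edge>> levels;   // on-disk shards (modelled in memory here)
    size_t base = 1 << 16;
    size_t fanout = 4;

    void add_edge(Edge e) {
        buffer.push_back(e);
        if (buffer.size() >= base) flush();
    }

    void flush() {
        std::sort(buffer.begin(), buffer.end(), by_src);
        merge_into(0, buffer);
        buffer.clear();
    }

    // Merge a sorted run into level 'i'; cascade downstream when a level grows too large.
    void merge_into(size_t i, std::vector<Edge>& run) {
        if (levels.size() <= i) levels.resize(i + 1);
        std::vector<Edge> merged;
        std::merge(levels[i].begin(), levels[i].end(), run.begin(), run.end(),
                   std::back_inserter(merged), by_src);
        levels[i].swap(merged);
        size_t capacity = base;
        for (size_t k = 0; k <= i; ++k) capacity *= fanout;
        if (levels[i].size() > capacity) {   // push the whole level one step downstream
            std::vector<Edge> spill;
            spill.swap(levels[i]);
            merge_into(i + 1, spill);
        }
    }
};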

61

Experiment Ingest

[Chart: edges ingested (0 to 1.5e9) vs. time (0 to 16,000 seconds), GraphChi-DB with LSM vs. GraphChi-No-LSM]

62

Advantages of PAL

• Only sparse and implicit indices
- Pointer-array usually fits in RAM with Elias-Gamma
- Small database size

• Columnar data model
- Load only the data you need
- Graph structure is separate from data
- Property graph model

• Great insertion throughput with LSM
- Tree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

• Written in Scala
• Queries & Computation
• Online database

All experiments shown in this talk were done on a Mac Mini (8 GB, SSD).

Source code and examples: http://github.com/graphchi

65

Comparison Database Size

Baseline: 4 + 4 bytes / edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

[Bar chart, scale 0-70]
Database file size (twitter-2010 graph, 1.5B edges)

66

Comparison Ingest

System | Time to ingest 1.5B edges
GraphChi-DB (ONLINE) | 1 hour 45 mins
Neo4j (batch) | 45 hours
MySQL (batch) | 3 hours 30 minutes (including index creation)

If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes.

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

[Charts: friends-of-friends query latency (milliseconds), latency percentiles over 100K random queries]

Small graph (68M edges):
- 50th percentile: GraphChi-DB 0.379 ms, Neo4j 0.127 ms
- 99th percentile: GraphChi-DB 6.653 ms, Neo4j 8.078 ms

Big graph (1.5B edges):
- 50th percentile: GraphChi-DB 22.4 ms, Neo4j 76.0 ms, MySQL 59 ms
- 99th percentile: GraphChi-DB 1264 ms, Neo4j 1631 ms, MySQL 4776 ms

68

LinkBench Online Graph DB Benchmark by Facebook

• Concurrent read/write workload
- But only single-hop queries ("friends")
- 8 different operations, mixed workload
- Best performance with 64 parallel threads

• Each edge and vertex has:
- Version, timestamp, type, random string payload

| GraphChi-DB (Mac Mini) | MySQL + FB patch, server, SSD-array, 144 GB RAM
Edge-update (95p) | 22 ms | 25 ms
Edge-get-neighbors (95p) | 18 ms | 9 ms
Avg throughput | 2,487 req/s | 11,029 req/s
Database size | 350 GB | 1.4 TB

See full results in the thesis

69

Summary of Experiments

• Efficient for a mixed read/write workload
- See the Facebook LinkBench experiments in the thesis
- LSM-tree trade-off: read performance (but can adjust)

• State-of-the-art performance for graphs that are much larger than RAM
- Neo4j's linked-list data structure is good for RAM
- DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater Impact / Hindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)
- Two major direct descendant papers in top conferences:
• X-Stream, SOSP 2013
• TurboGraph, KDD 2013

• Challenging the mainstream
- You can do a lot on just a PC: focus on the right data structures and computational models

73

Impact Users

• GraphChi's versions (C++, Java, -DB) have gained a lot of users
- Currently ~50 unique visitors / day

• Enables 'everyone' to tackle big graph problems
- Especially the recommender toolkit (by Danny Bickson) has been very popular
- Typical users: students, non-systems researchers, small companies...

74

Impact Users (cont)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research - it is too expensive.

Using GraphChi, I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research."

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models

1. Changing the value of an edge (both in- and out-edges)
2. Computation processes the whole or most of the graph on each iteration
3. Random access to all of a vertex's edges
- Vertex-centric vs. edge-centric
4. Async / Gauss-Seidel execution

u v

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
- Or dynamic scheduling (such as Splash BP)

• High-diameter graphs, such as planar graphs
- Unless the computation itself has short-range interactions

• Very large number of iterations
- Neural networks
- LDA with Collapsed Gibbs sampling

• No support for implicit graph structure

+ Single-PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batch comp.

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle Counting, Item-Item Similarity

PageRank, SALSA, HITS

Graph Contraction, Minimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected Components, Strongly Connected Components

Community Detection, Label Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief Propagation, Co-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

• Distributed Setting
1. Distributed PSW (one shard / node)
- PSW is inherently sequential
- Low bandwidth in the Cloud
2. Co-operating GraphChi(-DB)'s connected with a Parameter Server

• New graph programming models and tools
- Vertex-centric programming is sometimes too local: for example, two-hop interactions and many traversals are cumbersome
- Abstractions for learning graph structure; implicit graphs
- Hard to debug, especially async; better tools needed

• Graph-aware optimizations to GraphChi-DB
- Buffer management
- Smart caching
- Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM
- Good: works with very little memory
- Bad: does not benefit from more memory

• Simple trick: cache some data

• Many graph algorithms have O(V) state
- The update function accesses neighbor vertex state
• Standard PSW: 'broadcast' the vertex value via edges
• Semi-external: store vertex values in memory (see the sketch below)
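A sketch of the semi-external trick: keep one float of state per vertex in RAM and stream the edges from disk, so an update never needs to read neighbor state through edge values. The file layout, names, and PageRank-style constants are assumptions, not GraphChi's API.

#include <cstdint>
#include <cstdio>
#include <vector>

struct DiskEdge { uint32_t src, dst; };  // 8 bytes / edge in the on-disk stream

// One iteration of a PageRank-style update with O(V) state held in RAM:
// 'rank' and 'degree' fit in memory even when the edge file does not.
void semi_external_iteration(std::FILE* edge_file, std::vector<float>& rank,
                             const std::vector<uint32_t>& degree) {
    std::vector<float> next(rank.size(), 0.15f);
    std::rewind(edge_file);
    DiskEdge buf[4096];
    size_t n;
    while ((n = std::fread(buf, sizeof(DiskEdge), 4096, edge_file)) > 0) {
        for (size_t i = 0; i < n; ++i) {
            const DiskEdge& e = buf[i];
            if (degree[e.src] > 0)
                next[e.dst] += 0.85f * rank[e.src] / degree[e.src];  // neighbor state read from RAM
        }
    }
    rank.swap(next);
}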

82

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
- But not enough to store the whole graph

Graph working memory (PSW) disk

computational state

Computation 1 state / Computation 2 state

RAM

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
- Store billions of random-walk states in RAM

• Multiple Breadth-First Searches
- Analyze neighborhood sizes by starting hundreds of random BFSes

• Compute many different recommender algorithms in parallel (or with different parameterizations)
- See Mayank Mohta and Shu-Hao Yu's Master's project

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin) - UAI 2010
Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (-- same --) - VLDB 2012
GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch) - OSDI 2012
DrunkardMob: Billions of Random Walks on Just a PC - ACM RecSys 2013
Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch) - SEA 2014
GraphChi-DB: Simple Design for a Scalable Graph Database - on Just a PC (with C. Guestrin) - (submitted, arXiv)
Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin) - ICML 2011

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation -> GraphChi
• Extended PSW to design Partitioned Adjacency Lists for building a scalable graph database -> GraphChi-DB
• Proposed DrunkardMob for simulating billions of random walks in parallel
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms -> a new approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

| GraphChi (40 Mac Minis) | PowerGraph (64 EC2 cc1.4xlarge)
Investments | 67,320 $ | -
Operating costs:
Per node / hour | 0.03 $ | 1.30 $
Cluster / hour | 1.19 $ | 52.00 $
Daily | 28.56 $ | 1,248.00 $

It takes about 56 days to recoup the Mac Mini investments.

Assumptions:
- Mac Mini: 85 W (typical servers: 500-1000 W)
- Most expensive US energy: 35c / kWh

Equal-throughput configurations (based on OSDI '12).

89

PSW for In-memory Computation

• External-memory setting
- Slow memory = hard disk / SSD
- Fast memory = RAM

• In-memory
- Slow = RAM
- Fast = CPU caches

[Diagram: CPU - small (fast) memory of size M - block transfers of size B - large (slow) memory of 'unlimited' size]

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

Ligra: Julian Shun, Guy Blelloch, PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
- But needs twice the amount of space
• Async / G-S helps with high-diameter graphs
• Some algorithms converge much better asynchronously
- Loopy BP, see Gonzalez et al. (2009)
- Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)
• Asynchronous is sometimes difficult to reason about and debug
• Asynchronous can be used to implement BSP

IO Complexity

• See the paper for theoretical analysis in the Aggarwal-Vitter I/O model
- Worst-case only 2x best-case

• Intuition:
[Diagram: vertices 1..|V| split into interval(1), interval(2), ..., interval(P) with shard(1), shard(2), ..., shard(P); example edges v1, v2]

An edge inside one interval is loaded from disk only once / iteration.
An edge spanning intervals is loaded twice / iteration.

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise they would require too many iterations
• Per-iteration cost is not much affected
• Can we optimize partitioning?
- Could help thanks to Gauss-Seidel (faster convergence inside "groups"), topological sort
- Likely too expensive to do on a single PC

94

Graph Compression Would it help

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
- But they require graph partitioning, which requires a lot of memory
- Compression of large graphs can take days (personal communication)

• Compression is problematic for evolving graphs and associated data

• GraphChi can be used to compress graphs
- Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• The 1990s / 2000s saw interest in object-oriented and graph databases
- GOOD, GraphDB, HyperGraphDB...
- Focus was on modeling; graph storage on top of a relational DB or key-value store

• RDF databases
- Most do not use graph storage but store triples as relations + use indexing

• Modern solutions have proposed graph-specific storage
- Neo4j: doubly linked list
- TurboGraph: adjacency list chopped into pages
- DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time 9 hours, FB's 12 hours
• GraphChi database about 250 GB, FB > 1.4 terabytes
- However, about 100 GB is explained by different variable data (payload) size

• Facebook: MySQL via JDBC; GraphChi: embedded
- But MySQL is native code, GraphChi-DB is Scala (JVM)

• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance vs. size of DB

[Chart: requests / sec (0 to 10,000) vs. number of nodes (1e7, 1e8, 2e8, 1e9)]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory extension [SSDAlloc, NSDI '11]
-> Too many small objects; need millions / sec

2. Compress the graph structure to fit into RAM [WebGraph framework]
-> Associated values do not compress well and are mutated

3. Cluster the graph and handle each cluster separately in RAM
-> Expensive: the number of inter-cluster edges is big

4. Caching of hot nodes
-> Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance.

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks -> parallel I/O

[Chart: seconds / iteration (0 to 3000) for Pagerank and Connected components with 1, 2, and 3 hard drives]

Experiment on a 16-core AMD server (from year 2007).

Bottlenecks

[Chart: runtime (0 to 2500 seconds) with 1, 2, and 4 threads, broken into Disk IO, Graph construction, and Exec. updates]

Connected Components on Mac Mini, SSD

• The cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
- Graph construction requires a lot of random access in RAM -> memory bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime in seconds (split into Loading and Computation) vs. number of threads (1, 2, 4) for Matrix Factorization (ALS, 0-160 s) and Connected Components (0-1000 s)]

104

In-memory vs Disk

105

Experiment Query latency

See the thesis for I/O cost analysis of in/out queries.

106

Example Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S

• Very fast in GraphChi-DB
- Sufficient to query for out-edges (see the sketch below)
- Parallelizes well: multi-out-edge-query

• Can be used for statistical graph analysis
- Sample induced neighborhoods, induced FoF neighborhoods from the graph
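A sketch of the induced-subgraph query as described above: one out-edge query per vertex in S, keeping only edges whose endpoint is also in S. The `out_neighbors` callback stands in for GraphChi-DB's actual query API.

#include <cstdint>
#include <functional>
#include <unordered_set>
#include <utility>
#include <vector>

using VertexId = uint64_t;
using Edge = std::pair<VertexId, VertexId>;

// Induced subgraph of vertex set S: only out-edge queries are needed, one per vertex,
// and the queries are independent, so they parallelize well.
std::vector<Edge> induced_subgraph(
    const std::vector<VertexId>& S,
    const std::function<std::vector<VertexId>(VertexId)>& out_neighbors) {
    std::unordered_set<VertexId> in_set(S.begin(), S.end());
    std::vector<Edge> edges;
    for (VertexId v : S)
        for (VertexId u : out_neighbors(v))
            if (in_set.count(u)) edges.emplace_back(v, u);
    return edges;
}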

Vertices / Nodes
• Vertices are partitioned similarly to edges
- Similar "data shards" for columns
• Lookup / update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
- A sparse structure could be supported

[Diagram: vertex id ranges 1..a, a+1..2a, ..., up to Max-id]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance shards
- Interval length is a constant a

[Diagram: internal IDs 0, 1, 2, ...; a, a+1, a+2, ...; 2a, 2a+1, ...; original IDs 0, 1, 2, ..., 255 and 256, 257, 258, ..., 511 are spread across the intervals]

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 state / Computation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications

2. ... which is caused by a lack of good data available to academics: big graphs with metadata
- Industry co-operation? But problem with reproducibility
- Also, it is hard to ask good questions about graphs (especially with just structure)

3. Too much focus on performance. More important to enable "extracting value".

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems: "paging" from disk is costly. (This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
- Reverse thinking:

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk.

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • "Gauss-Seidel" Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM), Ref: O'Neil, Cheng et al. (1996)
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact: "Big Data" Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 24: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

24

Example Layout

Shard 1

Shards small enough to fit in memory balance size of shards

Shard in-edges for interval of vertices sorted by source-id

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Shard 2 Shard 3 Shard 4Shard 1

1 Load

2 Compute

3 Write

25

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Load all in-edgesin memory

Load subgraph for vertices 1100

What about out-edges Arranged in sequence in other shards

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Shard 1

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

bull Bulk-Synchronous Parallel (Jacobi iterations)ndash Updates see neighborsrsquo values from

previous iteration [Most systems are synchronous]

bull Asynchronous (Gauss-Seidel iterations)ndash Updates see most recent valuesndash GraphLab is asynchronous

30

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW runs Gauss-Seidel

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Fresh values in this ldquowindowrdquo from previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph diameter ndashmany iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N - 1) (e ndash 1) asymp 60 of synchronous

Each vertex chooses minimum label of neighbor

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 25: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

25

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Load all in-edgesin memory

Load subgraph for vertices 1100

What about out-edges Arranged in sequence in other shards

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Shard 1

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

26

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW Loading Sub-graph

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

1 Load

2 Compute

3 Write

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

bull Bulk-Synchronous Parallel (Jacobi iterations)ndash Updates see neighborsrsquo values from

previous iteration [Most systems are synchronous]

bull Asynchronous (Gauss-Seidel iterations)ndash Updates see most recent valuesndash GraphLab is asynchronous

30

Shard 1

Load all in-edgesin memory

Load subgraph for vertices 101700

Shard 2 Shard 3 Shard 4

PSW runs Gauss-Seidel

Vertices1100

Vertices101700

Vertices7011000

Vertices100110000

Out-edge blocksin memory

in-e

dges

for v

ertic

es 1

100

sort

ed b

y so

urce

_id

Fresh values in this ldquowindowrdquo from previous phase

31

Synchronous (Jacobi)

1 2 5 3 4

1 1 2 3 3

1 1 1 2 3

1 1 1 1 2

1 1 1 1 1

Bulk-Synchronous requires graph diameter ndashmany iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

Each vertex chooses the minimum label of its neighbors.

Initial labels:     1 2 4 3 5
After iteration 1:  1 1 1 3 3
After iteration 2:  1 1 1 1 1

Gauss-Seidel expected iterations on a random schedule on a chain graph:
= (N - 1) / (e - 1) ≈ 60% of synchronous.
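A toy Python sketch of this difference (not GraphChi code; the deterministic left-to-right edge order below is chosen only for brevity, whereas the (N - 1)/(e - 1) expectation on the slide is for a random schedule):

def jacobi_pass(labels, edges):
    prev = labels[:]                         # Jacobi: reads use the previous iteration's values
    for u, v in edges:
        labels[u] = min(labels[u], prev[v])
        labels[v] = min(labels[v], prev[u])

def gauss_seidel_pass(labels, edges):
    for u, v in edges:                       # Gauss-Seidel: reads always see the freshest values
        m = min(labels[u], labels[v])
        labels[u] = labels[v] = m

def passes_to_converge(step, labels, edges):
    count = 0
    while True:
        before = labels[:]
        step(labels, edges)
        count += 1
        if labels == before:                 # the final confirming pass is counted too
            return count

chain = [(i, i + 1) for i in range(4)]       # 5-vertex chain, as on the slides
print(passes_to_converge(jacobi_pass, [1, 2, 5, 3, 4], chain))        # about graph-diameter passes
print(passes_to_converge(gauss_seidel_pass, [1, 2, 4, 3, 5], chain))  # clearly fewer passes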

33

Label Propagation (side length = 100). Open theoretical question!

[Table; the three grid-like example graphs are shown as pictures on the slide]
                               Synchronous          PSW Gauss-Seidel (average, random schedule)
Graph 1                        199                  ~57
Graph 2                        100                  ~29
Graph 3                        298                  ~52        (iterations)
Natural graphs (web, social)   graph diameter - 1   ~0.6 x diameter

Joint work: Julian Shun

Chapter 6

34

PSW & External Memory Algorithms Research

• PSW is a new technique for implementing many fundamental graph algorithms.
  – Especially simple (compared to previous work) for directed graph problems: PSW handles both in- and out-edges.

• We propose a new graph contraction algorithm based on PSW:
  – Minimum Spanning Forest & Connected Components
  – … utilizing the Gauss-Seidel "acceleration".

Joint work: Julian Shun

SEA 2014; Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluation:
• HD vs. SSD
• Striping data across multiple hard drives
• Comparison to an in-memory version
• Bottlenecks analysis
• Effect of the number of shards
• Block size and performance

37

GraphChi

• C++ implementation: 8,000 lines of code
  – Java implementation also available
• Several optimizations to PSW (see paper).

Source code and examples: http://github.com/graphchi

38

Experiment Setting

• Mac Mini (Apple Inc.)
  – 8 GB RAM
  – 256 GB SSD, 1 TB hard drive
  – Intel Core i5, 2.5 GHz

• Experiment graphs:

Graph         Vertices   Edges   P (shards)   Preprocessing
live-journal  4.8M       69M     3            0.5 min
netflix       0.5M       99M     20           1 min
twitter-2010  42M        1.5B    20           2 min
uk-2007-05    106M       3.7B    40           31 min
uk-union      133M       5.4B    50           33 min
yahoo-web     1.4B       6.6B    50           37 min

39

Comparison to Existing Systems

Notes: comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all but GraphLab compute synchronously.

[Bar charts; runtime in minutes]
• PageRank, Twitter-2010 (1.5B edges): Spark (50 machines) vs. GraphChi (Mac Mini)
• WebGraph Belief Propagation (U. Kang et al.), Yahoo-web (6.7B edges): Pegasus / Hadoop (100 machines) vs. GraphChi (Mac Mini)
• Matrix Factorization (Alternating Least Squares), Netflix (0.99B edges): GraphLab v1 (8 cores) vs. GraphChi (Mac Mini)
• Triangle Counting, twitter-2010 (1.5B edges): Hadoop (1,636 machines) vs. GraphChi (Mac Mini)

See the paper for more comparisons.

On a Mac Mini: GraphChi can solve as big problems as existing large-scale systems, with comparable performance.

40

PowerGraph Comparison

• PowerGraph / GraphLab 2 outperforms previous systems by a wide margin on natural graphs. (OSDI'12)
• With 64x more machines, 512x more CPUs:
  – Pagerank: 40x faster than GraphChi
  – Triangle counting: 30x faster than GraphChi

GraphChi has good performance / CPU.

41

In-memory Comparison

• Comparison to state-of-the-art: 5 iterations of Pagerank on Twitter (1.5B edges).
• Total runtime comparison to 1-shard GraphChi, with initial load + output write taken into account.

GraphChi (Mac Mini, SSD):                                           790 secs
Ligra (J. Shun, Blelloch), 40-core Intel E7-8870:                   15 secs
Ligra (J. Shun, Blelloch), 8-core Xeon 5550:                        230 s + preproc 144 s
PSW, in-mem version, 700 shards (see Appendix), 8-core Xeon 5550:   100 s + preproc 210 s

However, sometimes a better algorithm is available for in-memory than for external-memory / distributed settings.

42

Scalability: Input Size [SSD]

• Throughput: number of edges processed / second.

Conclusion: the throughput remains roughly constant when the graph size is increased.

GraphChi with a hard drive is ~2x slower than SSD (if the computational cost is low).

[Chart: PageRank throughput (Mac Mini, SSD); x-axis: graph size, 0 to 7e9 edges; y-axis: performance, up to 2.5e7 edges/sec; data points: domain, twitter-2010, uk-2007-05, uk-union, yahoo-web]

Paper: scalability of other applications.

43

GRAPHCHI-DB: New work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle Counting, Item-Item Similarity

PageRank, SALSA, HITS

Graph Contraction, Minimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected Components, Strongly Connected Components

Community Detection, Label Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief Propagation, Co-EM

Graph traversal

45

Research Questions

• What if there is a lot of metadata associated with edges and vertices?

• How to do graph queries efficiently while retaining computational capabilities?

• How to add edges efficiently to the graph?

Can we design a graph database based on GraphChi?

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problems:
• Poor performance with data >> memory
• No / weak support for analytical computation

2) Relational / key-value databases as graph storage

Problems:
• Large indices
• In-edge / out-edge dilemma
• No / weak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review: Edges in Shards

[Slide diagram] Shard 1, Shard 2, Shard 3, …, Shard P partition the vertex-ID range (0..100, 101..700, …, N). Edge = (src, dst); edges are partitioned by dst and sorted by source_id within each shard.

49

Shard Structure (Basic)

Source   Destination
1        8
1        193
1        76420
3        12
3        872
7        193
7        212
7        89139
…        …

(src, dst)

50

Shard Structure (Basic)

Edge-array (Destination): 8, 193, 76420, 12, 872, 193, 212, 89139, …

Pointer-array (Source → File offset):
1 → 0
3 → 3
7 → 5
…

Compressed Sparse Row (CSR)

Problem 1: How to find the in-edges of a vertex quickly?
Note: we know the shard, but the edges are in random order there.
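A minimal, self-contained Python sketch of this CSR-style shard layout and an out-edge lookup within one shard (illustrative only; the layout and names are not GraphChi-DB's actual on-disk format):

import bisect

pointer_array = [(1, 0), (3, 3), (7, 5)]     # (source_id, offset of its first edge), sorted by source
edge_array = [8, 193, 76420, 12, 872, 193, 212, 89139]   # destinations, grouped by source

def out_edges_in_shard(src):
    """Destinations of src's edges that are stored in this shard."""
    sources = [s for s, _ in pointer_array]
    i = bisect.bisect_left(sources, src)
    if i == len(sources) or sources[i] != src:
        return []                            # src has no edges in this shard
    start = pointer_array[i][1]
    end = pointer_array[i + 1][1] if i + 1 < len(pointer_array) else len(edge_array)
    return edge_array[start:end]

print(out_edges_in_shard(3))                 # [12, 872]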

51

PAL In-edge Linkage

[Slide diagram; same layout as the previous slide] Edge-array of destinations (8, 193, 76420, 12, 872, 193, 212, 89139, …) and pointer-array (Source → File offset: 1 → 0, 3 → 3, 7 → 5, …).

52

PAL In-edge Linkage

Edge-array (Destination, Link): (8, 3339), (193, 3), (76420, 1092), (12, 289), (872, 40), (193, 2002), (212, 12), (89139, 22), …

Pointer-array (Source → File offset): 1 → 0, 3 → 3, 7 → 5, …

+ Index to the first in-edge for each vertex in the interval.
Augmented linked list for in-edges.

Problem 2: How to find out-edges quickly?
Note: edges are sorted inside a shard, but partitioned across all shards.
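The in-edge links can be followed like a per-shard linked list. A small illustrative sketch (the offsets below are made up for the example, and the structure is a simplification of the slide's layout):

edge_rows = [(8, None), (193, 5), (76420, None), (12, None),
             (872, None), (193, None), (212, None), (89139, None)]   # (destination, next-in-offset)
first_in_edge = {8: 0, 193: 1, 76420: 2, 12: 3, 872: 4, 212: 6, 89139: 7}  # index per vertex

def in_edge_offsets(dst):
    """Yield the offsets of dst's in-edges inside this shard."""
    offset = first_in_edge.get(dst)
    while offset is not None:
        yield offset
        offset = edge_rows[offset][1]        # jump to the next in-edge of dst

print(list(in_edge_offsets(193)))            # [1, 5]: vertex 193 has two in-edges in this shard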

53

PAL: Out-edge Queries

Pointer-array (Source → File offset: 1 → 0, 3 → 3, 7 → 5, …) and edge-array (Destination, Next-in-offset): (8, 3339), (193, 3), (76420, 1092), (12, 289), (872, 40), (193, 2002), (212, 12), (89139, 22), …

Problem: the pointer-array can be big, O(V). Binary search on disk is slow.

Option 1: Sparse index (in memory); the figure shows sample index entries 256, 512, 1024, …
Option 2: Delta-coding with a unary code (Elias-Gamma). Completely in-memory.
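Option 2 can be made concrete in a few lines: encode the gaps between consecutive source IDs with Elias-Gamma codes so the whole pointer-array fits in RAM. A sketch (assumed encoding details; GraphChi-DB's exact scheme may differ):

def elias_gamma(n):
    """Elias-Gamma code of a positive integer n: (len(bin(n)) - 1) zeros, then bin(n)."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def encode_sources(sources):
    """Delta-encode a sorted list of source IDs; each gap is encoded as gap + 1 (> 0)."""
    bits, prev = [], 0
    for s in sources:
        bits.append(elias_gamma(s - prev + 1))
        prev = s
    return "".join(bits)

print(encode_sources([1, 3, 7]))             # small gaps -> only a few bits per entry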

54

Experiment: Indices

[Chart: median latency; Twitter graph, 1.5B edges]

55

Queries: I/O costs

In-edge query: only one shard.
Out-edge query: each shard that has edges.

Trade-off: more shards → better locality for in-edge queries, worse for out-edge queries.

Edge Data & Searches

[Slide diagram] Shard X: adjacency, plus separate columns for 'weight' [float], 'timestamp' [long], 'belief' (factor). Edge (i) has the same index in every column, so no foreign key is required to find edge data (fixed size).

Note: vertex values are stored similarly.

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

A merge requires loading the existing shard from disk; each edge will always be rewritten on a merge.

Does not scale: number of rewrites = data size / buffer size = O(E).

60

Log-Structured Merge-tree (LSM)
Ref: O'Neil, Cheng et al. (1996)

[Slide diagram] New edges go to the top shards with in-memory buffers (LEVEL 1, youngest; one per interval 1, 2, 3, 4, …, P-1, P). Downstream merges combine them into larger on-disk shards at LEVEL 2 (intervals 1–4, …, (P-4)–P) and LEVEL 3 (oldest; intervals 1–16, …).

In-edge query: one shard on each level.
Out-edge query: all shards.
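A toy, self-contained sketch of the log-structured-merge idea used here (illustrative only; real GraphChi-DB shards live on disk and are merged per interval):

class ToyLSM:
    def __init__(self, buffer_capacity=4, fanout=4):
        self.buffer, self.capacity, self.fanout = [], buffer_capacity, fanout
        self.levels = []                                  # levels[i]: list of sorted runs ("shards")

    def add_edge(self, src, dst):
        self.buffer.append((src, dst))                    # new edges go to the in-memory buffer
        if len(self.buffer) >= self.capacity:
            self._flush()

    def _flush(self):
        run, self.buffer = sorted(self.buffer), []
        level = 0
        while True:
            if len(self.levels) <= level:
                self.levels.append([])
            self.levels[level].append(run)
            if len(self.levels[level]) < self.fanout:     # level not full yet, stop here
                return
            run = sorted(e for r in self.levels[level] for e in r)   # downstream merge
            self.levels[level] = []
            level += 1

db = ToyLSM()
for i in range(40):
    db.add_edge(i % 7, i)                                 # edges are rewritten only when runs merge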

61

Experiment: Ingest

[Chart: edges ingested vs. time (seconds), comparing GraphChi-DB with LSM against GraphChi-No-LSM; x-axis 0 to 16,000 seconds, y-axis 0 to 1.5e9 edges]

62

Advantages of PAL

• Only sparse and implicit indices
  – The pointer-array usually fits in RAM with Elias-Gamma encoding. Small database size.
• Columnar data model
  – Load only the data you need.
  – Graph structure is separate from the data.
  – Property graph model.
• Great insertion throughput with LSM
  – The tree can be adjusted to match the workload.

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

• Written in Scala
• Queries & Computation
• Online database

All experiments shown in this talk were done on a Mac Mini (8 GB, SSD).

Source code and examples: http://github.com/graphchi

65

Comparison: Database Size

Baseline: 4 + 4 bytes / edge.

[Bar chart: database file size for the twitter-2010 graph (1.5B edges); systems compared: MySQL (data + indices), Neo4j, GraphChi-DB, and the baseline; axis 0 to 70]

66

Comparison: Ingest

System                  Time to ingest 1.5B edges
GraphChi-DB (ONLINE)    1 hour 45 mins
Neo4j (batch)           45 hours
MySQL (batch)           3 hours 30 minutes (including index creation)

If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes.

Comparison: Friends-of-Friends Query

See the thesis for the shortest-path comparison.

Latency percentiles over 100K random queries (milliseconds):

Small graph (68M edges), 50th percentile:   GraphChi-DB 0.379    Neo4j 0.127
Small graph (68M edges), 99th percentile:   GraphChi-DB 6.653    Neo4j 8.078
Big graph (1.5B edges), 50th percentile:    GraphChi-DB 22.4     Neo4j 75.98    MySQL 59
Big graph (1.5B edges), 99th percentile:    GraphChi-DB 1264     Neo4j 1631     MySQL 4776

68

LinkBench: Online Graph DB Benchmark by Facebook

• Concurrent read/write workload
  – But only single-hop queries ("friends").
  – 8 different operations, mixed workload.
  – Best performance with 64 parallel threads.
• Each edge and vertex has:
  – Version, timestamp, type, random string payload.

                           GraphChi-DB (Mac Mini)   MySQL + FB patch, server, SSD-array, 144 GB RAM
Edge-update (95p)          22 ms                    25 ms
Edge-get-neighbors (95p)   18 ms                    9 ms
Avg throughput             2,487 req/s              11,029 req/s
Database size              350 GB                   1.4 TB

See full results in the thesis.

69

Summary of Experiments

• Efficient for mixed read/write workload
  – See the Facebook LinkBench experiments in the thesis.
  – LSM-tree trade-off: read performance (but can adjust).
• State-of-the-art performance for graphs that are much larger than RAM
  – Neo4j's linked-list data structure is good for RAM.
  – DEX performs poorly in practice.

More experiments in the thesis.

70

Discussion

Greater Impact / Hindsight

Future Research Questions

71

GREATER IMPACT

72

Impact: "Big Data" Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar).
  – Two major direct descendant papers in top conferences:
    • X-Stream: SOSP 2013
    • TurboGraph: KDD 2013
• Challenging the mainstream:
  – You can do a lot on just a PC; focus on the right data structures and computational models.

73

Impact Users

• GraphChi's versions (C++, Java, -DB) have gained a lot of users.
  – Currently ~50 unique visitors / day.
• Enables 'everyone' to tackle big graph problems.
  – Especially the recommender toolkit (by Danny Bickson) has been very popular.
  – Typical users: students, non-systems researchers, small companies…

74

Impact Users (cont)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research: it is too expensive.

Using GraphChi I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research. ()"

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for?

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models.

1. Changing the value of an edge (both in- and out-edges).
2. Computation processes the whole graph, or most of it, on each iteration.
3. Random access to all of a vertex's edges.
   – Vertex-centric vs. edge-centric.
4. Async / Gauss-Seidel execution.

[Diagram: an edge between vertices u and v]

77

GraphChi Not Good For

• Very large vertex state.
• Traversals and two-hop dependencies.
  – Or dynamic scheduling (such as Splash BP).
• High-diameter graphs, such as planar graphs.
  – Unless the computation itself has short-range interactions.
• Very large number of iterations.
  – Neural networks.
  – LDA with Collapsed Gibbs sampling.
• No support for implicit graph structure.

+ Single-PC performance is limited.

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle Counting, Item-Item Similarity

PageRank, SALSA, HITS

Graph Contraction, Minimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected Components, Strongly Connected Components

Community Detection, Label Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief Propagation, Co-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

• Distributed setting
  1. Distributed PSW (one shard / node)?
     1. PSW is inherently sequential.
     2. Low bandwidth in the Cloud.
  2. Co-operating GraphChi(-DB)'s connected with a Parameter Server.
• New graph programming models and tools
  – Vertex-centric programming is sometimes too local: for example, two-hop interactions and many traversals are cumbersome.
  – Abstractions for learning graph structure? Implicit graphs?
  – Hard to debug, especially async. Better tools needed.
• Graph-aware optimizations to GraphChi-DB
  – Buffer management.
  – Smart caching.
  – Learning the configuration.

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM.
  – Good: works with very little memory.
  – Bad: does not benefit from more memory.
• Simple trick: cache some data.
• Many graph algorithms have O(V) state.
  – The update function accesses neighbor vertex state:
    • Standard PSW: 'broadcast' the vertex value via edges.
    • Semi-external: store vertex values in memory.
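A toy, self-contained Pagerank-style example of the semi-external variant: the O(V) value array lives in RAM, so the update function reads neighbor values directly instead of broadcasting them over edges (illustrative only, not GraphChi's API):

out_neighbors = {0: [1, 2], 1: [2], 2: [0]}              # tiny example graph
in_neighbors = {0: [2], 1: [0], 2: [0, 1]}
num_vertices = 3

value = [1.0 / num_vertices] * num_vertices              # O(V) state held in memory

def update(v):
    s = sum(value[u] / len(out_neighbors[u]) for u in in_neighbors[v])
    value[v] = 0.15 / num_vertices + 0.85 * s            # reads the freshest in-memory values

for _ in range(20):                                      # Gauss-Seidel style sweeps
    for v in range(num_vertices):
        update(v)
print(value)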

82

Using RAM efficiently

• Assume that there is enough RAM to store many O(V) algorithm states in memory.
  – But not enough to store the whole graph.

[Diagram: the graph and the PSW working memory stay on disk; the computational state (Computation 1 state, Computation 2 state, …) lives in RAM.]

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
  – Store billions of random-walk states in RAM.
• Multiple Breadth-First Searches
  – Analyze neighborhood sizes by starting hundreds of random BFSes.
• Compute in parallel many different recommender algorithms (or the same with different parameterizations).
  – See Mayank Mohta and Shu-Hao Yu's Master's project.

84

CONCLUSION

85

Summary of Published Work

• GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin). UAI 2010.
• Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors). VLDB 2012.
• GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch). OSDI 2012.
• DrunkardMob: Billions of Random Walks on Just a PC. ACM RecSys 2013.
• Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch). SEA 2014.
• GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin). (submitted, arXiv)
• Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin). ICML 2011.

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation → GraphChi.
• Extended PSW to design Partitioned Adjacency Lists, to build a scalable graph database → GraphChi-DB.
• Proposed DrunkardMob for simulating billions of random walks in parallel.
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms → a new approach for external-memory graph algorithms research.

Thank You

87

ADDITIONAL SLIDES

88

Economics

                    GraphChi (40 Mac Minis)    PowerGraph (64 EC2 cc1.4xlarge)
Investments         67,320 $                   -
Operating costs:
  Per node / hour   0.03 $                     1.30 $
  Cluster / hour    1.19 $                     52.00 $
  Daily             28.56 $                    1,248.00 $

It takes about 56 days to recoup the Mac Mini investments.

Assumptions:
- Mac Mini: 85 W (typical servers: 500-1000 W)
- Most expensive US energy: 35c / kWh
- Equal-throughput configurations (based on OSDI'12)

89

PSW for In-memory Computation

• External memory setting
  – Slow memory = hard disk / SSD
  – Fast memory = RAM
• In-memory
  – Slow = RAM
  – Fast = CPU caches

[Diagram: CPU, small (fast) memory of size M, large (slow) memory of 'unlimited' size, block transfers of size B]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

Julian Shun, Guy Blelloch, PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel.
  – But it needs twice the amount of space.
• Async / Gauss-Seidel helps with high-diameter graphs.
• Some algorithms converge much better asynchronously.
  – Loopy BP: see Gonzalez et al. (2009).
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989).
• Asynchronous is sometimes difficult to reason about and debug.
• Asynchronous can be used to implement BSP.

I/O Complexity

• See the paper for the theoretical analysis in the Aggarwal-Vitter I/O model.
  – Worst case is only 2x the best case.
• Intuition:

[Diagram: intervals interval(1), interval(2), …, interval(P) over vertex IDs 1..|V|, with shard(1), shard(2), …, shard(P); an example edge (v1, v2)]

An intra-interval edge is loaded from disk only once per iteration.
An edge spanning two intervals is loaded twice per iteration.

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise they would require too many iterations.
• The per-iteration cost is not much affected.
• Can we optimize partitioning?
  – Could help thanks to Gauss-Seidel (faster convergence inside "groups"); topological sort.
  – Likely too expensive to do on a single PC.

94

Graph Compression: Would it help?

• Graph compression methods (e.g., Blelloch et al., WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social).
  – But they require graph partitioning, which requires a lot of memory.
  – Compression of large graphs can take days (personal communication).
• Compression is problematic for evolving graphs and associated data.
• GraphChi can be used to compress graphs.
  – Layered label propagation (Boldi et al., 2011).

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases.
  – GOOD, GraphDB, HyperGraphDB…
  – The focus was on modeling / graph storage on top of a relational DB or key-value store.
• RDF databases
  – Most do not use graph storage but store triples as relations + use indexing.
• Modern solutions have proposed graph-specific storage.
  – Neo4j: doubly linked list.
  – TurboGraph: adjacency list chopped into pages.
  – DEX: compressed bitmaps (details not clear).

96

LinkBench

Comparison to FB (cont)

• GraphChi load time: 9 hours; FB's: 12 hours.
• GraphChi database: about 250 GB; FB: > 1.4 terabytes.
  – However, about 100 GB is explained by different variable-data (payload) size.
• Facebook: MySQL via JDBC; GraphChi: embedded.
  – But MySQL is native code, GraphChi-DB is Scala (JVM).
• Important: CPU-bound bottleneck in sorting the results for high-degree vertices.

LinkBench: GraphChi-DB performance / Size of DB

[Chart: requests / sec (0 to 10,000) vs. number of nodes (1e7 to 1e9)]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory-extension. [SSDAlloc, NSDI'11]
   → Too many small objects; need millions / sec.
2. Compress the graph structure to fit into RAM. [WebGraph framework]
   → Associated values do not compress well and are mutated.
3. Cluster the graph and handle each cluster separately in RAM.
   → Expensive; the number of inter-cluster edges is big.
4. Caching of hot nodes.
   → Unpredictable performance.

Number of Shards

• If P is in the "dozens", there is not much effect on performance.

Multiple hard drives (RAID-ish)

• GraphChi supports striping shards to multiple disks → parallel I/O.

[Chart: seconds / iteration (0 to 3000) for Pagerank and Connected Components with 1, 2, and 3 hard drives]

Experiment on a 16-core AMD server (from year 2007).

Bottlenecks

[Chart: runtime in seconds (0 to 2500) for 1, 2, and 4 threads, split into Disk IO, Graph construction, and Exec updates; Connected Components on Mac Mini, SSD]

• The cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD.
  – Graph construction requires a lot of random access in RAM → memory bandwidth becomes a bottleneck.

Bottlenecks: Multicore

Experiment on a MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution.
• GraphChi saturates SSD I/O with 2 threads.

[Charts: runtime (seconds) vs. number of threads (1, 2, 4), split into Loading and Computation, for Matrix Factorization (ALS) and Connected Components]

104

In-memory vs Disk

105

Experiment: Query latency

See the thesis for the I/O cost analysis of in/out queries.

106

Example: Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S.
• Very fast in GraphChi-DB.
  – Sufficient to query for out-edges.
  – Parallelizes well: multi-out-edge-query.
• Can be used for statistical graph analysis.
  – Sample induced neighborhoods and induced FoF neighborhoods from the graph.
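A short sketch of an induced-subgraph query built from out-edge queries only (out_edges below is a hypothetical stand-in for the database's multi-out-edge query):

def induced_subgraph(S, out_edges):
    """All edges (u, v) with both endpoints in the vertex set S."""
    S = set(S)
    return [(u, v) for u in S for v in out_edges(u) if v in S]

adj = {1: [2, 3, 9], 2: [3], 3: [7], 7: [1]}             # in-memory stand-in for the DB
print(induced_subgraph([1, 2, 3], lambda u: adj.get(u, [])))   # [(1, 2), (1, 3), (2, 3)]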

Vertices / Nodes

• Vertices are partitioned similarly to edges.
  – Similar "data shards" for columns.
• Lookup / update of vertex data is O(1).
• No merge tree here: vertex files are "dense".
  – A sparse structure could be supported.

[Diagram: vertex-ID space split into intervals 1..a, a+1..2a, …, up to Max-id]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance the shards.
  – Interval length: constant a.

[Diagram: original IDs 0, 1, 2, …, 255 map to internal IDs 0, a, 2a, …, i.e. consecutive original IDs are spread across the intervals; original IDs 256, 257, 258, …, 511 map to 1, a+1, 2a+1, …]
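A small sketch of one mapping with this round-robin pattern (an assumed formula for illustration; the exact GraphChi-DB mapping may differ):

a = 1 << 20                 # interval length (assumed constant)
num_intervals = 256         # stride between consecutive original IDs (assumed)

def to_internal(orig_id):
    return (orig_id % num_intervals) * a + (orig_id // num_intervals)

def to_original(internal_id):
    return (internal_id % a) * num_intervals + (internal_id // a)

assert to_internal(0) == 0 and to_internal(1) == a and to_internal(256) == 1
assert to_original(to_internal(12345)) == 12345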

109

What if we have a Cluster?

[Diagram: on each node, the graph and PSW working memory stay on disk while the computational state (Computation 1 state, Computation 2 state, …) is held in RAM]

Trade latency for throughput!

110

Graph Computation: Research Challenges

1. Lack of truly challenging (benchmark) applications…
2. … which is caused by a lack of good data available to academics: big graphs with metadata.
   – Industry co-operation → but a problem with reproducibility.
   – Also, it is hard to ask good questions about graphs (especially with just the structure).
3. Too much focus on performance. More important to enable "extracting value".

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i in 1..numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

[Image: Twitter network visualization by Akshay Java, 2009]

Distributed graph systems: each hop across a partition boundary is costly.
Disk-based "single-machine" graph systems: "paging" from disk is costly. (This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk
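A runnable toy version of the same idea in Python (illustrative only; the real DrunkardMob keeps billions of walk states in RAM and processes one shard's interval at a time, as described in Chapter 5):

import random

def drunkardmob(adj, num_vertices, walks_per_vertex=2, num_steps=5, interval_size=1000):
    """adj: dict vertex -> out-neighbors. Returns the final vertex of every walk."""
    position = [v for v in range(num_vertices) for _ in range(walks_per_vertex)]   # walk state only
    for _ in range(num_steps):
        snapshot = position[:]                               # walk positions at the start of the step
        for start in range(0, num_vertices, interval_size):  # process one interval at a time
            end = min(start + interval_size, num_vertices)
            for w, v in enumerate(snapshot):
                if start <= v < end:                         # this walk is currently in the interval
                    neighbors = adj.get(v, [])
                    if neighbors:
                        position[w] = random.choice(neighbors)
    return position

ring = {v: [(v + 1) % 100] for v in range(100)}              # toy graph: a directed ring
print(drunkardmob(ring, 100, walks_per_vertex=1, num_steps=3, interval_size=25)[:5])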

Page 27: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

27

Parallel Sliding Windows

Only P large reads and writes for each interval

= P2 random accesses on one full pass

Works well on both SSD and magnetic hard disks

>

28

ldquoGAUSS-SEIDELrdquo ASYNCHRONOUS

How PSW computes

Chapter 6 + Appendix

Joint work Julian Shun

29

Synchronous vs Gauss-Seidel

• Bulk-Synchronous Parallel (Jacobi iterations)
  – Updates see neighbors' values from the previous iteration. [Most systems are synchronous]
• Asynchronous (Gauss-Seidel iterations)
  – Updates see the most recent values
  – GraphLab is asynchronous

30

PSW runs Gauss-Seidel

[Figure: Shards 1–4 hold the intervals Vertices 1..100, 101..700, 701..1000, 1001..10000; each shard stores the in-edges for its interval, sorted by source_id. To load the subgraph for vertices 101..700, all of their in-edges are loaded in memory from Shard 2, while the out-edge blocks are read in memory from the sliding windows over the other shards. Fresh values in this “window” come from the previous phase.]

31

Synchronous (Jacobi)

Initial labels:    1 2 5 3 4
After iteration 1: 1 1 2 3 3
After iteration 2: 1 1 1 2 3
After iteration 3: 1 1 1 1 2
After iteration 4: 1 1 1 1 1

Bulk-Synchronous requires on the order of graph-diameter many iterations to propagate the minimum label

Each vertex chooses minimum label of neighbor

32

PSW is Asynchronous (Gauss-Seidel)

Initial labels:    1 2 4 3 5
After iteration 1: 1 1 1 3 3
After iteration 2: 1 1 1 1 1

Gauss-Seidel expected iterations on a random schedule on a chain graph:

= (N − 1) / (e − 1) ≈ 60% of synchronous

Each vertex chooses minimum label of neighbor
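To make the two execution models concrete, here is a minimal, self-contained Java sketch of minimum-label propagation on the 5-vertex chain above. It is not GraphChi code; the chain, labels, and the left-to-right Gauss-Seidel sweep order are illustrative (the slide's asynchronous example uses a random schedule instead):

// Minimum-label propagation on a chain: Jacobi (synchronous) vs. Gauss-Seidel sweeps.
import java.util.Arrays;

public class LabelPropagationChain {
    // On a chain, vertex i is connected to i-1 and i+1.
    static int minOfNeighbors(int[] labels, int i) {
        int m = labels[i];
        if (i > 0) m = Math.min(m, labels[i - 1]);
        if (i < labels.length - 1) m = Math.min(m, labels[i + 1]);
        return m;
    }

    public static void main(String[] args) {
        int[] init = {1, 2, 5, 3, 4};

        // Jacobi: every update reads the labels of the previous iteration.
        int[] jacobi = init.clone();
        for (int iter = 1; iter <= 4; iter++) {
            int[] prev = jacobi.clone();
            for (int i = 0; i < jacobi.length; i++) jacobi[i] = minOfNeighbors(prev, i);
            System.out.println("Jacobi iter " + iter + ": " + Arrays.toString(jacobi));
        }

        // Gauss-Seidel: updates read the freshest values, in place (left-to-right sweep).
        int[] gs = init.clone();
        for (int iter = 1; iter <= 2; iter++) {
            for (int i = 0; i < gs.length; i++) gs[i] = minOfNeighbors(gs, i);
            System.out.println("Gauss-Seidel iter " + iter + ": " + Arrays.toString(gs));
        }
    }
}

Running it reproduces the synchronous rows above (4 iterations to converge), while the in-place sweep converges in far fewer passes.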

33

Open theoretical question

Label Propagation – Side length = 100

Synchronous vs. PSW Gauss-Seidel (average, random schedule), iterations:
  199  vs.  ~57
  100  vs.  ~29
  298  vs.  ~52
Natural graphs (web, social): graph diameter − 1  vs.  ~0.6 × diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

• PSW is a new technique for implementing many fundamental graph algorithms
  – Especially simple (compared to previous work) for directed graph problems: PSW handles both in- and out-edges
• We propose a new graph contraction algorithm based on PSW
  – Minimum Spanning Forest & Connected Components
• … utilizing the Gauss-Seidel “acceleration”

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluation:
• HD vs. SSD
• Striping data across multiple hard drives
• Comparison to an in-memory version
• Bottlenecks analysis
• Effect of the number of shards
• Block size and performance

37

GraphChi

• C++ implementation: 8,000 lines of code
  – Java implementation also available
• Several optimizations to PSW (see paper)

Source code and examples: http://github.com/graphchi

38

Experiment Setting

• Mac Mini (Apple Inc.)
  – 8 GB RAM
  – 256 GB SSD, 1 TB hard drive
  – Intel Core i5, 2.5 GHz
• Experiment graphs:

Graph        | Vertices | Edges | P (shards) | Preprocessing
live-journal | 4.8M     | 69M   | 3          | 0.5 min
netflix      | 0.5M     | 99M   | 20         | 1 min
twitter-2010 | 42M      | 1.5B  | 20         | 2 min
uk-2007-05   | 106M     | 3.7B  | 40         | 31 min
uk-union     | 133M     | 5.4B  | 50         | 33 min
yahoo-web    | 1.4B     | 6.6B  | 50         | 37 min

39

Comparison to Existing Systems

Notes: comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all but GraphLab compute synchronously.

See the paper for more comparisons.

[Bar charts (runtime in minutes) comparing GraphChi (Mac Mini) against existing systems:
• PageRank – Twitter-2010 (1.5B edges): Spark (50 machines) vs. GraphChi
• WebGraph Belief Propagation (U. Kang et al.) – Yahoo-web (6.7B edges): Pegasus / Hadoop (100 machines) vs. GraphChi
• Matrix Factorization (Alt. Least Sqr.) – Netflix (0.99B edges): GraphLab v1 (8 cores) vs. GraphChi
• Triangle Counting – twitter-2010 (1.5B edges): Hadoop (1636 machines) vs. GraphChi]

On a Mac Mini, GraphChi can solve as big problems as existing large-scale systems. Comparable performance.

40

PowerGraph Comparison
• PowerGraph / GraphLab 2 outperforms previous systems by a wide margin on natural graphs
• With 64× more machines, 512× more CPUs:
  – Pagerank: 40x faster than GraphChi
  – Triangle counting: 30x faster than GraphChi

OSDI'12

GraphChi has good performance per CPU.

41

In-memory Comparison

• Total runtime comparison to 1-shard GraphChi, with initial load + output write taken into account
  – GraphChi (Mac Mini – SSD): 790 secs
  – Ligra (J. Shun, Blelloch), 40-core Intel E7-8870: 15 secs
• Comparison to state-of-the-art: 5 iterations of Pagerank, Twitter (1.5B edges)
• However, sometimes a better algorithm is available for in-memory than for external-memory / distributed:
  – Ligra (J. Shun, Blelloch), 8-core Xeon 5550: 230 s + preproc. 144 s
  – PSW – in-mem version, 700 shards (see Appendix), 8-core Xeon 5550: 100 s + preproc. 210 s

42

Scalability Input Size [SSD]

• Throughput: number of edges processed / second

Conclusion: the throughput remains roughly constant when the graph size is increased.

GraphChi with a hard drive is ~2x slower than with an SSD (if the computational cost is low).

[Chart: PageRank throughput (edges/sec) on Mac Mini SSD vs. graph size (up to ~6.6B edges), for the graphs domain, twitter-2010, uk-2007-05, uk-union, yahoo-web.]

Paper: scalability of other applications.

43

GRAPHCHI-DB – New work

44

GraphChi (Parallel Sliding Windows)
GraphChi-DB (Partitioned Adjacency Lists)
Online graph updates
Batch comp.
Evolving graph
Incremental comp.
DrunkardMob: Parallel Random walk simulation
GraphChi^2
Triangle Counting, Item-Item Similarity
PageRank, SALSA, HITS
Graph Contraction, Minimum Spanning Forest
k-Core
Graph Computation
Shortest Path
Friends-of-Friends
Induced Subgraphs
Neighborhood query
Edge and vertex properties
Link prediction
Weakly Connected Components, Strongly Connected Components
Community Detection, Label Propagation
Multi-BFS
Graph sampling
Graph Queries
Matrix Factorization
Loopy Belief Propagation, Co-EM
Graph traversal

45

Research Questions

• What if there is a lot of metadata associated with edges and vertices?
• How to do graph queries efficiently while retaining computational capabilities?
• How to add edges efficiently to the graph?

Can we design a graph database based on GraphChi?

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problems:
• Poor performance with data >> memory
• No/weak support for analytical computation

2) Relational / key-value databases as graph storage

Problems:
• Large indices
• In-edge / out-edge dilemma
• No/weak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

[Figure: Shards 1, 2, 3, …, P partition the destination-vertex ID range (0..100, 101..700, …, N); within each shard, edges are sorted by source_id.]

Edge = (src, dst). Partition by dst.

49

Shard Structure (Basic)

Source | Destination
1      | 8
1      | 193
1      | 76420
3      | 12
3      | 872
7      | 193
7      | 212
7      | 89139
…      | …

50

Shard Structure (Basic)

Pointer-array (Source → File offset): 1 → 0, 3 → 3, 7 → 5, …
Edge-array (Destination): 8, 193, 76420, 12, 872, 193, 212, 89139, …

Compressed Sparse Row (CSR)

Problem 1: How to find the in-edges of a vertex quickly?
Note: we know the shard, but its edges are in random order with respect to the destination (they are sorted by source).
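A minimal sketch of this CSR-style shard layout in Java (arrays in memory here for clarity; on disk the same lookup is an offset seek; class and field names are illustrative, not the actual GraphChi-DB code):

// Compressed Sparse Row style shard: pointer-array gives, for each source present
// in this shard, the offset of its first edge in the edge-array; a source's edges
// end where the next source's edges begin.
public class CsrShard {
    int[] pointerSrc    = {1, 3, 7};            // sorted source ids present in this shard
    int[] pointerOffset = {0, 3, 5};            // offsets into dst[]
    int[] dst = {8, 193, 76420, 12, 872, 193, 212, 89139};

    /** Destinations of the edges whose source is 'src' (within this shard). */
    int[] outEdgesInShard(int src) {
        int k = java.util.Arrays.binarySearch(pointerSrc, src);
        if (k < 0) return new int[0];                       // source has no edges here
        int start = pointerOffset[k];
        int end = (k + 1 < pointerOffset.length) ? pointerOffset[k + 1] : dst.length;
        return java.util.Arrays.copyOfRange(dst, start, end);
    }

    public static void main(String[] args) {
        // Source 3 -> [12, 872], matching the table above.
        System.out.println(java.util.Arrays.toString(new CsrShard().outEdgesInShard(3)));
    }
}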

51

PAL In-edge Linkage

Pointer-array (Source → File offset): 1 → 0, 3 → 3, 7 → 5, …
Edge-array (Destination): 8, 193, 76420, 12, 872, 193, 212, 89139, …

52

PAL In-edge Linkage

Edge-array (Destination, Link): (8, 3339), (193, 3), (76420, 1092), (12, 289), (872, 40), (193, 2002), (212, 12), (89139, 22), …
Pointer-array (Source → File offset): 1 → 0, 3 → 3, 7 → 5, …
+ Index to the first in-edge for each vertex in the interval

Augmented linked list for in-edges.

Problem 2: How to find out-edges quickly?
Note: sorted inside a shard, but partitioned across all shards.
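A hedged sketch of how the augmented linked list answers an in-edge query: the in-memory index gives the position of a vertex's first in-edge in this shard, and each edge's link field points to the next edge with the same destination. The names, the end-of-chain sentinel, and the per-edge src array (kept here only to make the example self-contained; PAL recovers sources from the pointer-array) are assumptions, not the actual implementation:

import java.util.ArrayList;
import java.util.List;

public class InEdgeChain {
    static final int END = -1;                 // assumed end-of-chain marker (illustrative)

    int[] dst  = {8, 193, 76420, 12, 872, 193, 212, 89139};
    int[] src  = {1, 1,   1,     3,  3,   7,   7,   7};
    int[] link = {END, 5, END, END, END, END, END, END};   // edge 1 (1->193) chains to edge 5 (7->193)

    /** Sources of all in-edges of one vertex in this shard, given its first in-edge position. */
    List<Integer> inEdgeSources(int firstEdgePos) {
        List<Integer> sources = new ArrayList<>();
        for (int pos = firstEdgePos; pos != END; pos = link[pos]) {
            sources.add(src[pos]);             // follow the chain of edges sharing one destination
        }
        return sources;
    }

    public static void main(String[] args) {
        // In-edges of vertex 193 start at edge position 1 (from the in-edge index) -> [1, 7].
        System.out.println(new InEdgeChain().inEdgeSources(1));
    }
}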

53

PAL Out-edge Queries

Pointer-array (Source → File offset): 1 → 0, 3 → 3, 7 → 5, …
Problem: it can be big – O(V). Binary search on disk is slow.
Edge-array (Destination, Next-in-offset): (8, 3339), (193, 3), (76420, 1092), (12, 289), (872, 40), (193, 2002), (212, 12), (89139, 22), …

Option 1: Sparse index (in-memory), e.g. entries 256, 512, 1024, …
Option 2: Delta-coding with unary code (Elias-Gamma). Completely in-memory.
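Option 2 works by gamma-coding the gaps of the pointer-array so that it stays fully in memory. The following is a generic Elias-Gamma encoder/decoder sketch in Java (standard textbook coding of positive integers – a unary length prefix followed by the binary value – not GraphChi-DB's exact bit layout; the example gaps are hypothetical):

import java.util.ArrayList;
import java.util.List;

public class EliasGamma {
    static void encode(int value, List<Boolean> bits) {          // value >= 1
        int nBits = 32 - Integer.numberOfLeadingZeros(value);    // bits needed for 'value'
        for (int i = 0; i < nBits - 1; i++) bits.add(false);      // unary prefix: nBits-1 zeros
        for (int i = nBits - 1; i >= 0; i--)                      // then the value itself, MSB first
            bits.add(((value >> i) & 1) == 1);
    }

    static int decode(List<Boolean> bits, int[] pos) {
        int zeros = 0;
        while (!bits.get(pos[0])) { zeros++; pos[0]++; }          // count the unary prefix
        int value = 0;
        for (int i = 0; i <= zeros; i++) {                        // read zeros+1 payload bits
            value = (value << 1) | (bits.get(pos[0]) ? 1 : 0);
            pos[0]++;
        }
        return value;
    }

    public static void main(String[] args) {
        List<Boolean> bits = new ArrayList<>();
        int[] gaps = {1, 3, 7, 2};                                // e.g. gaps between offsets
        for (int g : gaps) encode(g, bits);
        int[] pos = {0};
        for (int i = 0; i < gaps.length; i++) System.out.print(decode(bits, pos) + " ");  // 1 3 7 2
    }
}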

54

Experiment Indices

Median latency, Twitter graph (1.5B edges)

55

Queries: I/O costs
In-edge query: only one shard.
Out-edge query: each shard that has edges.

Trade-off: more shards → better locality for in-edge queries, worse for out-edge queries.

Edge Data & Searches

[Figure: Shard X stores the adjacency structure plus separate columns of edge data, e.g. 'weight' [float], 'timestamp' [long], 'belief' (factor); edge (i) in the adjacency file corresponds to record (i) in each column.]

No foreign key required to find edge data (fixed size).
Note: vertex values are stored similarly.
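Because each column stores fixed-size records, the attribute of edge i lives at a directly computable file offset. A tiny sketch of that arithmetic (names are illustrative):

import java.io.IOException;
import java.io.RandomAccessFile;

public class EdgeColumn {
    /** Reads the float 'weight' of edge number edgeIndex from a column file. */
    static float readWeight(RandomAccessFile weightColumn, long edgeIndex) throws IOException {
        weightColumn.seek(edgeIndex * Float.BYTES);   // record i starts at i * recordSize
        return weightColumn.readFloat();
    }
}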

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading the existing shard from disk. Each edge will always be rewritten on a merge.

Does not scale: the number of rewrites is data size / buffer size = O(E).

60

Log-Structured Merge-tree (LSM)
Ref: O'Neil, Cheng et al. (1996)

[Figure: new edges go to LEVEL 1 (youngest) – top shards with in-memory buffers, one per interval (interval 1, interval 2, …, interval P−1, interval P). Downstream merges push edges to on-disk shards on LEVEL 2 (e.g. intervals 1–4, …, (P−4)–P) and LEVEL 3 (oldest, e.g. intervals 1–16).]

In-edge query: one shard on each level.
Out-edge query: all shards.
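A heavily simplified Java sketch of the LSM idea used here: new edges land in a small in-memory buffer; when a level grows past its capacity, its edges are merged downstream into the next, larger level, which is kept sorted by destination like the on-disk shards. Sizes, types, and names are made up for illustration, not the GraphChi-DB shard format:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class EdgeLsm {
    static class Edge { final long src, dst; Edge(long s, long d) { src = s; dst = d; } }

    static final int LEVELS = 3;
    static final int[] CAPACITY = {4, 16, Integer.MAX_VALUE};   // toy sizes
    final List<List<Edge>> levels = new ArrayList<>();

    EdgeLsm() { for (int i = 0; i < LEVELS; i++) levels.add(new ArrayList<>()); }

    void addEdge(long src, long dst) {
        levels.get(0).add(new Edge(src, dst));                   // new edges go to the youngest level
        for (int k = 0; k < LEVELS - 1 && levels.get(k).size() > CAPACITY[k]; k++) {
            levels.get(k + 1).addAll(levels.get(k));             // downstream merge
            levels.get(k + 1).sort(Comparator.comparingLong((Edge e) -> e.dst));
            levels.get(k).clear();
        }
    }

    /** In-edge query touches one (sorted) collection per level. */
    List<Edge> inEdges(long vertex) {
        List<Edge> result = new ArrayList<>();
        for (List<Edge> level : levels)
            for (Edge e : level) if (e.dst == vertex) result.add(e);
        return result;
    }
}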

61

Experiment Ingest

[Chart: edges ingested (up to 1.5 × 10^9) vs. time in seconds (0–16,000), for GraphChi-DB with LSM and GraphChi-No-LSM.]

62

Advantages of PAL

• Only sparse and implicit indices
  – Pointer-array usually fits in RAM with Elias-Gamma → small database size
• Columnar data model
  – Load only the data you need
  – Graph structure is separate from data
  – Property graph model
• Great insertion throughput with LSM
  – Tree can be adjusted to match the workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

• Written in Scala
• Queries & Computation
• Online database

All experiments shown in this talk were done on a Mac Mini (8 GB RAM, SSD).

Source code and examples: http://github.com/graphchi

65

Comparison Database Size

Baseline: 4 + 4 bytes / edge

[Bar chart: database file size (twitter-2010 graph, 1.5B edges) for MySQL (data + indices), Neo4j, GraphChi-DB, and the BASELINE; axis 0–70.]

66

Comparison Ingest

System               | Time to ingest 1.5B edges
GraphChi-DB (ONLINE) | 1 hour 45 mins
Neo4j (batch)        | 45 hours
MySQL (batch)        | 3 hours 30 minutes (including index creation)

If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes.

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

[Bar charts: latency percentiles over 100K random friends-of-friends queries.
Small graph (68M edges), 50th percentile (ms): GraphChi-DB vs. Neo4j – 0.379 vs. 0.127.
Small graph (68M edges), 99th percentile (ms): 6.653 vs. 8.078.
Big graph (1.5B edges), 50th percentile: GraphChi-DB, Neo4j, MySQL – values as labeled: 224, 7598, 59 (axis 0–200 ms).
Big graph (1.5B edges), 99th percentile (ms): 1264, 1631, 4776 (axis 0–6000).]

68

LinkBench Online Graph DB Benchmark by Facebook

• Concurrent read/write workload
  – But only single-hop queries (“friends”)
  – 8 different operations, mixed workload
  – Best performance with 64 parallel threads
• Each edge and vertex has:
  – Version, timestamp, type, random string payload

                          GraphChi-DB (Mac Mini) | MySQL + FB patch, server, SSD-array, 144 GB RAM
Edge-update (95p)         22 ms                  | 25 ms
Edge-get-neighbors (95p)  18 ms                  | 9 ms
Avg throughput            2,487 req/s            | 11,029 req/s
Database size             350 GB                 | 1.4 TB

See full results in the thesis

69

Summary of Experiments

• Efficient for mixed read/write workload
  – See Facebook LinkBench experiments in the thesis
  – LSM-tree trade-off: read performance (but can adjust)
• State-of-the-art performance for graphs that are much larger than RAM
  – Neo4J's linked-list data structure is good for RAM
  – DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater Impact / Hindsight

Future Research Questions

71

GREATER IMPACT

72

Impact: “Big Data” Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)
  – Two major direct descendant papers in top conferences:
    • X-Stream: SOSP 2013
    • TurboGraph: KDD 2013
• Challenging the mainstream
  – You can do a lot on just a PC: focus on the right data structures, computational models

73

Impact Users

• GraphChi's implementations (C++, Java, -DB) have gained a lot of users
  – Currently ~50 unique visitors / day
• Enables 'everyone' to tackle big graph problems
  – Especially the recommender toolkit (by Danny Bickson) has been very popular
  – Typical users: students, non-systems researchers, small companies…

74

Impact Users (cont)

“I work in a [EU country] public university. I can't use a distributed computing cluster for my research – it is too expensive.

Using GraphChi, I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research.”

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models
1. Changing the value of an edge (both in- and out-)
2. Computation: process the whole or most of the graph on each iteration
3. Random access to all of a vertex's edges
   – Vertex-centric vs. edge-centric
4. Async / Gauss-Seidel execution

[Figure: an edge u – v.]

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
  – Or dynamic scheduling (such as Splash BP)
• High-diameter graphs, such as planar graphs
  – Unless the computation itself has short-range interactions
• Very large number of iterations
  – Neural networks
  – LDA with Collapsed Gibbs sampling
• No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)
GraphChi-DB (Partitioned Adjacency Lists)
Online graph updates
Batch comp.
Evolving graph
Incremental comp.
DrunkardMob: Parallel Random walk simulation
GraphChi^2
Triangle Counting, Item-Item Similarity
PageRank, SALSA, HITS
Graph Contraction, Minimum Spanning Forest
k-Core
Graph Computation
Shortest Path
Friends-of-Friends
Induced Subgraphs
Neighborhood query
Edge and vertex properties
Link prediction
Weakly Connected Components, Strongly Connected Components
Community Detection, Label Propagation
Multi-BFS
Graph sampling
Graph Queries
Matrix Factorization
Loopy Belief Propagation, Co-EM
Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

• Distributed Setting
  1. Distributed PSW (one shard / node)
     1. PSW is inherently sequential
     2. Low bandwidth in the Cloud
  2. Co-operating GraphChi(-DB)'s connected with a Parameter Server
• New graph programming models and tools
  – Vertex-centric programming is sometimes too local: for example, two-hop interactions and many traversals are cumbersome
  – Abstractions for learning graph structure; implicit graphs
  – Hard to debug, especially async. Better tools needed
• Graph-aware optimizations to GraphChi-DB
  – Buffer management
  – Smart caching
  – Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM
  – Good: works with very little memory
  – Bad: does not benefit from more memory
• Simple trick: cache some data
• Many graph algorithms have O(V) state
  – The update function accesses neighbor vertex state
    • Standard PSW: 'broadcast' vertex value via edges
    • Semi-external: store vertex values in memory – a minimal sketch follows below
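A sketch of the semi-external trick: keep one O(V) float array in RAM and let the disk-based PSW update function read and write neighbor values through it, instead of broadcasting them over edges. The update-function shape and names are illustrative, not GraphChi's actual API:

// Semi-external setting: the O(V) algorithm state (one float per vertex) lives in RAM,
// while PSW streams the edges from disk.
public class SemiExternalState {
    final float[] vertexValue;                      // fits in RAM even when the graph does not

    SemiExternalState(int numVertices) {
        vertexValue = new float[numVertices];
        java.util.Arrays.fill(vertexValue, 1.0f);
    }

    /** Called for each vertex while its edges are streamed from disk by PSW. */
    void update(int vertexId, int[] inNeighbors) {
        float sum = 0;
        for (int nb : inNeighbors) sum += vertexValue[nb];   // read neighbor state directly from RAM
        vertexValue[vertexId] = 0.15f + 0.85f * sum;         // write own state to RAM
        // (Degree normalization omitted -- the point is only where the state is stored:
        //  no 'broadcast' of vertex values over edges is needed.)
    }
}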

82

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
  – But not enough to store the whole graph

[Figure: the graph working memory (PSW) stays on disk, while the computational state (Computation 1 state, Computation 2 state, …) is kept in RAM.]

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
  – Store billions of random walk states in RAM
• Multiple Breadth-First Searches
  – Analyze neighborhood sizes by starting hundreds of random BFSes
• Compute in parallel many different recommender algorithms (or with different parameterizations)
  – See Mayank Mohta & Shu-Hao Yu's Master's project

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin) – UAI 2010
Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (-- same --) – VLDB 2012
GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch) – OSDI 2012
DrunkardMob: Billions of Random Walks on Just a PC – ACM RecSys 2013
Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch) – SEA 2014
GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin) – (submitted, arXiv)
Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin) – ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external memory graph computation → GraphChi
• Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database → GraphChi-DB
• Proposed DrunkardMob for simulating billions of random walks in parallel
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms → a new approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                    GraphChi (40 Mac Minis) | PowerGraph (64 EC2 cc1.4xlarge)
Investments         67,320 $                | –
Operating costs:
  Per node / hour   0.03 $                  | 1.30 $
  Cluster / hour    1.19 $                  | 52.00 $
  Daily             28.56 $                 | 1,248.00 $

It takes about 56 days to recoup the Mac Mini investments.

Assumptions:
– Mac Mini: 85 W (typical servers: 500–1000 W)
– Most expensive US energy: 35c / kWh

Equal-throughput configurations (based on OSDI'12)
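A quick sanity check of the “about 56 days” figure from the table above (a worked arithmetic sketch; it assumes both setups run 24 hours / day):

\frac{\text{Mac Mini investment}}{\text{daily operating-cost savings}}
  = \frac{67{,}320\,\$}{1{,}248.00\,\$ - 28.56\,\$}
  = \frac{67{,}320}{1{,}219.44} \approx 55.2 \text{ days} \approx 56 \text{ days}.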

89

PSW for In-memory Computation

• External memory setting
  – Slow memory = hard disk / SSD
  – Fast memory = RAM
• In-memory
  – Slow = RAM
  – Fast = CPU caches

[Figure: external-memory model – CPU, small (fast) memory of size M, large (slow) memory of 'unlimited' size, block transfers of size B.]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

Ligra: Julian Shun, Guy Blelloch – PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
  – But needs twice the amount of space
• Async / G-S helps with high-diameter graphs
• Some algorithms converge much better asynchronously
  – Loopy BP: see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)
• Asynchronous is sometimes difficult to reason about and debug
• Asynchronous can be used to implement BSP

IO Complexity

• See the paper for theoretical analysis in the Aggarwal–Vitter I/O model
  – Worst case is only 2x the best case
• Intuition:

[Figure: vertices 1..|V| are split into interval(1), interval(2), …, interval(P), with corresponding shard(1), shard(2), …, shard(P); example edges v1, v2.]

An edge inside one interval is loaded from disk only once / iteration.
An edge spanning two intervals is loaded twice / iteration.
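The intuition can be compressed into a back-of-the-envelope bound. This is my hedged paraphrase of the analysis (E = number of edges, B = block size, P = number of shards), not a quotation from the paper:

\frac{2E}{B} \;\le\; \text{block transfers per iteration} \;\le\; \frac{4E}{B},
\qquad \text{plus } \Theta(P^2) \text{ non-sequential seeks,}

since an edge inside one interval is read and written once per iteration, while an edge spanning two intervals is read and written twice.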

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise they would require too many iterations
• Per-iteration cost is not much affected
• Can we optimize partitioning?
  – Could help thanks to Gauss-Seidel (faster convergence inside “groups”); topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression Would it help

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3–4 bits / edge (web), ~10 bits / edge (social)
  – But they require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)
• Compression is problematic for evolving graphs and associated data
• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases
  – GOOD, GraphDB, HyperGraphDB…
  – Focus was on modeling; graph storage on top of a relational DB or key-value store
• RDF databases
  – Most do not use graph storage but store triples as relations + use indexing
• Modern solutions have proposed graph-specific storage
  – Neo4j: doubly linked list
  – TurboGraph: adjacency list chopped into pages
  – DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time: 9 hours; FB's: 12 hours
• GraphChi database: about 250 GB; FB: > 1.4 terabytes
  – However, about 100 GB is explained by different variable data (payload) size
• Facebook: MySQL via JDBC; GraphChi: embedded
  – But MySQL is native code, GraphChi-DB is Scala (JVM)
• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

[Chart: LinkBench requests / sec (0–10,000) vs. number of nodes (1.00E+07 to 1.00E+09).]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory-extension [SSDAlloc, NSDI'11] → Too many small objects; need millions / sec
2. Compress the graph structure to fit into RAM [WebGraph framework] → Associated values do not compress well and are mutated
3. Cluster the graph and handle each cluster separately in RAM → Expensive: the number of inter-cluster edges is big
4. Caching of hot nodes → Unpredictable performance

Number of Shards

• If P is in the “dozens”, there is not much effect on performance.

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks → parallel I/O

[Bar chart: seconds / iteration (0–3000) for Pagerank and Connected components, with 1, 2, and 3 hard drives.]

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

[Bar chart: runtime (0–2500 s) with 1, 2, and 4 threads, split into Disk IO, Graph construction, and Exec. updates; Connected Components on Mac Mini, SSD.]

• The cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
  – Graph construction requires a lot of random access in RAM → memory bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD I/O with 2 threads

[Bar charts: runtime in seconds vs. number of threads (1, 2, 4), split into Loading and Computation, for Matrix Factorization (ALS) (0–160 s) and Connected Components (0–1000 s).]

104

In-memory vs Disk

105

Experiment Query latency

See the thesis for an I/O cost analysis of in/out queries.

106

Example Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S
• Very fast in GraphChi-DB
  – Sufficient to query for out-edges
  – Parallelizes well: multi-out-edge-query
• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods from the graph

Vertices / Nodes
• Vertices are partitioned similarly as edges
  – Similar “data shards” for columns
• Lookup / update of vertex data is O(1)
• No merge tree here. Vertex files are “dense”
  – A sparse structure could be supported

[Figure: vertex ID ranges 1..a, a+1..2a, …, up to Max-id.]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance shards
  – Interval length is a constant a

[Figure: example mapping striping original IDs 0, 1, 2, …, 255 and 256, 257, 258, …, 511 across the internal ID rows 0, 1, 2, …; a, a+1, a+2, …; 2a, 2a+1, …]

109

What if we have a Cluster

[Figure: the graph working memory (PSW) stays on disk; the computational state (Computation 1 state, Computation 2 state, …) is kept in RAM.]

Trade latency for throughput

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications
2. … which is caused by a lack of good data available to academics: big graphs with metadata
   – Industry co-operation? But a problem with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just structure)
3. Too much focus on performance. More important to enable “extracting value”.

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java 2009

Distributed graph systems: each hop across a partition boundary is costly.
Disk-based “single-machine” graph systems (this talk): “paging” from disk is costly.

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm – reverse thinking:

forEach interval p:
    walkSnapshot = getWalksForInterval(p)
    forEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        forEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: only the current position of each walk needs to be stored.

Page 29: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

29

Synchronous vs. Gauss-Seidel

• Bulk-Synchronous Parallel (Jacobi iterations) – updates see neighbors' values from the previous iteration. [Most systems are synchronous.]

• Asynchronous (Gauss-Seidel iterations) – updates see the most recent values. GraphLab is asynchronous.

30

PSW runs Gauss-Seidel

[Figure: the four shards hold vertex intervals 1..100, 101..700, 701..1000, and 1001..10000. To load the subgraph for vertices 101..700, all of their in-edges (Shard 2, sorted by source_id) are loaded into memory, while out-edge blocks are read from a sliding window of each other shard. Fresh values in this "window" come from the previous phase.]

31

Synchronous (Jacobi)

Each vertex chooses the minimum label of its neighbors. Label values on a 5-vertex chain, iteration by iteration:

1 2 5 3 4
1 1 2 3 3
1 1 1 2 3
1 1 1 1 2
1 1 1 1 1

Bulk-Synchronous requires graph-diameter-many iterations to propagate the minimum label.

32

PSW is Asynchronous (Gauss-Seidel)

Each vertex chooses the minimum label of its neighbors. Label values on the chain, iteration by iteration:

1 2 4 3 5
1 1 1 3 3
1 1 1 1 1

Gauss-Seidel expected number of iterations on a random schedule on a chain graph:

= (N - 1) / (e - 1) ≈ 60% of synchronous
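The difference between the two schedules is easy to reproduce. Below is a small illustrative sketch (plain Python, not GraphChi code; the helper names are made up) that runs minimum-label propagation on the 5-vertex chain from the slides with a Jacobi schedule and with a left-to-right Gauss-Seidel sweep:

# Illustrative sketch (not GraphChi code): minimum-label propagation on a chain,
# comparing a Jacobi (bulk-synchronous) schedule with a Gauss-Seidel schedule.

def jacobi_min_label(labels, neighbors):
    """One bulk-synchronous iteration: every update reads the previous iteration's values."""
    old = list(labels)
    return [min([old[v]] + [old[u] for u in neighbors[v]]) for v in range(len(labels))]

def gauss_seidel_min_label(labels, neighbors, order):
    """One sweep in the given order: every update reads the most recent values."""
    cur = list(labels)
    for v in order:
        cur[v] = min([cur[v]] + [cur[u] for u in neighbors[v]])
    return cur

def iterations_until_fixed(step, labels):
    it = 0
    while True:
        nxt = step(labels)
        it += 1
        if nxt == labels:
            return it
        labels = nxt

# 5-vertex chain as on the slides
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
init = [1, 2, 5, 3, 4]

sync_iters = iterations_until_fixed(lambda l: jacobi_min_label(l, chain), init)
async_iters = iterations_until_fixed(
    lambda l: gauss_seidel_min_label(l, chain, order=range(5)), init)
print(sync_iters, async_iters)   # the Gauss-Seidel sweep needs far fewer passes than Jacobi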

33

Open theoretical question

Label Propagation (side length = 100) – iterations, Synchronous vs. PSW Gauss-Seidel (average over random schedules):

  199 vs. ~57
  100 vs. ~29
  298 vs. ~52

Natural graphs (web, social): graph diameter - 1 vs. ~0.6 x diameter.

Joint work: Julian Shun. Chapter 6.

34

PSW & External Memory Algorithms Research

• PSW is a new technique for implementing many fundamental graph algorithms – especially simple (compared to previous work) for directed graph problems: PSW handles both in- and out-edges.

• We propose a new graph contraction algorithm based on PSW – Minimum-Spanning Forest & Connected Components –

• … utilizing the Gauss-Seidel "acceleration".

Joint work: Julian Shun

SEA 2014; Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluation:
• HD vs. SSD
• Striping data across multiple hard drives
• Comparison to an in-memory version
• Bottlenecks analysis
• Effect of the number of shards
• Block size and performance

37

GraphChi

• C++ implementation: 8,000 lines of code – a Java implementation is also available.
• Several optimizations to PSW (see paper).

Source code and examples: http://github.com/graphchi

38

Experiment Setting

• Mac Mini (Apple Inc.): 8 GB RAM; 256 GB SSD and a 1 TB hard drive; Intel Core i5, 2.5 GHz.

• Experiment graphs:

  Graph         Vertices   Edges   P (shards)   Preprocessing
  live-journal  4.8M       69M     3            0.5 min
  netflix       0.5M       99M     20           1 min
  twitter-2010  42M        1.5B    20           2 min
  uk-2007-05    106M       3.7B    40           31 min
  uk-union      133M       5.4B    50           33 min
  yahoo-web     1.4B       6.6B    50           37 min

39

Comparison to Existing Systems

Notes: the comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all but GraphLab compute synchronously. See the paper for more comparisons.

[Charts: runtime in minutes, GraphChi on a Mac Mini vs. published numbers for large-scale systems:
  - PageRank, Twitter-2010 (1.5B edges): GraphChi (Mac Mini) vs. Spark (50 machines)
  - WebGraph Belief Propagation (U. Kang et al.), Yahoo-web (6.7B edges): GraphChi (Mac Mini) vs. Pegasus / Hadoop (100 machines)
  - Matrix Factorization (Alt. Least Sqr.), Netflix (0.99B edges): GraphChi (Mac Mini) vs. GraphLab v1 (8 cores)
  - Triangle Counting, twitter-2010 (1.5B edges): GraphChi (Mac Mini) vs. Hadoop (1,636 machines)]

On a Mac Mini, GraphChi can solve as big problems as existing large-scale systems, with comparable performance.

40

PowerGraph Comparison

• PowerGraph / GraphLab 2 outperforms previous systems by a wide margin on natural graphs. (OSDI '12)
• With 64x more machines, 512x more CPUs: Pagerank 40x faster than GraphChi; Triangle counting 30x faster than GraphChi.

GraphChi has good performance per CPU.

41

In-memory Comparison

• Comparison to state-of-the-art: 5 iterations of Pagerank on Twitter (1.5B edges).
• Total runtime comparison to 1-shard GraphChi, with initial load + output write taken into account:

GraphChi, Mac Mini – SSD: 790 secs
Ligra (J. Shun, Blelloch), 40-core Intel E7-8870: 15 secs

However, sometimes a better algorithm is available in-memory than for external-memory / distributed settings:

Ligra (J. Shun, Blelloch), 8-core Xeon 5550: 230 s + preproc 144 s
PSW – in-mem version, 700 shards (see Appendix), 8-core Xeon 5550: 100 s + preproc 210 s

42

Scalability Input Size [SSD]

• Throughput: number of edges processed / second.

Conclusion: the throughput remains roughly constant when the graph size is increased. GraphChi with a hard drive is ~2x slower than with an SSD (if the computational cost is low).

[Chart: PageRank throughput (edges/sec, up to ~2.5e7) on a Mac Mini with SSD, as a function of graph size (up to ~7e9 edges), for domain, twitter-2010, uk-2007-05, uk-union, and yahoo-web. The paper reports the scalability of other applications.]

43

GRAPHCHI-DB: New work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

• What if there is a lot of metadata associated with edges and vertices?

• How to do graph queries efficiently while retaining computational capabilities?

• How to add edges efficiently to the graph?

Can we design a graph database based on GraphChi?

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problems:
• Poor performance with data >> memory
• No/weak support for analytical computation

2) Relational / key-value databases used as graph storage

Problems:
• Large indices
• In-edge / out-edge dilemma
• No/weak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

[Figure: edges partitioned into Shards 1..P by destination-vertex interval (1..100, 101..700, ..., N); within each shard, edges are sorted by source_id.]

Edge = (src, dst). Partition by dst.

49

Shard Structure (Basic)

Source   Destination
1        8
1        193
1        76420
3        12
3        872
7        193
7        212
7        89139
…        …

50

Shard Structure (Basic)

Edge-array (destinations only): 8, 193, 76420, 12, 872, 193, 212, 89139, …

Pointer-array (source, file offset): (1, 0), (3, 3), (7, 5), …

Compressed Sparse Row (CSR)

Problem 1: How to find the in-edges of a vertex quickly? Note: we know the shard, but its edges are in random order of destination.
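As a concrete picture of this layout, here is a hedged sketch (plain Python; ShardCSR is an illustrative name, not GraphChi-DB's actual data structure) of a shard stored as a pointer-array plus edge-array, using the example values above:

# Illustrative sketch (not GraphChi-DB code): a shard stored as Compressed Sparse Row.
# The pointer-array maps each source vertex to the offset of its first edge in the
# edge-array; the out-edges of a source are the slice up to the next source's offset.
import bisect

class ShardCSR:
    def __init__(self, sources, offsets, destinations):
        # sources[i] has its out-edges (within this shard) stored at
        # destinations[offsets[i] : offsets[i + 1]]
        self.sources = sources            # e.g. [1, 3, 7]
        self.offsets = offsets + [len(destinations)]
        self.destinations = destinations  # e.g. [8, 193, 76420, 12, 872, 193, 212, 89139]

    def out_edges(self, src):
        i = bisect.bisect_left(self.sources, src)
        if i == len(self.sources) or self.sources[i] != src:
            return []                     # src has no out-edges in this shard
        return self.destinations[self.offsets[i]:self.offsets[i + 1]]

shard = ShardCSR([1, 3, 7], [0, 3, 5], [8, 193, 76420, 12, 872, 193, 212, 89139])
print(shard.out_edges(3))   # [12, 872]
print(shard.out_edges(7))   # [193, 212, 89139]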

51

PAL In-edge Linkage

Edge-array (destinations): 8, 193, 76420, 12, 872, 193, 212, 89139, …

Pointer-array (source, file offset): (1, 0), (3, 3), (7, 5), …

52

PAL In-edge Linkage

Edge-array with in-edge links (destination, link): (8, 3339), (193, 3), (76420, 1092), (12, 289), (872, 40), (193, 2002), (212, 12), (89139, 22), …

Pointer-array (source, file offset): (1, 0), (3, 3), (7, 5), …

+ Index to the first in-edge for each vertex in the interval.

Augmented linked list for in-edges.

Problem 2: How to find out-edges quickly? Note: edges are sorted inside a shard, but partitioned across all shards.

53

PAL Out-edge Queries

Pointer-array (source, file offset): (1, 0), (3, 3), (7, 5), …
Problem: it can be big – O(V) – and binary search on disk is slow.

Edge-array (destination, next-in-offset): (8, 3339), (193, 3), (76420, 1092), (12, 289), (872, 40), (193, 2002), (212, 12), (89139, 22), …

Option 1: Sparse index (in-memory), e.g. sampled entries at offsets 256, 512, 1024, …
Option 2: Delta-coding with a unary code (Elias-Gamma); completely in-memory.
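Option 2 keeps the pointer-array in memory by delta-coding the offsets with Elias-Gamma codes. A minimal sketch of that encoding (plain Python; hypothetical helper names, not the thesis implementation):

# Illustrative sketch (not GraphChi-DB code): Elias-Gamma coding of the gaps
# (deltas) of an increasing offset sequence, so the pointer index can be kept
# compactly in memory. encode_gamma / decode_gamma are hypothetical helper names.

def encode_gamma(n, out):
    """Append the Elias-Gamma code of a positive integer n to the bit list `out`."""
    bits = bin(n)[2:]                 # binary representation, e.g. 5 -> '101'
    out.extend([0] * (len(bits) - 1)) # unary length prefix: len-1 zeros
    out.extend(int(b) for b in bits)  # followed by the binary digits themselves

def decode_gamma(bits, pos):
    """Decode one Elias-Gamma value starting at `pos`; return (value, next_pos)."""
    zeros = 0
    while bits[pos] == 0:
        zeros += 1
        pos += 1
    value = 0
    for _ in range(zeros + 1):
        value = (value << 1) | bits[pos]
        pos += 1
    return value, pos

def compress_offsets(offsets):
    """Delta-code a strictly increasing sequence of file offsets (gaps must be >= 1)."""
    bits, prev = [], 0
    for off in offsets:
        encode_gamma(off - prev, bits)
        prev = off
    return bits

def decompress_offsets(bits, count):
    vals, prev, pos = [], 0, 0
    for _ in range(count):
        gap, pos = decode_gamma(bits, pos)
        prev += gap
        vals.append(prev)
    return vals

offsets = [3, 5, 6, 10, 25]           # toy pointer offsets
bits = compress_offsets(offsets)
assert decompress_offsets(bits, len(offsets)) == offsets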

54

Experiment Indices

Median latency, Twitter graph, 1.5B edges.

55

Queries: IO costs

In-edge query: only one shard.
Out-edge query: each shard that has edges.

Trade-off: more shards give better locality for in-edge queries, but worse for out-edge queries.

Edge Data & Searches

[Figure: Shard X's adjacency file plus separate edge-data columns, e.g. 'weight' [float], 'timestamp' [long], 'belief' (factor). Edge (i)'s data is found at the same index in each column.]

No foreign key is required to find edge data (records are fixed size).

Note: vertex values are stored similarly.

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading the existing shard from disk – each edge is rewritten on every merge.

Does not scale: number of rewrites ~ data size / buffer size = O(E).

60

Log-Structured Merge-tree (LSM)
Ref: O'Neil, Cheng et al. (1996)

[Figure: top shards with in-memory buffers form LEVEL 1 (youngest), one per interval 1..P; new edges enter there. Downstream merges push edges to LEVEL 2 shards covering intervals 1-4, ..., (P-4)-P, and to LEVEL 3 (oldest) shards covering intervals 1-16, ...]

In-edge query: one shard on each level.
Out-edge query: all shards.
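A toy sketch of the log-structured merging idea (plain Python; sizes and names are made up, and real shards of course live on disk and are partitioned by destination interval):

# Illustrative sketch (not GraphChi-DB code): log-structured merging of edge buffers.
# New edges go to an in-memory buffer; when it fills, it is written as a sorted run on
# level 1; when a level holds too many runs they are merged into one run on the next
# level, so each edge is rewritten only O(log E) times instead of O(E / buffer) times.

BUFFER_CAP = 4      # toy sizes; a real system buffers millions of edges
RUNS_PER_LEVEL = 2

class EdgeLSM:
    def __init__(self):
        self.buffer = []            # in-memory buffer of (src, dst) pairs
        self.levels = []            # levels[i] = list of sorted runs (lists of edges)

    def add_edge(self, src, dst):
        self.buffer.append((src, dst))
        if len(self.buffer) >= BUFFER_CAP:
            self._flush()

    def _flush(self):
        run = sorted(self.buffer)
        self.buffer = []
        level = 0
        while True:
            if len(self.levels) <= level:
                self.levels.append([])
            self.levels[level].append(run)
            if len(self.levels[level]) < RUNS_PER_LEVEL:
                return
            # merge all runs of this level into a single run for the next level
            run = sorted(e for r in self.levels[level] for e in r)
            self.levels[level] = []
            level += 1

    def out_edges(self, src):
        # an out-edge query has to consult the buffer and every run on every level
        hits = [d for s, d in self.buffer if s == src]
        for level in self.levels:
            for run in level:
                hits.extend(d for s, d in run if s == src)
        return hits

db = EdgeLSM()
for e in [(1, 8), (1, 193), (3, 12), (7, 193), (3, 872), (7, 212), (1, 76420), (7, 89139)]:
    db.add_edge(*e)
print(db.out_edges(7))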

61

Experiment Ingest

[Chart: number of edges ingested (0 to 1.5B) as a function of time (0 to 16,000 seconds), for GraphChi-DB with LSM vs. GraphChi-No-LSM.]

62

Advantages of PAL

• Only sparse and implicit indices – the pointer-array usually fits in RAM with Elias-Gamma coding. Small database size.

• Columnar data model – load only the data you need; the graph structure is separate from the data; property graph model.

• Great insertion throughput with LSM – the tree can be adjusted to match the workload.

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

• Written in Scala.
• Queries & computation.
• Online database.

All experiments shown in this talk were done on a Mac Mini (8 GB RAM, SSD).

Source code and examples: http://github.com/graphchi

65

Comparison Database Size

Baseline: 4 + 4 bytes per edge.

[Chart: database file size (twitter-2010 graph, 1.5B edges) on a 0-70 GB scale, for MySQL (data + indices), Neo4j, GraphChi-DB, and the BASELINE.]

66

Comparison Ingest

System                   Time to ingest 1.5B edges
GraphChi-DB (online)     1 hour 45 mins
Neo4j (batch)            45 hours
MySQL (batch)            3 hours 30 minutes (including index creation)

If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes.

Comparison Friends-of-Friends Query

See thesis for a shortest-path comparison.

Latency percentiles over 100K random friends-of-friends queries:

Small graph (68M edges), 50th percentile: GraphChi-DB 0.379 ms, Neo4j 0.127 ms.
Small graph (68M edges), 99th percentile: GraphChi-DB 6.653 ms, Neo4j 8.078 ms.
Big graph (1.5B edges), 50th percentile: GraphChi-DB, Neo4j, and MySQL compared on a 0-200 ms scale.
Big graph (1.5B edges), 99th percentile: GraphChi-DB 1,264 ms, Neo4j 1,631 ms, MySQL 4,776 ms.

68

LinkBench Online Graph DB Benchmark by Facebook

• Concurrent read/write workload – but only single-hop queries ("friends"); 8 different operations, mixed workload; best performance with 64 parallel threads.

• Each edge and vertex has: version, timestamp, type, random string payload.

                           GraphChi-DB (Mac Mini)   MySQL + FB patch (server: SSD array, 144 GB RAM)
Edge-update (95p)          22 ms                    25 ms
Edge-get-neighbors (95p)   18 ms                    9 ms
Avg throughput             2,487 req/s              11,029 req/s
Database size              350 GB                   1.4 TB

See full results in the thesis.

69

Summary of Experiments

• Efficient for a mixed read/write workload – see the Facebook LinkBench experiments in the thesis; the LSM-tree trades off read performance (but can be adjusted).

• State-of-the-art performance for graphs that are much larger than RAM – Neo4j's linked-list data structure is good for RAM; DEX performs poorly in practice.

More experiments in the thesis.

70

Discussion

Greater Impact / Hindsight

Future Research Questions

71

GREATER IMPACT

72

Impact: "Big Data" Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar) – two major direct-descendant papers in top conferences: X-Stream (SOSP 2013) and TurboGraph (KDD 2013).

• Challenging the mainstream: you can do a lot on just a PC – focus on the right data structures and computational models.

73

Impact Users

• GraphChi's implementations (C++, Java, -DB) have gained a lot of users – currently ~50 unique visitors / day.

• Enables 'everyone' to tackle big graph problems – especially the recommender toolkit (by Danny Bickson) has been very popular. Typical users: students, non-systems researchers, small companies…

74

Impact Users (cont)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research - it is too expensive.

Using GraphChi, I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research." (…)

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for?

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models.

1. Changing the value of an edge (both in- and out-edges).
2. Computation processes the whole graph, or most of it, on each iteration.
3. Random access to all of a vertex's edges – vertex-centric vs. edge-centric.
4. Async / Gauss-Seidel execution.

77

GraphChi: Not Good For

• Very large vertex state.
• Traversals and two-hop dependencies – or dynamic scheduling (such as Splash BP).
• High-diameter graphs, such as planar graphs – unless the computation itself has short-range interactions.
• A very large number of iterations – neural networks; LDA with collapsed Gibbs sampling.
• No support for implicit graph structure.

+ Single-PC performance is limited.

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

• Distributed setting:
  1. Distributed PSW (one shard / node)? But PSW is inherently sequential, and bandwidth in the Cloud is low.
  2. Co-operating GraphChi(-DB)'s connected with a Parameter Server.

• New graph programming models and tools – vertex-centric programming is sometimes too local: for example, two-hop interactions and many traversals are cumbersome. Abstractions for learning graph structure; implicit graphs. Hard to debug, especially async – better tools needed.

• Graph-aware optimizations to GraphChi-DB – buffer management, smart caching, learning the configuration.

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The IO performance of PSW is only weakly affected by the amount of RAM – good: it works with very little memory; bad: it does not benefit from more memory.

• Simple trick: cache some data.

• Many graph algorithms have O(V) state – the update function accesses neighbor vertex state:
  • Standard PSW: 'broadcast' the vertex value via edges.
  • Semi-external: store vertex values in memory.

82

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory – but not enough to store the whole graph.

[Figure: the graph and the PSW working memory stay on disk; the computational state (Computation 1 state, Computation 2 state, ...) is kept in RAM.]
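A hedged sketch of the semi-external pattern (plain Python, not GraphChi's API): the per-vertex state stays in an in-memory array while edges are only ever streamed in chunks, here for a simplified Pagerank:

# Illustrative sketch (not GraphChi code) of the semi-external idea: the O(V)
# computational state (one value per vertex) lives in RAM, while the edges are
# streamed from disk one chunk at a time. Names and the chunk source are hypothetical.

def pagerank_semi_external(num_vertices, edge_chunks, iterations=4, d=0.85):
    rank = [1.0 / num_vertices] * num_vertices        # O(V) state kept in memory
    out_degree = [0] * num_vertices
    for chunk in edge_chunks():                        # one streaming pass for degrees
        for src, dst in chunk:
            out_degree[src] += 1
    for _ in range(iterations):
        incoming = [0.0] * num_vertices
        for chunk in edge_chunks():                    # edges streamed, never all in RAM
            for src, dst in chunk:
                incoming[dst] += rank[src] / out_degree[src]
        rank = [(1 - d) / num_vertices + d * x for x in incoming]
    return rank

# Toy usage: a generator standing in for sequential reads of edge chunks from disk.
def edge_chunks():
    yield [(0, 1), (0, 2), (1, 2)]
    yield [(2, 0), (3, 2)]

print(pagerank_semi_external(4, edge_chunks))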

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5) – store billions of random-walk states in RAM.

• Multiple Breadth-First Searches – analyze neighborhood sizes by starting hundreds of random BFSes.

• Compute many different recommender algorithms (or different parameterizations) in parallel – see Mayank Mohta and Shu-Hao Yu's Master's project.

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin) – UAI 2010
Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors) – VLDB 2012
GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch) – OSDI 2012
DrunkardMob: Billions of Random Walks on Just a PC – ACM RecSys 2013
Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch) – SEA 2014
GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin) – submitted (arXiv)
Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin) – ICML 2011

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation → GraphChi.

• Extended PSW to design Partitioned Adjacency Lists, to build a scalable graph database → GraphChi-DB.

• Proposed DrunkardMob for simulating billions of random walks in parallel.

• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms → a new approach for external-memory graph algorithms research.

Thank You!

87

ADDITIONAL SLIDES

88

Economics

                   GraphChi (40 Mac Minis)   PowerGraph (64 EC2 cc1.4xlarge)
Investments        67,320 $                  -
Operating costs:
  Per node / hour  0.03 $                    1.30 $
  Cluster / hour   1.19 $                    52.00 $
  Daily            28.56 $                   1,248.00 $

It takes about 56 days to recoup the Mac Mini investments.

Assumptions: Mac Mini 85 W (typical servers 500-1000 W); most expensive US energy, 35c / kWh. Equal-throughput configurations (based on OSDI '12).

89

PSW for In-memory Computation

• External-memory setting: slow memory = hard disk / SSD; fast memory = RAM.

• In-memory: slow = RAM; fast = CPU caches.

[Figure: the external-memory model – a small (fast) memory of size M, a large (slow) memory of 'unlimited' size, and block transfers of size B driven by the CPU.]

Does PSW help in the in-memory setting?

90

PSW for in-memory

(Appendix) PSW slightly outperforms Ligra – the fastest in-memory graph computation system (Julian Shun, Guy Blelloch, PPoPP 2013) – on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel – but needs twice the amount of space.
• Async / Gauss-Seidel helps with high-diameter graphs.
• Some algorithms converge much better asynchronously – Loopy BP, see Gonzalez et al. (2009); also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989).
• Asynchronous execution is sometimes difficult to reason about and debug.
• Asynchronous can be used to implement BSP.

IO Complexity

• See the paper for a theoretical analysis in Aggarwal and Vitter's IO model – the worst case is only 2x the best case.

• Intuition:

[Figure: intervals interval(1)..interval(P) with shards shard(1)..shard(P) over the vertex range 1..|V|. An edge whose endpoints lie in the same interval is loaded from disk only once per iteration; an edge spanning two intervals is loaded twice per iteration.]

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter – otherwise too many iterations would be required.

• The per-iteration cost is not much affected.

• Can we optimize partitioning? It could help, thanks to Gauss-Seidel (faster convergence inside "groups") and topological sort – but it is likely too expensive to do on a single PC.

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., the WebGraph framework) can compress edges to 3-4 bits / edge (web), ~10 bits / edge (social) – but they require graph partitioning, which needs a lot of memory, and compression of large graphs can take days (personal communication).

• Compression is problematic for evolving graphs and associated data.

• GraphChi can be used to compress graphs – layered label propagation (Boldi et al. 2011).

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases – GOOD, GraphDB, HyperGraphDB, … The focus was on modeling; graph storage was built on top of a relational DB or a key-value store.

• RDF databases – most do not use graph storage but store triples as relations and use indexing.

• Modern solutions have proposed graph-specific storage: Neo4j – doubly linked lists; TurboGraph – adjacency lists chopped into pages; DEX – compressed bitmaps (details not clear).

96

LinkBench

Comparison to FB (cont)

• GraphChi load time: 9 hours; FB's: 12 hours.
• GraphChi database is about 250 GB; FB's is > 1.4 terabytes – however, about 100 GB of the difference is explained by different variable-data (payload) sizes.
• Facebook accesses MySQL via JDBC, GraphChi-DB is embedded – but MySQL is native code while GraphChi-DB is Scala (JVM).
• Important: a CPU-bound bottleneck in sorting the results for high-degree vertices.

LinkBench: GraphChi-DB performance vs. size of DB

[Chart: requests / sec (0 to 10,000) as a function of the number of nodes (1e7 to 1e9).]

Why not even faster when everything is in RAM? The requests are computationally bound.

Possible Solutions

1. Use SSD as a memory-extension [SSDAlloc, NSDI'11] – but there are too many small objects; millions of accesses per second would be needed.

2. Compress the graph structure to fit into RAM [WebGraph framework] – but the associated values do not compress well and are mutated.

3. Cluster the graph and handle each cluster separately in RAM – expensive: the number of inter-cluster edges is big.

4. Caching of hot nodes – unpredictable performance.

Number of Shards

• If P is in the "dozens", there is not much effect on performance.

Multiple hard drives (RAID-ish)

• GraphChi supports striping shards to multiple disks → parallel IO.

[Chart: seconds per iteration for Pagerank and Connected Components with 1, 2, and 3 hard drives; experiment on a 16-core AMD server (from year 2007).]

Bottlenecks

[Chart: Connected Components on Mac Mini (SSD) – runtime with 1, 2, and 4 threads, split into disk IO, graph construction, and executing updates.]

• The cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSD – graph construction requires a lot of random access in RAM; memory bandwidth becomes a bottleneck.

Bottlenecks: Multicore

Experiment on a MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution.
• GraphChi saturates the SSD IO with 2 threads.

[Charts: runtime (seconds) with 1, 2, and 4 threads, split into loading and computation, for Matrix Factorization (ALS) and Connected Components.]

104

In-memory vs Disk

105

Experiment Query latency

See the thesis for an IO cost analysis of in/out queries.

106

Example: Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S.

• Very fast in GraphChi-DB – it is sufficient to query for out-edges, and it parallelizes well (multi-out-edge query). See the sketch below.

• Can be used for statistical graph analysis – sample induced neighborhoods and induced friends-of-friends neighborhoods from the graph.
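A minimal sketch of that query strategy (plain Python; out_edge_query stands in for the database's out-edge query and is not the actual GraphChi-DB API):

# Illustrative sketch (not the GraphChi-DB API): an induced-subgraph query built from
# out-edge queries only, as described above.

def induced_subgraph(S, out_edge_query):
    """Return the edges of the subgraph induced by the vertex set S."""
    S = set(S)
    edges = []
    for src in S:                          # one out-edge query per vertex in S
        for dst in out_edge_query(src):
            if dst in S:                   # keep only edges with both endpoints in S
                edges.append((src, dst))
    return edges

# Toy usage with an in-memory adjacency map standing in for the database:
adjacency = {1: [8, 193], 3: [12, 872], 7: [193, 212], 8: [3], 12: [7]}
print(induced_subgraph({1, 3, 7, 8, 12}, lambda v: adjacency.get(v, [])))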

Vertices / Nodes

• Vertices are partitioned similarly to edges – similar "data shards" for columns.

• Lookup/update of vertex data is O(1).

• No merge tree here: vertex files are "dense" – a sparse structure could be supported.

[Figure: the internal vertex-id range 1..Max-id split into intervals of length a: 1..a, a+1..2a, ...]
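A hedged sketch of why the lookup is O(1) (plain Python; the file name and record format are made up): with dense, fixed-size records, the internal id alone determines the byte offset:

# Illustrative sketch (not GraphChi-DB code): O(1) vertex-value lookup in a dense,
# fixed-size-record vertex file. The value of the vertex with internal id v lives at
# byte offset v * RECORD_SIZE, so no index or merge tree is needed.
import struct

RECORD_SIZE = 4                      # e.g. one float value per vertex

def write_vertex(f, internal_id, value):
    f.seek(internal_id * RECORD_SIZE)
    f.write(struct.pack("f", value))

def read_vertex(f, internal_id):
    f.seek(internal_id * RECORD_SIZE)
    return struct.unpack("f", f.read(RECORD_SIZE))[0]

with open("vertex_values.bin", "w+b") as f:
    f.truncate(1000 * RECORD_SIZE)   # dense file for internal ids 0..999
    write_vertex(f, 42, 3.5)
    print(read_vertex(f, 42))        # 3.5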

ID-mapping

• Vertex IDs are mapped to internal IDs to balance the shards – the interval length is a constant a.

[Figure: blocks of original IDs (0, 1, 2, ..., 255; 256, 257, 258, ..., 511; ...) are assigned across the internal-id intervals (0, 1, 2, ...; a, a+1, a+2, ...; 2a, 2a+1, ...), spreading the vertices evenly over the shards.]

109

What if we have a Cluster?

[Figure: as in the semi-external setting – graph and PSW working memory on disk, several computation states in RAM – replicated on each machine of the cluster.]

Trade latency for throughput.

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications…

2. …which is caused by the lack of good data available to academics: big graphs with metadata. Industry co-operation helps, but there is a problem with reproducibility; also, it is hard to ask good questions about graphs (especially with just the structure).

3. Too much focus on performance. More important to enable "extracting value".

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

[Figure: Twitter network visualization by Akshay Java, 2009.]

Distributed graph systems: each hop across a partition boundary is costly.
Disk-based "single-machine" graph systems: "paging" from disk is costly.
(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm – reverse thinking:

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk.
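A small illustrative sketch of the DrunkardMob idea (plain Python; load_subgraph and intervals are hypothetical stand-ins for GraphChi's interval loading, not the actual implementation): only each walk's current vertex is stored, walks are bucketed by interval, and every walk in an interval advances one hop while that interval's subgraph is in memory:

# Illustrative sketch (not the DrunkardMob implementation): instead of walking one walk
# at a time over a disk-resident graph, keep only each walk's current vertex, bucket the
# walks by vertex interval, and advance every walk in an interval while that interval's
# subgraph is loaded (one random hop per walk per pass).
import random
from collections import defaultdict

def advance_walks(walk_positions, intervals, load_subgraph, num_passes):
    """walk_positions: list where walk_positions[w] is walk w's current vertex."""
    for _ in range(num_passes):
        # group walks by the interval of their current vertex (the 'walk snapshot')
        by_interval = defaultdict(list)
        for w, v in enumerate(walk_positions):
            by_interval[intervals(v)].append(w)
        for p, walks_here in by_interval.items():
            adjacency = load_subgraph(p)          # one sequential load per interval
            for w in walks_here:
                v = walk_positions[w]
                neighbors = adjacency.get(v)
                if neighbors:
                    walk_positions[w] = random.choice(neighbors)
    return walk_positions

# Toy usage: 2 intervals over vertices 0..3, adjacency kept in memory for illustration.
graph = {0: [1, 2], 1: [3], 2: [0, 3], 3: [1]}
positions = [0, 1, 2, 3, 0, 2]                    # six walks, one current position each
advance_walks(positions,
              intervals=lambda v: 0 if v < 2 else 1,
              load_subgraph=lambda p: {v: ns for v, ns in graph.items()
                                       if (v < 2) == (p == 0)},
              num_passes=3)
print(positions)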

Page 31: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

31

Synchronous (Jacobi)

Each vertex chooses the minimum label of its neighbors. Label states on a 5-vertex chain, one row per synchronous iteration:

1 2 5 3 4
1 1 2 3 3
1 1 1 2 3
1 1 1 1 2
1 1 1 1 1

Bulk-Synchronous requires about as many iterations as the graph diameter to propagate the minimum label.

32

PSW is Asynchronous (Gauss-Seidel)

Each vertex chooses the minimum label of its neighbors. With in-place (Gauss-Seidel) updates the same chain converges in far fewer sweeps:

1 2 4 3 5
1 1 1 3 3
1 1 1 1 1

Gauss-Seidel expected iterations on a random schedule on a chain graph:
(N − 1) / (e − 1) ≈ 60% of synchronous.
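To make the Jacobi vs. Gauss-Seidel difference concrete, here is a small Python sketch (my own illustration, not code from the thesis) that counts sweeps of min-label propagation on a chain; the Gauss-Seidel variant uses a fresh random schedule each sweep, as in the formula above.

```python
import random

# Minimal sketch (not from the thesis): min-label propagation on an N-vertex
# chain. Jacobi reads a frozen copy of the labels; Gauss-Seidel updates in
# place, visiting vertices in a fresh random order on every sweep.
def sweeps_until_converged(n, gauss_seidel, seed=0):
    rng = random.Random(seed)
    labels = list(range(1, n + 1))            # minimum label sits at vertex 0
    sweeps = 0
    while any(x != 1 for x in labels):
        src = labels if gauss_seidel else list(labels)   # Jacobi: frozen copy
        order = list(range(n))
        if gauss_seidel:
            rng.shuffle(order)                 # random schedule, as on the slide
        for v in order:
            lo = min(src[max(v - 1, 0):v + 2])  # self + chain neighbors
            labels[v] = min(labels[v], lo)
        sweeps += 1
    return sweeps

print(sweeps_until_converged(1000, gauss_seidel=False))  # ~N-1 = 999 sweeps
print(sweeps_until_converged(1000, gauss_seidel=True))   # ~(N-1)/(e-1), around 580
```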

33

Open theoretical question

Label Propagation, side length = 100.

Iterations:                      Synchronous          PSW: Gauss-Seidel (average, random schedule)
                                 199                  ~57
                                 100                  ~29
                                 298                  ~52

Natural graphs (web, social):    graph diameter − 1   ~0.6 × diameter

Joint work: Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

• PSW is a new technique for implementing many fundamental graph algorithms
  – Especially simple (compared to previous work) for directed graph problems: PSW handles both in- and out-edges
• We propose a new graph contraction algorithm based on PSW
  – Minimum Spanning Forest & Connected Components
• … utilizing the Gauss-Seidel "acceleration"

Joint work: Julian Shun

SEA 2014; Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluation:
• HD vs. SSD
• Striping data across multiple hard drives
• Comparison to an in-memory version
• Bottlenecks analysis
• Effect of the number of shards
• Block size and performance

37

GraphChi

• C++ implementation: 8,000 lines of code
  – Java implementation also available
• Several optimizations to PSW (see paper)

Source code and examples: http://github.com/graphchi

38

Experiment Setting

• Mac Mini (Apple Inc.)
  – 8 GB RAM
  – 256 GB SSD, 1 TB hard drive
  – Intel Core i5, 2.5 GHz
• Experiment graphs:

Graph         Vertices   Edges   P (shards)   Preprocessing
live-journal  4.8M       69M     3            0.5 min
netflix       0.5M       99M     20           1 min
twitter-2010  42M        1.5B    20           2 min
uk-2007-05    106M       3.7B    40           31 min
uk-union      133M       5.4B    50           33 min
yahoo-web     1.4B       6.6B    50           37 min

39

Comparison to Existing Systems

Notes: comparison results do not include time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all but GraphLab compute synchronously.

See the paper for more comparisons.

[Bar charts, runtime in minutes, GraphChi (Mac Mini) vs. existing systems:
• PageRank on Twitter-2010 (1.5B edges): GraphChi (Mac Mini) vs. Spark (50 machines)
• WebGraph Belief Propagation (U. Kang et al.) on Yahoo-web (6.7B edges): GraphChi (Mac Mini) vs. Pegasus / Hadoop (100 machines)
• Matrix Factorization (Alt. Least Sqr.) on Netflix (0.99B edges): GraphChi (Mac Mini) vs. GraphLab v1 (8 cores)
• Triangle Counting on twitter-2010 (1.5B edges): GraphChi (Mac Mini) vs. Hadoop (1636 machines)]

On a Mac Mini: GraphChi can solve as big problems as existing large-scale systems. Comparable performance.

40

PowerGraph Comparison

• PowerGraph / GraphLab 2 outperforms previous systems by a wide margin on natural graphs (OSDI'12)
• With 64x more machines, 512x more CPUs:
  – Pagerank: 40x faster than GraphChi
  – Triangle counting: 30x faster than GraphChi

GraphChi has good performance / CPU.

41

In-memory Comparison

• Comparison to state-of-the-art: 5 iterations of Pagerank on Twitter (1.5B edges)
• Total runtime comparison; for 1-shard GraphChi, initial load + output write taken into account

GraphChi, Mac Mini (SSD): 790 secs
Ligra (J. Shun, Blelloch), 40-core Intel E7-8870: 15 secs
Ligra (J. Shun, Blelloch), 8-core Xeon 5550: 230 s + preproc 144 s
PSW, in-memory version, 700 shards (see Appendix), 8-core Xeon 5550: 100 s + preproc 210 s

However, sometimes a better algorithm is available for in-memory computation than for external-memory / distributed computation.

42

Scalability Input Size [SSD]

• Throughput: number of edges processed / second

[Plot: PageRank throughput on Mac Mini (SSD); x-axis graph size 0–7x10^9 edges, y-axis throughput up to 2.5x10^7 edges/sec; graphs: domain, twitter-2010, uk-2007-05, uk-union, yahoo-web]

Conclusion: the throughput remains roughly constant when the graph size is increased.

GraphChi with a hard drive is ~2x slower than with an SSD (if the computational cost is low).

Paper: scalability of other applications.

43

GRAPHCHI-DB

New work

44

[Diagram: GraphChi (Parallel Sliding Windows) and GraphChi-DB (Partitioned Adjacency Lists); online graph updates, batch comp., evolving graph, incremental comp.; DrunkardMob: parallel random walk simulation; GraphChi^2. Applications span Graph Computation and Graph Queries: Triangle Counting, Item-Item Similarity, PageRank, SALSA, HITS, Graph Contraction, Minimum Spanning Forest, k-Core, Shortest Path, Friends-of-Friends, Induced Subgraphs, Neighborhood query, Edge and vertex properties, Link prediction, Weakly / Strongly Connected Components, Community Detection, Label Propagation, Multi-BFS, Graph sampling, Matrix Factorization, Loopy Belief Propagation, Co-EM, Graph traversal]

45

Research Questions

• What if there is a lot of metadata associated with edges and vertices?
• How to do graph queries efficiently while retaining computational capabilities?
• How to add edges efficiently to the graph?

Can we design a graph database based on GraphChi?

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problems:
• Poor performance with data >> memory
• No/weak support for analytical computation

2) Relational / key-value databases as graph storage

Problems:
• Large indices
• In-edge / out-edge dilemma
• No/weak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

[Diagram: Shards 1..P partition the destination-vertex ID range (0..100, 101..700, …, N); within each shard, edges are sorted by source_id]

Edge = (src, dst). Partition by dst.

49

Shard Structure (Basic)

Source | Destination (src, dst)
1      | 8
1      | 193
1      | 76420
3      | 12
3      | 872
7      | 193
7      | 212
7      | 89139
…      | …

50

Shard Structure (Basic)

Pointer-array (Source → file offset):
Source | File offset
1      | 0
3      | 3
7      | 5
…      | …

Edge-array (Destination): 8, 193, 76420, 12, 872, 193, 212, 89139, …

Compressed Sparse Row (CSR).

Problem 1: How to find the in-edges of a vertex quickly?
Note: we know the shard, but its edges are in random order with respect to destination.
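A hedged illustration of the basic CSR lookup inside one shard, using the example values above (`pointers` and `edge_dst` are in-memory stand-ins for the on-disk structures):

```python
import bisect

# Hedged sketch (not the GraphChi-DB code): out-edges of a source vertex
# inside ONE shard stored in CSR form. `pointers` is the pointer-array of
# (source, first_edge_offset) pairs sorted by source; `edge_dst` stands in
# for the on-disk edge-array of destination IDs.
pointers = [(1, 0), (3, 3), (7, 5)]
edge_dst = [8, 193, 76420, 12, 872, 193, 212, 89139]

def out_edges_in_shard(src):
    sources = [s for s, _ in pointers]
    i = bisect.bisect_left(sources, src)
    if i == len(sources) or sources[i] != src:
        return []                                  # no out-edges in this shard
    start = pointers[i][1]
    end = pointers[i + 1][1] if i + 1 < len(pointers) else len(edge_dst)
    return edge_dst[start:end]                     # one contiguous, sequential read

print(out_edges_in_shard(3))   # [12, 872]
print(out_edges_in_shard(7))   # [193, 212, 89139]
```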

51

PAL In-edge Linkage

Pointer-array (Source → file offset): (1, 0), (3, 3), (7, 5), …
Edge-array (Destination): 8, 193, 76420, 12, 872, 193, 212, 89139, …

52

PAL In-edge Linkage

Edge-array, augmented with a link to the next in-edge of the same destination:

Destination | Link (next-in-offset)
8           | 3339
193         | 3
76420       | 1092
12          | 289
872         | 40
193         | 2002
212         | 12
89139       | 22
…           | …

Pointer-array (Source → file offset): (1, 0), (3, 3), (7, 5), …

+ Index to the first in-edge of each vertex in the interval.

Augmented linked list for in-edges.

Problem 2: How to find out-edges quickly?
Note: sorted inside a shard, but partitioned across all shards.
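A hedged sketch of how the augmented links can be followed within the single shard responsible for a destination vertex (names and the end-of-chain sentinel are my assumptions, not the actual implementation):

```python
# `first_in[dst]` gives the edge-array offset of the first in-edge of `dst`
# in this shard; each edge-array entry is (destination, next_in_offset), and
# a negative next_in_offset (an assumption here) ends the chain.
END = -1

def source_of(offset, pointers):
    # The source of an edge is implicit: it is the last entry of the sorted
    # pointer-array whose first-edge offset is <= the edge's offset.
    src = None
    for s, first in pointers:
        if first <= offset:
            src = s
        else:
            break
    return src

def in_edges(dst, first_in, edge_array, pointers):
    sources = []
    off = first_in.get(dst, END)
    while off != END:
        sources.append(source_of(off, pointers))
        _, off = edge_array[off]          # follow the "Link" column
    return sources
```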

53

PAL Out-edge Queries

Pointer-array (Source → file offset): (1, 0), (3, 3), (7, 5), …
Edge-array (Destination, next-in-offset): (8, 3339), (193, 3), (76420, 1092), (12, 289), (872, 40), (193, 2002), (212, 12), (89139, 22), …

Problem: the pointer-array can be big -- O(V). Binary search on disk is slow.

Option 1: Sparse index (in-memory), e.g. one pointer-array entry per block (offsets 256, 512, 1024, … in the diagram).
Option 2: Delta-coding with unary code (Elias-Gamma). Completely in-memory.
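For Option 2, a hedged toy sketch of the Elias-Gamma idea (my own illustration, not the thesis code): store the gaps between consecutive source IDs, each encoded as a unary length prefix followed by the binary value.

```python
# Elias-Gamma coding of the gaps of a sorted pointer-array, so that the whole
# index fits in memory. Gaps are >= 1 because source IDs strictly increase.
def gamma_encode(numbers):
    bits = []
    for n in numbers:
        b = bin(n)[2:]                           # binary, leading '1' included
        bits.append("0" * (len(b) - 1) + b)      # unary length prefix + value
    return "".join(bits)

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

sources = [1, 3, 7, 12, 40]
gaps = [sources[0]] + [b - a for a, b in zip(sources, sources[1:])]
code = gamma_encode(gaps)
assert gamma_decode(code) == gaps
```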

54

Experiment Indices

Median latency, Twitter graph (1.5B edges).

55

Queries: IO costs

In-edge query: only one shard.
Out-edge query: each shard that has edges.

Trade-off: more shards means better locality for in-edge queries, worse for out-edge queries.

Edge Data & Searches

[Diagram: Shard X adjacency file with parallel column files 'weight' [float], 'timestamp' [long], 'belief' (factor); edge (i) sits at the same position i in every column]

No foreign key is required to find edge data (fixed size).

Note: vertex values are stored similarly.
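A hedged sketch of why no foreign key is needed: with fixed-width columns, edge i's data is located by arithmetic alone (file names and column widths below are illustrative assumptions, not the real layout):

```python
import struct

COLUMNS = {"weight": ("<f", 4), "timestamp": ("<q", 8)}  # struct format, bytes/row

def read_edge_value(shard_dir, column, edge_index):
    fmt, width = COLUMNS[column]
    with open(f"{shard_dir}/edata_{column}.bin", "rb") as f:
        f.seek(edge_index * width)        # fixed size => direct seek, no index
        return struct.unpack(fmt, f.read(width))[0]
```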

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading the existing shard from disk; every edge is rewritten on every merge.

Does not scale: each edge is rewritten about (data size / buffer size) = O(|E|) times.

60

Log-Structured Merge-tree (LSM)
Ref: O'Neil, Cheng et al. (1996)

[Diagram:
LEVEL 1 (youngest): top shards with in-memory buffers, one per interval (interval 1, interval 2, …, interval P); new edges arrive here.
LEVEL 2: on-disk shards, each covering a group of intervals (intervals 1--4, …, intervals (P-4)--P).
LEVEL 3 (oldest): on-disk shards covering still larger groups (intervals 1--16, …).
Downstream merge moves data from a full level into the next, larger level.]

In-edge query: one shard on each level.
Out-edge query: all shards.
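A hedged, in-memory toy of the LSM behavior sketched above (buffer capacity and fan-out are illustrative assumptions; the real system flushes sorted runs to disk shards):

```python
import heapq

# New edges go to an in-memory buffer; full buffers are flushed as sorted runs
# into level 1, and a level that exceeds its capacity is merged downstream.
class EdgeLSM:
    def __init__(self, buffer_cap=4, level_fanout=4):
        self.buffer, self.buffer_cap = [], buffer_cap
        self.levels = [[] for _ in range(3)]     # each level: list of sorted runs
        self.fanout = level_fanout

    def add_edge(self, src, dst):
        self.buffer.append((dst, src))           # shards are ordered by destination
        if len(self.buffer) >= self.buffer_cap:
            self._flush()

    def _flush(self):
        self.levels[0].append(sorted(self.buffer))
        self.buffer = []
        for lvl in range(len(self.levels) - 1):
            if len(self.levels[lvl]) > self.fanout:          # downstream merge
                merged = list(heapq.merge(*self.levels[lvl]))
                self.levels[lvl] = []
                self.levels[lvl + 1].append(merged)
```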

61

Experiment Ingest

[Plot: edges ingested vs. time (seconds); GraphChi-DB with LSM vs. GraphChi-No-LSM; x-axis 0–16,000 seconds, y-axis 0–1.5B edges]

62

Advantages of PAL

• Only sparse and implicit indices
  – Pointer-array usually fits in RAM with Elias-Gamma coding → small database size
• Columnar data model
  – Load only the data you need
  – Graph structure is separate from data
  – Property graph model
• Great insertion throughput with LSM
  – Tree can be adjusted to match the workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

• Written in Scala
• Queries & computation
• Online database

All experiments shown in this talk were done on a Mac Mini (8 GB RAM, SSD).

Source code and examples: http://github.com/graphchi

65

Comparison Database Size

[Bar chart: database file size for the twitter-2010 graph (1.5B edges); compared: BASELINE (4 + 4 bytes / edge), GraphChi-DB, Neo4j, MySQL (data + indices); axis 0–70]

66

Comparison Ingest

System                 Time to ingest 1.5B edges
GraphChi-DB (ONLINE)   1 hour 45 mins
Neo4j (batch)          45 hours
MySQL (batch)          3 hours 30 minutes (including index creation)

If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes.

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

[Bar charts: friends-of-friends query latency in milliseconds, percentiles over 100K random queries.
Small graph (68M edges), 50th percentile: GraphChi-DB 0.379 ms, Neo4j 0.127 ms.
Small graph (68M edges), 99th percentile: GraphChi-DB 6.653 ms, Neo4j 8.078 ms.
Big graph (1.5B edges), 50th percentile: GraphChi-DB 22.4 ms, Neo4j 75.98 ms, MySQL 5.9 ms.
Big graph (1.5B edges), 99th percentile: GraphChi-DB 1264 ms, Neo4j 1631 ms, MySQL 4776 ms.]

68

LinkBench Online Graph DB Benchmark by Facebook

• Concurrent read/write workload
  – But only single-hop queries ("friends")
  – 8 different operations, mixed workload
  – Best performance with 64 parallel threads
• Each edge and vertex has: version, timestamp, type, random string payload

                          GraphChi-DB (Mac Mini)   MySQL + FB patch, server, SSD-array, 144 GB RAM
Edge-update (95p)         22 ms                    25 ms
Edge-get-neighbors (95p)  18 ms                    9 ms
Avg throughput            2,487 req/s              11,029 req/s
Database size             350 GB                   1.4 TB

See full results in the thesis

69

Summary of Experiments

• Efficient for mixed read/write workloads
  – See the Facebook LinkBench experiments in the thesis
  – LSM-tree trades off read performance (but can be adjusted)
• State-of-the-art performance for graphs that are much larger than RAM
  – Neo4j's linked-list data structure is good for RAM
  – DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater Impact / Hindsight

Future Research Questions

71

GREATER IMPACT

72

Impact: "Big Data" Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)
  – Two major direct descendant papers in top conferences:
    • X-Stream: SOSP 2013
    • TurboGraph: KDD 2013
• Challenging the mainstream
  – You can do a lot on just a PC: focus on the right data structures and computational models

73

Impact: Users

• GraphChi's implementations (C++, Java, -DB) have gained a lot of users
  – Currently ~50 unique visitors / day
• Enables 'everyone' to tackle big graph problems
  – Especially the recommender toolkit (by Danny Bickson) has been very popular
  – Typical users: students, non-systems researchers, small companies…

74

Impact: Users (cont.)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research - it is too expensive.

Using GraphChi I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research. ()"

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for?

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models

1. Changing the value of an edge (both in- and out-edges)
2. Computation processes the whole graph, or most of it, on each iteration
3. Random access to all of a vertex's edges
   – Vertex-centric vs. edge-centric
4. Async / Gauss-Seidel execution

[Diagram: edge u–v]

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
  – Or dynamic scheduling (such as Splash BP)
• High-diameter graphs, such as planar graphs
  – Unless the computation itself has short-range interactions
• Very large number of iterations
  – Neural networks
  – LDA with Collapsed Gibbs sampling
• No support for implicit graph structure

+ Single PC performance is limited

78

Versatility of PSW and PAL

[Diagram: the same PSW / PAL application map shown earlier: GraphChi (Parallel Sliding Windows) and GraphChi-DB (Partitioned Adjacency Lists), online graph updates, batch and incremental computation on evolving graphs, DrunkardMob parallel random walk simulation, GraphChi^2; Graph Computation (Triangle Counting, Item-Item Similarity, PageRank, SALSA, HITS, Graph Contraction, Minimum Spanning Forest, k-Core, Connected Components, Community Detection, Label Propagation, Multi-BFS, Matrix Factorization, Loopy Belief Propagation, Co-EM) and Graph Queries (Shortest Path, Friends-of-Friends, Induced Subgraphs, Neighborhood queries, Edge and vertex properties, Link prediction, Graph sampling, Graph traversal)]

79

Future Research Directions

• Distributed setting
  1. Distributed PSW (one shard / node)?
     1. PSW is inherently sequential
     2. Low bandwidth in the Cloud
  2. Co-operating GraphChi(-DB)'s connected with a Parameter Server
• New graph programming models and tools
  – Vertex-centric programming is sometimes too local: for example, two-hop interactions and many traversals are cumbersome
  – Abstractions for learning graph structure? Implicit graphs?
  – Hard to debug, especially async; better tools needed
• Graph-aware optimizations to GraphChi-DB
  – Buffer management
  – Smart caching
  – Learning the configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The IO performance of PSW is only weakly affected by the amount of RAM
  – Good: works with very little memory
  – Bad: does not benefit from more memory
• Simple trick: cache some data
• Many graph algorithms have O(V) state
  – The update function accesses neighbor vertex state
    • Standard PSW: 'broadcast' the vertex value via edges
    • Semi-external: store vertex values in memory

82

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
  – But not enough to store the whole graph

[Diagram: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state, …) in RAM]
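A hedged sketch of the semi-external pattern for one algorithm (Pagerank-style; `stream_subgraph` stands in for PSW's edge loading and is an assumption of this illustration):

```python
import numpy as np

# Keep one O(V) array of vertex values pinned in RAM while edges are still
# streamed from disk interval by interval, instead of broadcasting values
# along edges.
class SemiExternalPagerank:
    def __init__(self, num_vertices, d=0.85):
        self.rank = np.full(num_vertices, 1.0 / num_vertices)  # O(V) state in RAM
        self.next = np.zeros(num_vertices)
        self.d = d

    def iterate(self, intervals, stream_subgraph):
        self.next[:] = (1.0 - self.d) / len(self.rank)
        for interval in intervals:
            # stream_subgraph yields (src, dst, out_degree_of_src) from disk
            for src, dst, out_degree in stream_subgraph(interval):
                self.next[dst] += self.d * self.rank[src] / out_degree
        self.rank, self.next = self.next, self.rank
```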

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
  – Store billions of random walk states in RAM
• Multiple Breadth-First Searches
  – Analyze neighborhood sizes by starting hundreds of random BFSes
• Compute many different recommender algorithms in parallel (or with different parameterizations)
  – See Mayank Mohta and Shu-Hao Yu's Master's project

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin) – UAI 2010
Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors) – VLDB 2012
GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch) – OSDI 2012
DrunkardMob: Billions of Random Walks on Just a PC – ACM RecSys 2013
Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch) – SEA 2014
GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin) – submitted (arXiv)
Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin) – ICML 2011

First-author papers in bold; a bracket labeled THESIS on the slide marks the publications included in the thesis.

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation → GraphChi
• Extended PSW to design Partitioned Adjacency Lists for building a scalable graph database → GraphChi-DB
• Proposed DrunkardMob for simulating billions of random walks in parallel
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms → a new approach for external-memory graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                   GraphChi (40 Mac Minis)   PowerGraph (64 EC2 cc1.4xlarge)
Investments        67,320 $                  -
Operating costs:
  Per node, hour   0.03 $                    1.30 $
  Cluster, hour    1.19 $                    52.00 $
  Daily            28.56 $                   1,248.00 $

It takes about 56 days to recoup the Mac Mini investments.

Assumptions:
- Mac Mini: 85 W (typical servers: 500-1000 W)
- Most expensive US energy: 35 c / kWh
- Equal-throughput configurations (based on OSDI'12)
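A quick check of the recoup figure from the daily costs above (decimals restored from the slide):

\[
\frac{\$\,67{,}320}{\$\,1{,}248.00/\text{day} \;-\; \$\,28.56/\text{day}} \;\approx\; 55\ \text{days},
\]

consistent with the roughly 56 days stated.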

89

PSW for In-memory Computation

• External memory setting
  – Slow memory = hard disk / SSD
  – Fast memory = RAM
• In-memory setting
  – Slow = RAM
  – Fast = CPU caches

[Diagram: CPU, small (fast) memory of size M, large (slow) memory of 'unlimited' size, connected by block transfers of size B]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW slightly outperforms Ligra, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

(Ligra: Julian Shun, Guy Blelloch, PPoPP 2013)

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
  – But needs twice the amount of space
• Async / Gauss-Seidel helps with high-diameter graphs
• Some algorithms converge much better asynchronously
  – Loopy BP, see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Computation (1989)
• Asynchronous execution is sometimes difficult to reason about and debug
• Asynchronous can be used to implement BSP

IO Complexity

• See the paper for theoretical analysis in Aggarwal-Vitter's I/O model
  – Worst case is only 2x the best case
• Intuition:

[Diagram: vertex ID range 1..|V| split into interval(1) … interval(P) with shards shard(1) … shard(P); example edge (v1, v2)]

An edge whose endpoints fall in the same interval is loaded from disk only once per iteration; an edge spanning two intervals is loaded twice per iteration.
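The same intuition written out (my notation, not the paper's): with block size B and a fraction f of edges whose endpoints lie in different intervals, the edge reads per iteration are

\[
\frac{(1-f)\,|E| \;+\; 2f\,|E|}{B} \;=\; (1+f)\,\frac{|E|}{B},
\qquad
\frac{|E|}{B} \;\le\; \text{reads} \;\le\; \frac{2\,|E|}{B},
\]

so the worst case (f = 1) is only 2x the best case (f = 0); the exact bounds, including writes, are in the paper.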

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise they would require too many iterations
• Per-iteration cost is not much affected
• Can we optimize partitioning?
  – Could help thanks to Gauss-Seidel (faster convergence inside "groups"), topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
  – But they require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)
• Compression is problematic for evolving graphs and associated data
• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases
  – GOOD, GraphDB, HyperGraphDB…
  – Focus was on modeling; graph storage on top of a relational DB or key-value store
• RDF databases
  – Most do not use graph storage but store triples as relations + use indexing
• Modern solutions have proposed graph-specific storage
  – Neo4j: doubly linked list
  – TurboGraph: adjacency list chopped into pages
  – DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time: 9 hours; FB's: 12 hours
• GraphChi database: about 250 GB; FB: > 1.4 terabytes
  – However, about 100 GB is explained by different variable data (payload) size
• Facebook/MySQL accessed via JDBC, GraphChi embedded
  – But MySQL is native code, GraphChi-DB is Scala (JVM)
• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance vs. size of DB

[Plot: requests / sec (0–10,000) vs. number of nodes (1.0E+07 to 1.0E+09)]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory extension [SSDAlloc, NSDI'11]
   → Too many small objects; need millions / sec.
2. Compress the graph structure to fit into RAM [WebGraph framework]
   → Associated values do not compress well and are mutated.
3. Cluster the graph and handle each cluster separately in RAM
   → Expensive: the number of inter-cluster edges is big.
4. Caching of hot nodes
   → Unpredictable performance.

Number of Shards

• If P is in the "dozens", there is not much effect on performance.

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks → parallel I/O.

[Bar chart: seconds / iteration (0–3000) for Pagerank and Connected components with 1, 2, and 3 hard drives]

Experiment on a 16-core AMD server (from year 2007).

Bottlenecks

[Stacked bar chart: runtime (0–2500 s) of Connected Components on Mac Mini (SSD) with 1, 2, and 4 threads, broken into Disk IO, Graph construction, and Exec. updates]

• The cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSD
  – Graph construction requires a lot of random access in RAM → memory bandwidth becomes a bottleneck

Bottlenecks: Multicore

Experiment on a MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD IO with 2 threads

[Bar charts: runtime (seconds) split into Loading and Computation vs. number of threads (1, 2, 4), for Matrix Factorization (ALS) (0–160 s) and Connected Components (0–1000 s)]

104

In-memory vs Disk

105

Experiment: Query latency

See thesis for IO cost analysis of in-edge / out-edge queries.

106

Example Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S
• Very fast in GraphChi-DB
  – Sufficient to query for out-edges
  – Parallelizes well: multi-out-edge-query
• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods from the graph
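A hedged sketch of the query (my own pseudocode-level Python; `out_edge_query` stands in for the database call):

```python
# An induced-subgraph query needs only out-edge queries, because every edge
# with both endpoints in S is an out-edge of some vertex in S.
def induced_subgraph(S, out_edge_query):
    S = set(S)
    edges = []
    for v in S:                        # parallelizes as a multi-out-edge query
        for dst in out_edge_query(v):
            if dst in S:
                edges.append((v, dst))
    return edges
```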

Vertices / Nodes

• Vertices are partitioned similarly to edges
  – Similar "data shards" for columns
• Lookup/update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
  – A sparse structure could be supported

[Diagram: vertex ID ranges 1..a, a+1..2a, …, up to Max-id]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance the shards
  – Interval length constant a

[Diagram: internal IDs 0, 1, 2, …; a, a+1, a+2, …; 2a, 2a+1, …; original IDs 0..255 map to the start of the first interval, 256..511 to the start of the second, and so on]
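A hedged sketch of one such balancing map consistent with the diagram (the block size of 256, the constants, and the helper names are assumptions of this illustration, not the actual scheme):

```python
# Consecutive blocks of BLOCK original IDs are scattered round-robin across
# P intervals, each of internal length a.
BLOCK = 256

def to_internal(orig_id, P, a):
    block, offset = divmod(orig_id, BLOCK)
    interval = block % P                      # round-robin over intervals
    slot = (block // P) * BLOCK + offset      # position inside the interval
    return interval * a + slot

def to_original(internal_id, P, a):
    interval, pos = divmod(internal_id, a)
    block_in_interval, offset = divmod(pos, BLOCK)
    return (block_in_interval * P + interval) * BLOCK + offset

assert to_original(to_internal(300, P=4, a=1000), P=4, a=1000) == 300
```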

109

What if we have a Cluster?

[Diagram: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state, …) in RAM]

Trade latency for throughput.

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications
2. … which is caused by the lack of good data available to academics: big graphs with metadata
   – Industry co-operation? But problems with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just structure)
3. Too much focus on performance. More important to enable "extracting value".

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i in 1..numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems: "paging" from disk is costly. (This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm – reverse thinking:

for each interval p:
    walkSnapshot = getWalksForInterval(p)
    for each vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        for each walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk.
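A runnable toy version of the loop above (my own sketch, not the thesis code), keeping only each walk's current vertex:

```python
import random
from collections import defaultdict

# Walks are bucketed by the interval of their current vertex, so each pass
# over an interval advances every walk that currently sits there.
def drunkardmob(out_edges, intervals, walks, hops, rng=random.Random(0)):
    # walks: dict walk_id -> current vertex; out_edges: dict vertex -> [neighbors]
    for _ in range(hops):
        buckets = defaultdict(list)                    # interval index -> walk ids
        for w, v in walks.items():
            buckets[interval_of(v, intervals)].append(w)
        for p in range(len(intervals)):                # "ForEach interval p"
            for w in buckets[p]:
                v = walks[w]
                nbrs = out_edges.get(v, [])
                walks[w] = rng.choice(nbrs) if nbrs else v   # addHop
    return walks

def interval_of(v, intervals):
    for p, (lo, hi) in enumerate(intervals):
        if lo <= v <= hi:
            return p
    raise ValueError(v)
```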

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 32: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

32

PSW is Asynchronous (Gauss-Seidel)

1 2 4 3 5

1 1 1 3 3

1 1 1 1 1

Gauss-Seidel expected iterations on random schedule on a chain graph

= (N - 1) (e ndash 1) asymp 60 of synchronous

Each vertex chooses minimum label of neighbor

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 33: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

33

Open theoretical question

Label PropagationSide length = 100

Synchronous PSW Gauss-Seidel (average random schedule)

199 ~ 57

100 ~29

298

iterations

~52

Natural graphs (web social)

graph diameter - 1

~ 06 diameter

Joint work Julian Shun

Chapter 6

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

• C++ implementation: 8,000 lines of code
  – Java implementation also available
• Several optimizations to PSW (see paper)

Source code and examples: http://github.com/graphchi

38

Experiment Setting

• Mac Mini (Apple Inc.)
  – 8 GB RAM
  – 256 GB SSD, 1 TB hard drive
  – Intel Core i5, 2.5 GHz

• Experiment graphs:

Graph          Vertices   Edges   P (shards)   Preprocessing
live-journal   4.8M       69M     3            0.5 min
netflix        0.5M       99M     20           1 min
twitter-2010   42M        1.5B    20           2 min
uk-2007-05     106M       3.7B    40           31 min
uk-union       133M       5.4B    50           33 min
yahoo-web      1.4B       6.6B    50           37 min

39

Comparison to Existing Systems

Notes: comparison results do not include time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all but GraphLab compute synchronously.

PageRank · WebGraph Belief Propagation (U Kang et al.) · Matrix Factorization (Alt. Least Sqr.) · Triangle Counting

See the paper for more comparisons.

[Bar charts, runtime in minutes, GraphChi (Mac Mini) vs. other systems:
 – Netflix: GraphLab v1 (8 cores) vs. GraphChi (Mac Mini)
 – Twitter-2010 (1.5B edges): Spark (50 machines) vs. GraphChi (Mac Mini)
 – Yahoo-web (6.7B edges): Pegasus / Hadoop (100 machines) vs. GraphChi (Mac Mini)
 – twitter-2010 (1.5B edges): Hadoop (1636 machines) vs. GraphChi (Mac Mini)]

On a Mac Mini: GraphChi can solve as big problems as existing large-scale systems, with comparable performance.

40

PowerGraph Comparison

• PowerGraph / GraphLab 2 outperforms previous systems by a wide margin on natural graphs (OSDI'12)
• With 64x more machines, 512x more CPUs:
  – Pagerank: 40x faster than GraphChi
  – Triangle counting: 30x faster than GraphChi

GraphChi has good performance / CPU.

[Chart: GraphLab 2 vs. GraphChi]

41

In-memory Comparison

• Comparison to state-of-the-art: 5 iterations of PageRank, Twitter graph (1.5B edges)
• Total runtime comparison to 1-shard GraphChi, with initial load + output write taken into account:

GraphChi (Mac Mini, SSD)                             790 secs
Ligra (J. Shun, Blelloch), 40-core Intel E7-8870     15 secs
Ligra (J. Shun, Blelloch), 8-core Xeon 5550          230 s + preproc 144 s
PSW – in-mem version, 700 shards (see Appendix),
  8-core Xeon 5550                                   100 s + preproc 210 s

However, sometimes a better algorithm is available for in-memory than for external-memory / distributed settings.

42

Scalability Input Size [SSD]

• Throughput: number of edges processed / second

Conclusion: the throughput remains roughly constant when the graph size is increased.

GraphChi with a hard drive is ~2x slower than with an SSD (if computational cost is low).

[Chart: PageRank throughput (edges/sec, Mac Mini, SSD) vs. graph size (0 to 7x10^9 edges), for the graphs: domain, twitter-2010, uk-2007-05, uk-union, yahoo-web.]

Paper: scalability of other applications.

43

GRAPHCHI-DB
New work

44

[Diagram: the space spanned by GraphChi (Parallel Sliding Windows) and GraphChi-DB (Partitioned Adjacency Lists), from batch computation on an evolving graph and incremental computation to online graph updates, and from Graph Computation to Graph Queries. Workloads shown: Triangle Counting, Item-Item Similarity, PageRank, SALSA, HITS, Graph Contraction, Minimum Spanning Forest, k-Core, Matrix Factorization, Loopy Belief Propagation, Co-EM, Weakly Connected Components, Strongly Connected Components, Community Detection, Label Propagation, Multi-BFS, Graph traversal, Shortest Path, Friends-of-Friends, Induced Subgraphs, Neighborhood query, Edge and vertex properties, Link prediction, Graph sampling. Also: DrunkardMob (parallel random walk simulation), GraphChi^2.]

45

Research Questions

• What if there is a lot of metadata associated with edges and vertices?
• How to do graph queries efficiently while retaining computational capabilities?
• How to add edges efficiently to the graph?

Can we design a graph database based on GraphChi?

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problems:
• Poor performance with data >> memory
• No/weak support for analytical computation

2) Relational / key-value databases as graph storage

Problems:
• Large indices
• In-edge / out-edge dilemma
• No/weak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review: Edges in Shards

[Diagram: destination-vertex intervals 0–100, 101–700, …, N with Shard 1, Shard 2, Shard 3, …, Shard P; edges in each shard sorted by source_id.]

Edge = (src, dst). Partition by dst.

49

Shard Structure (Basic)

Source   Destination
1        8
1        193
1        76420
3        12
3        872
7        193
7        212
7        89139
…        …

(src, dst)

50

Shard Structure (Basic)

Pointer-array:
Source   File offset
1        0
3        3
7        5
…        …

Edge-array (destinations): 8, 193, 76420, 12, 872, 193, 212, 89139, …

Compressed Sparse Row (CSR)

Problem 1: How to find the in-edges of a vertex quickly?
Note: we know the shard, but the edges are in random order.

51
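To make the CSR layout above concrete, here is a minimal Scala sketch (illustrative only, not GraphChi-DB's actual code; names like ShardCSR are assumptions) of one shard held as a pointer-array plus edge-array, with an out-edge lookup within the shard:

    // Minimal sketch of one shard in CSR form.
    // pointerSrc(i) is a source vertex id; pointerOffset(i) is the index of its
    // first edge in dst. A source's edges end where the next source's begin.
    final case class ShardCSR(pointerSrc: Array[Long],
                              pointerOffset: Array[Int],
                              dst: Array[Long]) {

      // Out-edges of `src` that fall into this shard's destination interval.
      def outEdges(src: Long): Array[Long] = {
        val i = java.util.Arrays.binarySearch(pointerSrc, src)
        if (i < 0) Array.empty[Long]
        else {
          val start = pointerOffset(i)
          val end   = if (i + 1 < pointerOffset.length) pointerOffset(i + 1) else dst.length
          dst.slice(start, end)
        }
      }
    }

    object ShardCSRExample extends App {
      // The example rows from the slides: sources 1, 3, 7 at offsets 0, 3, 5.
      val shard = ShardCSR(
        pointerSrc    = Array(1L, 3L, 7L),
        pointerOffset = Array(0, 3, 5),
        dst           = Array(8L, 193L, 76420L, 12L, 872L, 193L, 212L, 89139L))
      println(shard.outEdges(3L).mkString(", "))   // prints: 12, 872
    }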

PAL In-edge Linkage

(Same pointer-array and edge-array as above, before the in-edge linkage is added.)

52

PAL In-edge Linkage

Edge-array:
Destination   Link
8             3339
193           3
76420         1092
12            289
872           40
193           2002
212           12
89139         22
…

Pointer-array:
Source   File offset
1        0
3        3
7        5
…        …

+ Index to the first in-edge for each vertex in the interval.
Augmented linked list for in-edges (see the sketch below).

Problem 2: How to find out-edges quickly?
Note: sorted inside a shard, but partitioned across all shards.

53
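A sketch of how the in-edge linkage above could be followed inside one shard: each edge row stores the offset of the next edge with the same destination, and a per-interval index gives the first such offset for each vertex. This is an illustration under assumed names, not the actual GraphChi-DB implementation.

    // Sketch: in-edges of a vertex inside one shard via the augmented linked list.
    // dst(i)  = destination of edge i;
    // link(i) = index of the next edge in this shard with the same destination, or -1.
    final case class ShardWithInEdgeLinks(dst: Array[Long],
                                          link: Array[Int],
                                          firstInEdge: Map[Long, Int]) {

      // Indices of all in-edges of `vertex` stored in this shard.
      def inEdgeIndices(vertex: Long): List[Int] = {
        val out = scala.collection.mutable.ListBuffer.empty[Int]
        var idx = firstInEdge.getOrElse(vertex, -1)
        while (idx >= 0) {
          out += idx
          idx = link(idx)   // follow the chain to the next edge sharing this destination
        }
        out.toList
      }
    }

For example, if edges 1 and 5 both have destination 193, then firstInEdge(193) = 1, link(1) = 5 and link(5) = -1, so inEdgeIndices(193) returns List(1, 5).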

PAL: Out-edge Queries

Pointer-array:
Source   File offset
1        0
3        3
7        5
…        …

Problem: can be big – O(V). Binary search on disk is slow.

Edge-array:
Destination   Next-in-offset
8             3339
193           3
76420         1092
12            289
872           40
193           2002
212           12
89139         22
…

Option 1: Sparse index (in-memory) [diagram shows index entries at 256, 512, 1024, …].
Option 2: Delta-coding with unary code (Elias-Gamma) – completely in-memory.

54
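Option 1 above (a sparse in-memory index over the on-disk pointer-array) might look like the following sketch: keep every k-th (source, file position) pair in RAM, binary-search it, and then scan at most k pointer-array entries from disk. The names and the default stride are assumptions for illustration.

    // Sketch of a sparse in-memory index over the on-disk pointer-array.
    // sampleSrc(i) / sampleFilePos(i): every k-th pointer-array entry kept in RAM.
    final class SparsePointerIndex(sampleSrc: Array[Long],
                                   sampleFilePos: Array[Long],
                                   k: Int = 1024) {

      // File position from which at most k pointer entries need to be scanned
      // to locate the out-edge list of `src` in this shard.
      def scanStartFor(src: Long): Long = {
        val i    = java.util.Arrays.binarySearch(sampleSrc, src)
        val slot = if (i >= 0) i else math.max(0, -i - 2)   // last sample <= src
        sampleFilePos(slot)
      }
    }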

Experiment: Indices

[Chart: median query latency, Twitter graph, 1.5B edges.]

55

Queries: I/O costs

In-edge query: only one shard.
Out-edge query: each shard that has edges.

Trade-off: more shards → better locality for in-edge queries, worse for out-edge queries.

Edge Data & Searches

[Diagram: Shard X – adjacency, plus separate fixed-size columns: 'weight' [float], 'timestamp' [long], 'belief' (factor).]

Edge (i): no foreign key is required to find the edge data (fixed size). See the offset sketch below.

Note: vertex values are stored similarly.

57
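Because each edge attribute lives in its own fixed-width column, the byte position of edge i's value is simply i times the field width; no foreign key or per-edge index lookup is needed. A minimal sketch (file layout and names assumed for illustration):

    import java.io.RandomAccessFile

    // Sketch: read the float 'weight' of edge number edgeIndex from a per-shard column file.
    // Fixed-size fields mean the offset is just edgeIndex * 4 bytes.
    def readEdgeWeight(columnFile: RandomAccessFile, edgeIndex: Long): Float = {
      columnFile.seek(edgeIndex * 4L)   // 4 bytes per float value
      columnFile.readFloat()
    }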

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading the existing shard from disk. Each edge is always rewritten on a merge.

Does not scale: number of rewrites = data size / buffer size = O(E).

60

Log-Structured Merge-tree (LSM). Ref: O'Neil, Cheng et al. (1996)

[Diagram: LEVEL 1 (youngest) – top shards with in-memory buffers, one per interval (interval 1, interval 2, interval 3, interval 4, …, interval P-1, interval P); new edges enter here. Downstream merges feed LEVEL 2 – on-disk shards covering intervals 1–4, …, (P-4)–P – and LEVEL 3 (oldest), covering intervals 1–16, and so on.]

In-edge query: one shard on each level.
Out-edge query: all shards.

61
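The merge arrangement above can be sketched as follows: new edges land in the in-memory level; when a level grows past its budget it is merged downstream into the next, geometrically larger level, so each edge is rewritten only a logarithmic number of times instead of O(E / buffer size). This is a schematic illustration, not GraphChi-DB's actual merge code.

    // Schematic LSM levels for one interval (illustration only).
    // Each level holds (src, dst) pairs sorted by src; level capacities grow geometrically.
    final class LsmLevels(fanout: Int = 4, level1Capacity: Int = 1000000) {
      private val levels = scala.collection.mutable.ArrayBuffer(Vector.empty[(Long, Long)])

      def insert(edge: (Long, Long)): Unit = {
        levels(0) = levels(0) :+ edge
        var l = 0
        // Cascade: merge a full level downstream into the next, larger level.
        while (levels(l).size > level1Capacity * math.pow(fanout, l)) {
          if (l + 1 == levels.size) levels += Vector.empty
          levels(l + 1) = (levels(l + 1) ++ levels(l)).sortBy(_._1)
          levels(l) = Vector.empty
          l += 1
        }
      }

      // An out-edge style query must consult every level (one shard per level).
      def edgesOf(src: Long): Seq[(Long, Long)] = levels.flatMap(_.filter(_._1 == src)).toSeq
    }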

Experiment: Ingest

[Chart: edges ingested (0 to 1.5x10^9) vs. time in seconds (0 to 16,000), GraphChi-DB with LSM vs. GraphChi-No-LSM.]

62

Advantages of PAL

• Only sparse and implicit indices
  – Pointer-array usually fits in RAM with Elias-Gamma → small database size
• Columnar data model
  – Load only the data you need
  – Graph structure is separate from data
  – Property graph model
• Great insertion throughput with LSM
  – Tree can be adjusted to match the workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

• Written in Scala
• Queries & Computation
• Online database

All experiments shown in this talk were done on a Mac Mini (8 GB, SSD).

Source code and examples: http://github.com/graphchi

65

Comparison Database Size

Baseline: 4 + 4 bytes / edge.

[Bar chart: database file size for the twitter-2010 graph (1.5B edges), scale 0–70, for MySQL (data + indices), Neo4j, GraphChi-DB, and the BASELINE.]

66
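As a sanity check on the BASELINE bar, and assuming the 1.5B edge count from the experiment table, the 4 + 4 bytes/edge baseline works out to roughly 1.5 x 10^9 edges x 8 bytes/edge = 1.2 x 10^10 bytes, i.e. about 12 GB; the measured database sizes should be read against that floor.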

Comparison Ingest

System                  Time to ingest 1.5B edges
GraphChi-DB (ONLINE)    1 hour 45 mins
Neo4j (batch)           45 hours
MySQL (batch)           3 hours 30 minutes (including index creation)

If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes.

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

[Bar charts: friends-of-friends query latency in milliseconds, GraphChi-DB vs. Neo4j (and MySQL on the big graph); panels: Small graph – 50th percentile, Small graph – 99th percentile, Big graph – 50th percentile, Big graph – 99th percentile.]

Latency percentiles over 100K random queries. Small graph: 68M edges; big graph: 1.5B edges.

68
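A friends-of-friends query of the kind benchmarked above can be expressed as two rounds of out-edge queries. A minimal sketch against an assumed queryOutNeighbors function (not the GraphChi-DB API):

    // Sketch: friends-of-friends = out-neighbors of out-neighbors, excluding the
    // start vertex and its direct neighbors. queryOutNeighbors stands in for the
    // database's out-edge query (which touches each shard that has edges).
    def friendsOfFriends(v: Long, queryOutNeighbors: Long => Set[Long]): Set[Long] = {
      val friends = queryOutNeighbors(v)
      val fof     = friends.flatMap(queryOutNeighbors)   // second hop
      fof -- friends - v
    }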

LinkBench Online Graph DB Benchmark by Facebook

• Concurrent read/write workload
  – But only single-hop queries ("friends")
  – 8 different operations, mixed workload
  – Best performance with 64 parallel threads
• Each edge and vertex has:
  – Version, timestamp, type, random string payload

                           GraphChi-DB (Mac Mini)   MySQL + FB patch (server: SSD-array, 144 GB RAM)
Edge-update (95p)          22 ms                    25 ms
Edge-get-neighbors (95p)   18 ms                    9 ms
Avg throughput             2,487 req/s              11,029 req/s
Database size              350 GB                   1.4 TB

See full results in the thesis

69

Summary of Experiments

• Efficient for mixed read/write workload
  – See Facebook LinkBench experiments in the thesis
  – LSM-tree: trade-off in read performance (but can be adjusted)
• State-of-the-art performance for graphs that are much larger than RAM
  – Neo4j's linked-list data structure is good for RAM
  – DEX performs poorly in practice

More experiments in the thesis.

70

Discussion

Greater Impact / Hindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)
  – Two major direct descendant papers in top conferences:
    • X-Stream: SOSP 2013
    • TurboGraph: KDD 2013
• Challenging the mainstream
  – You can do a lot on just a PC: focus on the right data structures and computational models

73

Impact Users

• GraphChi's implementations (C++, Java, -DB) have gained a lot of users
  – Currently ~50 unique visitors / day
• Enables 'everyone' to tackle big graph problems
  – Especially the recommender toolkit (by Danny Bickson) has been very popular
  – Typical users: students, non-systems researchers, small companies…

74

Impact Users (cont)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research – it is too expensive.

Using GraphChi, I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research."

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models

1. Changing the value of an edge (both in- and out-)
2. Computation processes the whole graph, or most of it, on each iteration
3. Random access to all of a vertex's edges
   – Vertex-centric vs. edge-centric
4. Async / Gauss-Seidel execution

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
  – Or dynamic scheduling (such as Splash BP)
• High-diameter graphs, such as planar graphs
  – Unless the computation itself has short-range interactions
• Very large number of iterations
  – Neural networks
  – LDA with Collapsed Gibbs sampling
• No support for implicit graph structure

+ Single PC performance is limited

78

Versatility of PSW and PAL

[Diagram: the same versatility map as before – GraphChi (Parallel Sliding Windows) and GraphChi-DB (Partitioned Adjacency Lists) spanning graph computation (Triangle Counting, Item-Item Similarity, PageRank, SALSA, HITS, Graph Contraction, Minimum Spanning Forest, k-Core, Matrix Factorization, Loopy Belief Propagation, Co-EM, Weakly Connected Components, Strongly Connected Components, Community Detection, Label Propagation, Multi-BFS, Graph traversal) and graph queries (Shortest Path, Friends-of-Friends, Induced Subgraphs, Neighborhood query, Edge and vertex properties, Link prediction, Graph sampling), plus online graph updates, evolving graphs, incremental computation, DrunkardMob (parallel random walk simulation) and GraphChi^2.]

79

Future Research Directions

• Distributed setting
  1. Distributed PSW (one shard / node)
     1. PSW is inherently sequential
     2. Low bandwidth in the Cloud
  2. Co-operating GraphChi(-DB)'s connected with a Parameter Server
• New graph programming models and tools
  – Vertex-centric programming is sometimes too local: for example, two-hop interactions and many traversals are cumbersome
  – Abstractions for learning graph structure? Implicit graphs?
  – Hard to debug, especially async → better tools needed
• Graph-aware optimizations to GraphChi-DB
  – Buffer management
  – Smart caching
  – Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM
  – Good: works with very little memory
  – Bad: does not benefit from more memory
• Simple trick: cache some data
• Many graph algorithms have O(V) state
  – The update function accesses neighbor vertex state
    • Standard PSW: 'broadcast' the vertex value via edges
    • Semi-external: store vertex values in memory

82

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
  – But not enough to store the whole graph

[Diagram: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state, …) in RAM.]

83
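A toy sketch of the semi-external pattern above: the O(V) vertex state lives in a plain array in RAM while the edges are streamed from disk, so neighbor state is read from memory instead of being 'broadcast' along edges. Names are assumptions; the normalization and damping of a real PageRank are omitted.

    // Toy semi-external pass: vertex state in RAM, edges streamed from disk.
    // numVertices floats fit easily in memory even when the edge list does not.
    def semiExternalPass(numVertices: Int,
                         streamEdges: ((Long, Long) => Unit) => Unit): Array[Float] = {
      val value    = Array.fill(numVertices)(1.0f)   // current vertex state, O(V) in RAM
      val nextAccu = Array.fill(numVertices)(0.0f)   // accumulator for the next iteration

      // One sequential pass over all edges, supplied by the disk-based streamer.
      streamEdges { (src, dst) =>
        nextAccu(dst.toInt) += value(src.toInt)      // neighbor state read from RAM
      }
      nextAccu
    }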

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
  – Store billions of random walk states in RAM
• Multiple Breadth-First Searches
  – Analyze neighborhood sizes by starting hundreds of random BFSes
• Compute many different recommender algorithms in parallel (or with different parameterizations)
  – See Mayank Mohta and Shu-Hao Yu's Master's project

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin) – UAI 2010
Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors) – VLDB 2012
GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch) – OSDI 2012
DrunkardMob: Billions of Random Walks on Just a PC – ACM RecSys 2013
Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch) – SEA 2014
GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin) – (submitted, arXiv)
Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin) – ICML 2011

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation → GraphChi
• Extended PSW to design Partitioned Adjacency Lists, to build a scalable graph database → GraphChi-DB
• Proposed DrunkardMob for simulating billions of random walks in parallel
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms → a new approach for external-memory graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                    GraphChi (40 Mac Minis)   PowerGraph (64 EC2 cc1.4xlarge)
Investments         67,320 $                  –
Operating costs:
  Per node / hour   0.03 $                    1.30 $
  Cluster / hour    1.19 $                    52.00 $
  Daily             28.56 $                   1,248.00 $

It takes about 56 days to recoup the Mac Mini investment.

Assumptions: Mac Mini 85 W (typical servers 500–1000 W); most expensive US energy, 35c / kWh. Equal-throughput configurations (based on OSDI'12).

89

PSW for In-memory Computation

• External memory setting
  – Slow memory = hard disk / SSD
  – Fast memory = RAM
• In-memory
  – Slow = RAM
  – Fast = CPU caches

[Diagram: CPU – small (fast) memory of size M – block transfers of size B – large (slow) memory of 'unlimited' size.]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW (slightly) outperforms Ligra*, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

* Julian Shun, Guy Blelloch, PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
  – But needs twice the amount of space
• Async / Gauss-Seidel helps with high-diameter graphs
• Some algorithms converge much better asynchronously
  – Loopy BP: see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)
• Asynchronous is sometimes difficult to reason about and debug
• Asynchronous can be used to implement BSP

I/O Complexity

• See the paper for a theoretical analysis in the Aggarwal-Vitter I/O model
  – Worst case is only 2x the best case
• Intuition:

[Diagram: intervals interval(1), interval(2), …, interval(P) over vertex IDs 1 … |V| with shard(1), shard(2), …, shard(P); example vertices v1, v2.]

An edge within an interval is loaded from disk only once per iteration; an edge spanning intervals is loaded twice per iteration.

93
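Under that intuition (each edge read at most twice per iteration, and written back at most twice when edge values are updated), a rough per-iteration transfer count is on the order of 4·|E|·b / B sequential block transfers (b = bytes per edge, B = block size), plus the non-sequential seeks caused by the P sliding windows (on the order of P^2 per iteration); see the paper for the precise bounds in the Aggarwal-Vitter model.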

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter: otherwise they would require too many iterations
• Per-iteration cost is not much affected
• Can we optimize the partitioning?
  – Could help thanks to Gauss-Seidel (faster convergence inside "groups"); topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression Would it help

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3–4 bits / edge (web), ~10 bits / edge (social)
  – But they require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)
• Compression is problematic for evolving graphs and associated data
• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases
  – GOOD, GraphDB, HyperGraphDB…
  – Focus was on modeling; graph storage on top of a relational DB or key-value store
• RDF databases
  – Most do not use graph storage, but store triples as relations + use indexing
• Modern solutions have proposed graph-specific storage
  – Neo4j: doubly linked list
  – TurboGraph: adjacency list chopped into pages
  – DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time: 9 hours; FB's: 12 hours
• GraphChi database about 250 GB; FB > 1.4 terabytes
  – However, about 100 GB is explained by different variable data (payload) size
• Facebook: MySQL via JDBC; GraphChi: embedded
  – But MySQL is native code, GraphChi-DB is Scala (JVM)
• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance / size of DB

[Chart: requests / sec (0–10,000) vs. number of nodes (10^7 to 10^9).]

Why not even faster when everything is in RAM? → computationally bound requests.

Possible Solutions

1. Use SSD as a memory-extension [SSDAlloc, NSDI'11]
   → Too many small objects; would need millions / sec
2. Compress the graph structure to fit into RAM [WebGraph framework]
   → Associated values do not compress well and are mutated
3. Cluster the graph and handle each cluster separately in RAM
   → Expensive: the number of inter-cluster edges is big
4. Caching of hot nodes
   → Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance.

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks → parallel I/O

[Chart: seconds / iteration (0–3,000) for Pagerank and Connected components with 1, 2, and 3 hard drives.]

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

[Chart: runtime breakdown (Disk IO, Graph construction, Exec. updates) with 1, 2, and 4 threads; Connected Components on Mac Mini, SSD.]

• The cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
  – Graph construction requires a lot of random access in RAM → memory bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on a MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime (seconds) split into Loading vs. Computation for 1, 2, and 4 threads; Matrix Factorization (ALS) and Connected Components.]

104

In-memory vs Disk

105

Experiment: Query latency

See thesis for the I/O cost analysis of in/out queries.

106

Example Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S
• Very fast in GraphChi-DB
  – Sufficient to query for out-edges (see the sketch below)
  – Parallelizes well: multi-out-edge-query
• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods from the graph
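A sketch of the induced-subgraph query described above, using only out-edge queries (queryOutNeighbors is the same assumed stand-in as in the friends-of-friends sketch earlier):

    // Sketch: the induced subgraph of vertex set s needs only out-edge queries,
    // because every edge with both endpoints in s is an out-edge of some vertex in s.
    def inducedSubgraph(s: Set[Long], queryOutNeighbors: Long => Set[Long]): Set[(Long, Long)] =
      for {
        u <- s
        v <- queryOutNeighbors(u) if s.contains(v)   // keep edges whose other endpoint is in s
      } yield (u, v)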

Vertices / Nodes

• Vertices are partitioned similarly to edges
  – Similar "data shards" for columns
• Lookup/update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
  – A sparse structure could be supported

[Diagram: vertex-ID intervals 1 … a, a+1 … 2a, …, up to Max-id.]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance the shards
  – Interval length is a constant a

[Diagram: original IDs 0, 1, 2, …, 255 and 256, 257, 258, …, 511 shown against internal IDs 0, 1, 2, …; a, a+1, a+2, …; 2a, 2a+1, …, with interval length a.]

109

What if we have a Cluster

[Diagram: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state, …) in RAM.]

Trade latency for throughput.

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications
2. … which is caused by a lack of good data available to academics: big graphs with metadata
   – Industry co-operation? But a problem with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just the structure)
3. Too much focus on performance. More important to enable "extracting value".

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

    parfor walk in walks:
      for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())


Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java, 2009

Distributed graph systems: each hop across a partition boundary is costly.
Disk-based "single-machine" graph systems: "paging" from disk is costly.

(This talk)


Random walks in GraphChi

• DrunkardMob algorithm
  – Reverse thinking:

    ForEach interval p:
      walkSnapshot = getWalksForInterval(p)
      ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
          walkManager.addHop(walk, vertex.randomNeighbor())

Note: only the current position of each walk needs to be stored.
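Since only the current position of a walk is needed, a walk can be kept as one small fixed-size record. The sketch below packs a (source index, hop count, current vertex offset) triple into a single Long; this is one possible encoding for illustration, not necessarily the thesis's actual bit layout.

    // Illustrative packing of a walk's state into one Long:
    // 24 bits source index | 8 bits hop count | 32 bits current vertex offset.
    object WalkEncoding {
      def encode(sourceIdx: Int, hop: Int, vertexOffset: Int): Long =
        (sourceIdx.toLong << 40) | ((hop.toLong & 0xFF) << 32) | (vertexOffset.toLong & 0xFFFFFFFFL)

      def sourceIdx(w: Long): Int    = (w >>> 40).toInt
      def hop(w: Long): Int          = ((w >>> 32) & 0xFF).toInt
      def vertexOffset(w: Long): Int = (w & 0xFFFFFFFFL).toInt
    }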

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 34: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

34

PSW amp External Memory Algorithms Research

bull PSW is a new technique for implementing many fundamental graph algorithmsndash Especially simple (compared to previous work)

for directed graph problems PSW handles both in- and out-edges

bull We propose new graph contraction algorithm based on PSWndash Minimum-Spanning Forest amp Connected

Components

bull hellip utilizing the Gauss-Seidel ldquoaccelerationrdquo

Joint work Julian Shun

SEA 2014 Chapter 6 + Appendix

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 35: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

35

GRAPHCHI SYSTEM EVALUATION

Sneak peek

Consult the paper for a comprehensive evaluationbull HD vs SSDbull Striping data across multiple hard

drivesbull Comparison to an in-memory

versionbull Bottlenecks analysisbull Effect of the number of shardsbull Block size and performance

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle Counting, Item-Item Similarity

PageRank, SALSA, HITS

Graph Contraction, Minimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected Components, Strongly Connected Components

Community Detection, Label Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief Propagation, Co-EM

Graph traversal

45

Research Questions

• What if there is a lot of metadata associated with edges and vertices?

• How to do graph queries efficiently while retaining computational capabilities?

• How to add edges efficiently to the graph?

Can we design a graph database based on GraphChi?

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problems:
• Poor performance with data >> memory
• No/weak support for analytical computation

2) Relational / key-value databases as graph storage

Problems:
• Large indices
• In-edge / out-edge dilemma
• No/weak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

[Figure: edges divided into shards Shard 1, Shard 2, Shard 3, ..., Shard P by vertex intervals 0..100, 101..700, ..., N; within each shard, edges are sorted by source_id.]

Edge = (src, dst). Partition by dst.

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

… … (src, dst)

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

…

Source File offset

1 0

3 3

7 5

… …

src dst

Compressed Sparse Row (CSR)

Problem 1:
How to find in-edges of a vertex quickly?
Note: we know the shard, but the edges are in random order.

Pointer-array
Edge-array
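A minimal sketch of this CSR-style shard layout in Scala (illustrative names, not the actual GraphChi-DB code), using the example values from the tables above: the pointer array maps a source vertex to the offset of its first edge, and a shard-local out-edge query reads the contiguous range up to the next pointer.

case class ShardCSR(srcPointers: Vector[(Long, Int)],   // (source id, offset of its first edge)
                    edgeDst: Vector[Long]) {            // destination ids, grouped by source

  // Out-edges of `src` stored in this shard: find the pointer entry, then read
  // the contiguous slice that ends at the next source's offset.
  def outEdgesInShard(src: Long): Seq[Long] = {
    val i = srcPointers.indexWhere(_._1 == src)
    if (i < 0) Nil
    else {
      val start = srcPointers(i)._2
      val end   = if (i + 1 < srcPointers.length) srcPointers(i + 1)._2 else edgeDst.length
      edgeDst.slice(start, end)
    }
  }
}

object ShardCSRDemo extends App {
  val shard = ShardCSR(
    srcPointers = Vector((1L, 0), (3L, 3), (7L, 5)),
    edgeDst     = Vector(8L, 193L, 76420L, 12L, 872L, 193L, 212L, 89139L))
  println(shard.outEdgesInShard(3L))   // Vector(12, 872)
}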

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

…

Source File offset

1 0

3 3

7 5

… …

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

…

Source File offset

1 0

3 3

7 5

… …

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2:
How to find out-edges quickly?
Note: sorted inside a shard, but partitioned across all shards.

Augmented linked list for in-edges
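A sketch of how such an in-edge chain can be followed inside one shard (Scala; the function names are assumptions for illustration, not the real API): start from the per-vertex index entry and hop along the stored link offsets.

object InEdgeChain {
  // firstInEdge(v): offset of v's first in-edge in this shard, or -1 if none.
  // link(off):      offset of the next in-edge with the same destination, stored on the edge row.
  // src(off):       source id of the edge row at `off`.
  def inEdgeSources(firstInEdge: Long => Long,
                    link: Long => Long,
                    src: Long => Long,
                    v: Long): List[Long] = {
    val sources = scala.collection.mutable.ListBuffer[Long]()
    var off = firstInEdge(v)
    while (off >= 0) {      // a negative offset terminates the chain
      sources += src(off)
      off = link(off)       // jump to v's next in-edge within this shard
    }
    sources.toList
  }
}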

53

Source File offset

1 0

3 3

7 5

… …

PAL Out-edge Queries

Pointer-array

Edge-array

Problem: can be big, O(V). Binary search on disk is slow.

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

…

256

512

1024

…

Option 1: Sparse index (in-mem)

Option 2: Delta-coding with unary code (Elias-Gamma). Completely in-memory.
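A small sketch of Option 2: delta-coding the pointer data with Elias-Gamma (unary length prefix followed by the binary value). This is illustrative only, under the assumption of a plain gap encoding; the actual encoder in GraphChi-DB may differ in details.

object GammaPointerArray extends App {
  import scala.collection.mutable.ArrayBuffer

  // gamma(n > 0) = floor(log2 n) zeros, then n in binary (MSB first).
  def gammaEncode(values: Seq[Long]): Vector[Boolean] = {
    val bits = ArrayBuffer[Boolean]()
    for (n <- values) {
      require(n > 0)
      val len = 63 - java.lang.Long.numberOfLeadingZeros(n)     // floor(log2 n)
      bits ++= Seq.fill(len)(false)                              // unary length prefix
      for (i <- len to 0 by -1) bits += ((n >> i) & 1L) == 1L    // binary digits of n
    }
    bits.toVector
  }

  def gammaDecode(bits: IndexedSeq[Boolean], count: Int): Seq[Long] = {
    var pos = 0
    (1 to count).map { _ =>
      var len = 0
      while (!bits(pos)) { len += 1; pos += 1 }                  // read unary prefix
      var n = 0L
      for (_ <- 0 to len) { n = (n << 1) | (if (bits(pos)) 1L else 0L); pos += 1 }
      n
    }
  }

  // Store gaps between consecutive entries instead of absolute values:
  // small gaps take only a few bits, so the pointer data stays in memory.
  val sources = Seq(1L, 3L, 7L, 12L)
  val gaps    = sources.head +: sources.sliding(2).map(p => p(1) - p(0)).toSeq
  val bits    = gammaEncode(gaps)
  assert(gammaDecode(bits, gaps.size) == gaps)                   // round-trip check
}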

54

Experiment Indices

Median latency, Twitter graph, 1.5B edges

55

Queries: I/O costs
In-edge query: only one shard
Out-edge query: each shard that has edges

Trade-off: more shards = better locality for in-edge queries, worse for out-edge queries.

Edge Data & Searches

Shard X - adjacency

'weight' [float]

'timestamp' [long], 'belief' (factor)

Edge (i)

No foreign key required to find edge data (fixed size).

Note: vertex values are stored similarly.
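A sketch of the fixed-size column idea (Scala; class and file name are hypothetical): because every value in a column has the same size, the value of edge i is located by direct offset, with no foreign key.

import java.io.RandomAccessFile

// One column file of 4-byte floats, e.g. the 'weight' column of a shard.
class FloatColumn(path: String) {
  private val file = new RandomAccessFile(path, "rw")
  def set(edgeIndex: Long, value: Float): Unit = {
    file.seek(edgeIndex * 4L)      // fixed-size slot: offset = edge index * 4 bytes
    file.writeFloat(value)
  }
  def get(edgeIndex: Long): Float = {
    file.seek(edgeIndex * 4L)
    file.readFloat()
  }
  def close(): Unit = file.close()
}

// Usage (hypothetical file name):
//   val weights = new FloatColumn("shardX.weight.col")
//   weights.set(42L, 0.75f); println(weights.get(42L)); weights.close()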

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading the existing shard from disk. Each edge is rewritten on every merge.

Does not scale: number of rewrites = data size / buffer size = O(E).

60

Log-Structured Merge-tree

(LSM). Ref: O'Neil, Cheng et al. (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1, interval 2, interval 3, interval 4, ..., interval P-1, interval P

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3 (oldest)

In-edge query: one shard on each level

Out-edge query: all shards
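A toy sketch of the LSM idea used for ingest (Scala; heavily simplified: flat sorted runs stand in for the real partitioned shards). New edges go to an in-memory buffer; a full buffer becomes a level-1 run; when a level fills up, its runs are merged downstream into one bigger run on the next level.

object LsmSketch {
  case class Edge(src: Long, dst: Long)

  class LsmIngest(bufferCapacity: Int, fanout: Int = 4) {
    private val buffer = scala.collection.mutable.ArrayBuffer[Edge]()
    // levels(0) is the youngest level; each entry is a list of sorted runs ("shards").
    private val levels = scala.collection.mutable.ArrayBuffer[List[Vector[Edge]]]()

    def addEdge(src: Long, dst: Long): Unit = {
      buffer += Edge(src, dst)
      if (buffer.size >= bufferCapacity) flush()
    }

    private def flush(): Unit = {
      var run = buffer.sortBy(e => (e.dst, e.src)).toVector   // new sorted run
      buffer.clear()
      var level = 0
      while (true) {
        if (levels.size <= level) levels += Nil
        levels(level) = run :: levels(level)
        if (levels(level).size < fanout) return                // this level still has room
        // Downstream merge: combine the level's runs into one run for the next level.
        run = levels(level).flatten.sortBy(e => (e.dst, e.src)).toVector
        levels(level) = Nil
        level += 1
      }
    }

    def runs: Seq[Vector[Edge]] = levels.flatten.toSeq
  }
}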

61

Experiment Ingest

[Chart: ingested edges (0 to 1.5E+09) vs. time in seconds (0 to 16,000), for GraphChi-DB with LSM and GraphChi-No-LSM.]

62

Advantages of PAL

• Only sparse and implicit indices
  – Pointer-array usually fits in RAM with Elias-Gamma
  – Small database size
• Columnar data model
  – Load only the data you need
  – Graph structure is separate from data
  – Property graph model
• Great insertion throughput with LSM
  – Tree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

• Written in Scala
• Queries & Computation
• Online database

All experiments shown in this talk were done on a Mac Mini (8 GB, SSD).

Source code and examples: http://github.com/graphchi

65

Comparison Database Size

Baseline: 4 + 4 bytes / edge

[Chart: database file size for the twitter-2010 graph (1.5B edges), scale 0 to 70; bars for MySQL (data + indices), Neo4j, GraphChi-DB, and the baseline.]

66

Comparison Ingest

System | Time to ingest 1.5B edges
GraphChi-DB (ONLINE) | 1 hour 45 mins
Neo4j (batch) | 45 hours
MySQL (batch) | 3 hours 30 minutes (including index creation)

If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes.

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

[Charts: friends-of-friends query latency percentiles over 100K random queries, in milliseconds.
Small graph (68M edges), 50th percentile: GraphChi-DB 0.379, Neo4j 0.127.
Small graph (68M edges), 99th percentile: GraphChi-DB 6.653, Neo4j 8.078.
Big graph (1.5B edges), 50th percentile: GraphChi-DB vs. Neo4j vs. MySQL.
Big graph (1.5B edges), 99th percentile: GraphChi-DB 1264, Neo4j 1631, MySQL 4776.]

68

LinkBench Online Graph DB Benchmark by Facebook

• Concurrent read/write workload
  – But only single-hop queries ("friends")
  – 8 different operations, mixed workload
  – Best performance with 64 parallel threads
• Each edge and vertex has:
  – Version, timestamp, type, random string payload

 | GraphChi-DB (Mac Mini) | MySQL + FB patch (server: SSD-array, 144 GB RAM)
Edge-update (95p) | 22 ms | 25 ms
Edge-get-neighbors (95p) | 18 ms | 9 ms
Avg throughput | 2,487 req/s | 11,029 req/s
Database size | 350 GB | 1.4 TB

See full results in the thesis.

69

Summary of Experiments

• Efficient for mixed read/write workload
  – See Facebook LinkBench experiments in the thesis
  – LSM-tree trades off read performance (but can be adjusted)
• State-of-the-art performance for graphs that are much larger than RAM
  – Neo4j's linked-list data structure is good for RAM
  – DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater Impact / Hindsight

Future Research Questions

71

GREATER IMPACT

72

Impact: "Big Data" Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)
  – Two major direct descendant papers in top conferences:
    • X-Stream: SOSP 2013
    • TurboGraph: KDD 2013
• Challenging the mainstream:
  – You can do a lot on just a PC, with focus on the right data structures and computational models

73

Impact Users

• GraphChi's implementations (C++, Java, -DB) have gained a lot of users
  – Currently ~50 unique visitors / day
• Enables 'everyone' to tackle big graph problems
  – Especially the recommender toolkit (by Danny Bickson) has been very popular
  – Typical users: students, non-systems researchers, small companies…

74

Impact Users (cont)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research - it is too expensive.

Using GraphChi, I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research." ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for?

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models

1. Changing the value of an edge (both in- and out-edges)
2. Computation processes the whole graph, or most of it, on each iteration
3. Random access to all of a vertex's edges
   – Vertex-centric vs. edge-centric
4. Async / Gauss-Seidel execution

u v

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
  – Or dynamic scheduling (such as Splash BP)
• High-diameter graphs, such as planar graphs
  – Unless the computation itself has short-range interactions
• Very large number of iterations
  – Neural networks
  – LDA with Collapsed Gibbs sampling
• No support for implicit graph structure

+ Single-PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB (Partitioned Adjacency Lists)

Online graph updates

Batch comp.

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle Counting, Item-Item Similarity

PageRank, SALSA, HITS

Graph Contraction, Minimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected Components, Strongly Connected Components

Community Detection, Label Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief Propagation, Co-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

• Distributed setting
  1. Distributed PSW (one shard / node)
     1. PSW is inherently sequential
     2. Low bandwidth in the Cloud
  2. Co-operating GraphChi(-DB)'s connected with a Parameter Server
• New graph programming models and tools
  – Vertex-centric programming is sometimes too local: for example, two-hop interactions and many traversals are cumbersome
  – Abstractions for learning graph structure; implicit graphs
  – Hard to debug, especially async; better tools needed
• Graph-aware optimizations to GraphChi-DB
  – Buffer management
  – Smart caching
  – Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM
  – Good: works with very little memory
  – Bad: does not benefit from more memory
• Simple trick: cache some data
• Many graph algorithms have O(V) state
  – Update function accesses neighbor vertex state
    • Standard PSW: 'broadcast' vertex value via edges
    • Semi-external: store vertex values in memory
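As a concrete illustration of the semi-external trick, here is a minimal synchronous PageRank sketch in Scala that keeps only the O(V) rank arrays in RAM and streams the edges from disk once per iteration. This is a simplification, not GraphChi's actual (asynchronous, vertex-centric) implementation, and streamEdges is a stand-in for the shard-by-shard edge pass.

object SemiExternalPageRank {
  // streamEdges(f) must call f(src, dst) once for every edge, e.g. by scanning
  // the shards on disk sequentially; only the two rank arrays live in memory.
  def pagerank(numVertices: Int,
               outDegree: Array[Int],
               streamEdges: ((Int, Int) => Unit) => Unit,
               iterations: Int = 5,
               d: Double = 0.85): Array[Double] = {
    var ranks = Array.fill(numVertices)(1.0 / numVertices)
    for (_ <- 1 to iterations) {
      val next = Array.fill(numVertices)((1.0 - d) / numVertices)
      streamEdges { (src, dst) =>
        next(dst) += d * ranks(src) / outDegree(src)   // scatter along each streamed edge
      }
      ranks = next
    }
    ranks
  }
}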

82

Using RAM efficiently

• Assume that there is enough RAM to store many O(V) algorithm states in memory
  – But not enough to store the whole graph

[Figure: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state) in RAM.]

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
  – Store billions of random walk states in RAM
• Multiple Breadth-First Searches
  – Analyze neighborhood sizes by starting hundreds of random BFSes
• Compute in parallel many different recommender algorithms (or with different parameterizations)
  – See Mayank Mohta and Shu-Hao Yu's Master's project

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin) | UAI 2010
Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors) | VLDB 2012
GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch) | OSDI 2012
DrunkardMob: Billions of Random Walks on Just a PC | ACM RecSys 2013
Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch) | SEA 2014
GraphChi-DB: Simple Design for a Scalable Graph Database - on Just a PC (with C. Guestrin) | (submitted / arXiv)
Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin) | ICML 2011

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation: GraphChi
• Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database: GraphChi-DB
• Proposed DrunkardMob for simulating billions of random walks in parallel
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms: a new approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

 | GraphChi (40 Mac Minis) | PowerGraph (64 EC2 cc1.4xlarge)
Investments | 67,320 $ | -
Operating costs:
Per node / hour | 0.03 $ | 1.30 $
Cluster / hour | 1.19 $ | 52.00 $
Daily | 28.56 $ | 1,248.00 $

It takes about 56 days to recoup the Mac Mini investments.

Assumptions:
- Mac Mini: 85 W (typical servers: 500-1000 W)
- Most expensive US energy: 35c / kWh
- Equal-throughput configurations (based on OSDI '12)

89

PSW for In-memory Computation

• External-memory setting
  – Slow memory = hard disk / SSD
  – Fast memory = RAM
• In-memory
  – Slow = RAM
  – Fast = CPU caches

[Figure: small (fast) memory of size M; large (slow) memory of size 'unlimited'; block transfers of size B between them; CPU next to the fast memory.]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra*, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

* Julian Shun, Guy Blelloch, PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
  – But needs twice the amount of space
• Async / Gauss-Seidel helps with high-diameter graphs
• Some algorithms converge much better asynchronously
  – Loopy BP: see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)
• Asynchronous is sometimes difficult to reason about and debug
• Asynchronous can be used to implement BSP

IO Complexity

• See the paper for theoretical analysis in the Aggarwal-Vitter I/O model
  – Worst case is only 2x the best case
• Intuition:

[Figure: vertices 1..|V| split into interval(1), interval(2), ..., interval(P), with shard(1), shard(2), ..., shard(P). An edge (v1, v2) whose endpoints are in the same interval is loaded from disk only once per iteration; an edge spanning two intervals is loaded twice per iteration.]
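Reading the intuition above into a rough bound (a back-of-the-envelope sketch, not the paper's exact theorem): with block size B (in edges) and P shards, every edge is read once or twice and written once or twice per iteration, plus the sliding windows cost on the order of P^2 non-sequential reads.

% Rough per-iteration I/O of PSW in the Aggarwal-Vitter model (sketch only):
\[
  \frac{2\,|E|}{B} \;\le\; \text{block transfers per iteration} \;\le\; \frac{4\,|E|}{B} + \Theta\!\left(P^2\right)
\]
% which is why the worst case is only about 2x the best case.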

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise they would require too many iterations
• Per-iteration cost is not much affected
• Can we optimize partitioning?
  – Could help thanks to Gauss-Seidel (faster convergence inside "groups"); topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
  – But they require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)
• Compression is problematic for evolving graphs and associated data
• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• 1990s and 2000s saw interest in object-oriented and graph databases
  – GOOD, GraphDB, HyperGraphDB, …
  – Focus was on modeling; graph storage on top of a relational DB or key-value store
• RDF databases
  – Most do not use graph storage but store triples as relations + use indexing
• Modern solutions have proposed graph-specific storage
  – Neo4j: doubly linked list
  – TurboGraph: adjacency list chopped into pages
  – DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time: 9 hours; FB's: 12 hours
• GraphChi database about 250 GB; FB: > 1.4 terabytes
  – However, about 100 GB is explained by different variable data (payload) size
• Facebook: MySQL via JDBC; GraphChi: embedded
  – But MySQL is native code; GraphChi-DB is Scala (JVM)
• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance / Size of DB

[Chart: requests / sec (0 to 10,000) as a function of the number of nodes (1E+07 to 1E+09).]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory extension [SSDAlloc, NSDI'11]
   – Too many small objects; need millions / sec
2. Compress the graph structure to fit into RAM [WebGraph framework]
   – Associated values do not compress well and are mutated
3. Cluster the graph and handle each cluster separately in RAM
   – Expensive: the number of inter-cluster edges is big
4. Caching of hot nodes
   – Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance.

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks: parallel I/O.

[Chart: seconds / iteration (0 to 3,000) for Pagerank and Connected components with 1, 2, and 3 hard drives.]

Experiment on a 16-core AMD server (from year 2007).

Bottlenecks

[Chart: Connected Components on Mac Mini (SSD); runtime in seconds (0 to 2,500) with 1, 2, and 4 threads, broken into Disk IO, Graph construction, and Exec. updates.]

• The cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
  – Graph construction requires a lot of random access in RAM: memory bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime in seconds vs. number of threads (1, 2, 4), split into Loading and Computation, for Matrix Factorization (ALS) and Connected Components.]

104

In-memory vs Disk

105

Experiment Query latency

See thesis for I/O cost analysis of in/out queries.

106

Example Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S
• Very fast in GraphChi-DB
  – Sufficient to query for out-edges
  – Parallelizes well: multi-out-edge-query
• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods from the graph
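A sketch of the induced-subgraph query (Scala; outEdges is a stand-in for the database's out-edge query, not the real GraphChi-DB API):

object InducedSubgraph {
  // All edges with both endpoints in S: out-edge queries for the vertices of S suffice.
  def induced(s: Set[Long], outEdges: Long => Seq[Long]): Seq[(Long, Long)] =
    s.toSeq.flatMap(u => outEdges(u).filter(s.contains).map(v => (u, v)))

  // Example: induced(Set(1L, 2L, 3L), v => adjacency.getOrElse(v, Nil))
}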

Vertices / Nodes

• Vertices are partitioned similarly to edges
  – Similar "data shards" for columns
• Lookup/update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
  – Sparse structure could be supported

[Figure: vertex id intervals 1..a, a+1..2a, ..., up to Max-id.]

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511
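A sketch of this mapping in Scala (the chunk size of 256 is taken from the figure; the rest of the constants are assumptions): consecutive 256-id chunks of original ids are spread round-robin over intervals of length a, so that densely connected low ids do not all land in the same shard.

object IdMapping {
  val chunk = 256L

  def toInternal(originalId: Long, a: Long, numIntervals: Int): Long = {
    val c        = originalId / chunk            // which 256-id chunk
    val interval = c % numIntervals              // chunks round-robin over intervals
    val slot     = (c / numIntervals) * chunk + originalId % chunk
    interval * a + slot
  }

  def toOriginal(internalId: Long, a: Long, numIntervals: Int): Long = {
    val interval = internalId / a
    val slot     = internalId % a
    val c        = (slot / chunk) * numIntervals + interval
    c * chunk + slot % chunk
  }

  def main(args: Array[String]): Unit =          // round-trip sanity check
    (0L until 4096L).foreach(i => assert(toOriginal(toInternal(i, 1L << 20, 4), 1L << 20, 4) == i))
}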

109

What if we have a Cluster

[Figure: as in the semi-external setting, the graph working memory (PSW) is on disk and the computational state (Computation 1 state, Computation 2 state) is in RAM.]

Trade latency for throughput.

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications
2. … which is caused by the lack of good data available to academics: big graphs with metadata
   – Industry co-operation? But problem with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just the structure)
3. Too much focus on performance. More important to enable "extracting value".

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java, 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems: "paging" from disk is costly.

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
  – Reverse thinking:

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk.
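A compact sketch of the same idea in Scala (illustrative only; the real DrunkardMob uses an optimized walk encoding and GraphChi's shard loading): walks are bucketed by the interval of their current vertex, and one interval's walks are advanced together while that interval's edges are in memory.

object DrunkardMobSketch {
  // walksAt(v):        indices of walks currently at vertex v (for vertices of the loaded interval).
  // randomNeighbor(v): pick a random out-neighbor of v from the loaded sub-graph (None if none).
  // moveWalk(w, u):    record that walk w's new position is u (re-bucketed by u's interval).
  def advanceInterval(verticesInInterval: Seq[Long],
                      walksAt: Long => Seq[Int],
                      randomNeighbor: Long => Option[Long],
                      moveWalk: (Int, Long) => Unit): Unit =
    for (v <- verticesInInterval; w <- walksAt(v); u <- randomNeighbor(v))
      moveWalk(w, u)   // only the current position of each walk is stored
}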

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • "Gauss-Seidel" / Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW & External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM), Ref: O'Neil, Cheng et al. (1996)
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 36: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

37

GraphChi

bull C++ implementation 8000 lines of codendash Java-implementation also available

bull Several optimizations to PSW (see paper)

Source code and exampleshttpgithubcomgraphchi

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 37: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

38

Experiment Setting

bull Mac Mini (Apple Inc)ndash 8 GB RAMndash 256 GB SSD 1TB hard drivendash Intel Core i5 25 GHz

bull Experiment graphs

Graph Vertices Edges P (shards) Preprocessing

live-journal 48M 69M 3 05 min

netflix 05M 99M 20 1 min

twitter-2010 42M 15B 20 2 min

uk-2007-05 106M 37B 40 31 min

uk-union 133M 54B 50 33 min

yahoo-web 14B 66B 50 37 min

39

Comparison to Existing Systems

Notes comparison results do not include time to transfer the data to cluster preprocessing or the time to load the graph from disk GraphChi computes asynchronously while all but GraphLab synchronously

PageRank

See the paper for more comparisons

WebGraph Belief Propagation (U Kang et al)

Matrix Factorization (Alt Least Sqr) Triangle Counting

0 2 4 6 8 10 12

GraphLab v1 (8

cores)

GraphChi (Mac Mini)

Netflix (99B edges)

Minutes

0 2 4 6 8 10 12 14

Spark (50 machines)

GraphChi (Mac Mini)

Twitter-2010 (15B edges)

Minutes0 5 10 15 20 25 30

Pegasus Hadoop (100 ma-chines)

GraphChi (Mac Mini)

Yahoo-web (67B edges)

Minutes

0 50 100 150 200 250 300 350 400 450

Hadoop (1636 ma-

chines)

GraphChi (Mac Mini)

twitter-2010 (15B edges)

Minutes

On a Mac MiniGraphChi can solve as big problems as

existing large-scale systemsComparable performance

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

• Only sparse and implicit indices
  – Pointer-array usually fits in RAM with Elias-Gamma coding
  – Small database size
• Columnar data model
  – Load only the data you need
  – Graph structure is separate from the data
  – Property graph model
• Great insertion throughput with LSM
  – Tree can be adjusted to match the workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

• Written in Scala
• Queries & computation
• Online database

All experiments shown in this talk were done on a Mac Mini (8 GB RAM, SSD).

Source code and examples: http://github.com/graphchi

65

Comparison: Database Size

[Bar chart: database file size for the twitter-2010 graph (1.5B edges), scale 0-70; bars for MySQL (data + indices), Neo4j, GraphChi-DB, and a baseline of 4 + 4 bytes per edge.]

66

Comparison: Ingest

System                 Time to ingest 1.5B edges
GraphChi-DB (online)   1 hour 45 mins
Neo4j (batch)          45 hours
MySQL (batch)          3 hours 30 minutes (including index creation)

If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes.

Comparison: Friends-of-Friends Query

See thesis for the shortest-path comparison.

[Bar charts: 50th- and 99th-percentile friends-of-friends query latency in milliseconds, for GraphChi-DB and Neo4j on a small graph (68M edges) and for GraphChi-DB, Neo4j and MySQL on a big graph (1.5B edges); latency percentiles over 100K random queries.]

68

LinkBench: Online Graph DB Benchmark by Facebook

• Concurrent read/write workload
  – But only single-hop queries ("friends")
  – 8 different operations, mixed workload
  – Best performance with 64 parallel threads
• Each edge and vertex has:
  – Version, timestamp, type, random string payload

                           GraphChi-DB (Mac Mini)    MySQL + FB patch (server, SSD-array, 144 GB RAM)
Edge-update (95p)          22 ms                     25 ms
Edge-get-neighbors (95p)   18 ms                     9 ms
Avg throughput             2,487 req/s               11,029 req/s
Database size              350 GB                    1.4 TB

See full results in the thesis

69

Summary of Experiments

• Efficient for mixed read/write workload
  – See Facebook LinkBench experiments in the thesis
  – LSM-tree trades off read performance (but can be adjusted)
• State-of-the-art performance for graphs that are much larger than RAM
  – Neo4j's linked-list data structure is good for RAM
  – DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater Impact / Hindsight

Future Research Questions

71

GREATER IMPACT

72

Impact: "Big Data" Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)
  – Two major direct descendant papers in top conferences:
    • X-Stream: SOSP 2013
    • TurboGraph: KDD 2013
• Challenging the mainstream
  – You can do a lot on just a PC: focus on the right data structures and computational models

73

Impact: Users

• GraphChi's implementations (C++, Java, -DB) have gained a lot of users
  – Currently ~50 unique visitors / day
• Enables 'everyone' to tackle big graph problems
  – Especially the recommender toolkit (by Danny Bickson) has been very popular
  – Typical users: students, non-systems researchers, small companies…

74

Impact: Users (cont.)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research - it is too expensive.

Using GraphChi I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research."

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for?

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models

1. Changing the value of an edge (both in- and out-edges)
2. Computation processes the whole graph, or most of it, on each iteration
3. Random access to all of a vertex's edges
   – Vertex-centric vs. edge-centric
4. Async / Gauss-Seidel execution

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
  – Or dynamic scheduling (such as Splash BP)
• High-diameter graphs, such as planar graphs
  – Unless the computation itself has short-range interactions
• Very large number of iterations
  – Neural networks
  – LDA with Collapsed Gibbs sampling
• No support for implicit graph structure

+ Single-PC performance is limited

78

Versatility of PSW and PAL

[Diagram: applications supported by GraphChi (Parallel Sliding Windows) and GraphChi-DB (Partitioned Adjacency Lists): online graph updates, batch computation, evolving graphs, incremental computation, DrunkardMob parallel random walk simulation, GraphChi^2; graph computation such as triangle counting, item-item similarity, PageRank, SALSA, HITS, graph contraction, minimum spanning forest, k-core, matrix factorization, loopy belief propagation, co-EM, weakly and strongly connected components, community detection, label propagation, multi-BFS; and graph queries such as shortest path, friends-of-friends, induced subgraphs, neighborhood queries, edge and vertex properties, link prediction, graph sampling, and graph traversal.]

79

Future Research Directions

• Distributed setting
  1. Distributed PSW (one shard / node)?
     1. PSW is inherently sequential
     2. Low bandwidth in the Cloud
  2. Co-operating GraphChi(-DB)'s connected with a Parameter Server
• New graph programming models and tools
  – Vertex-centric programming is sometimes too local: for example, two-hop interactions and many traversals are cumbersome
  – Abstractions for learning graph structure? Implicit graphs?
  – Hard to debug, especially async; better tools needed
• Graph-aware optimizations to GraphChi-DB
  – Buffer management
  – Smart caching
  – Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM
  – Good: works with very little memory
  – Bad: does not benefit from more memory
• Simple trick: cache some data
• Many graph algorithms have O(V) state
  – The update function accesses neighbor vertex state
• Standard PSW: 'broadcast' the vertex value via edges
• Semi-external: store vertex values in memory

82

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
  – But not enough to store the whole graph

[Figure: the graph and PSW working memory stay on disk; computational state (Computation 1 state, Computation 2 state, ...) is kept in RAM.]

83
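A minimal sketch of the semi-external trick under the assumption that vertex ids are dense 0..numVertices-1 integers: one float of per-vertex state lives in RAM while edges stream from disk with PSW. The class and the toy averaging rule are illustrative, not the GraphChi API.

class SemiExternalState(numVertices: Int) {
  private val values = new Array[Float](numVertices)   // O(V) state, kept in RAM

  def get(v: Int): Float = values(v)
  def set(v: Int, x: Float): Unit = values(v) = x

  // An update function can read its neighbors' values directly from RAM,
  // instead of 'broadcasting' them along the on-disk edges.
  def update(v: Int, neighbors: Seq[Int]): Unit = {
    if (neighbors.nonEmpty) {
      val avg = neighbors.map(get).sum / neighbors.size   // toy rule, shows the access pattern
      set(v, 0.5f * get(v) + 0.5f * avg)
    }
  }
}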

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
  – Store billions of random walk states in RAM
• Multiple Breadth-First Searches
  – Analyze neighborhood sizes by starting hundreds of random BFSes
• Compute many different recommender algorithms in parallel (or the same one with different parameterizations)
  – See Mayank Mohta and Shu-Hao Yu's Master's project

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin). UAI 2010.
Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors). VLDB 2012.
GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch). OSDI 2012.
DrunkardMob: Billions of Random Walks on Just a PC. ACM RecSys 2013.
Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch). SEA 2014.
GraphChi-DB: Simple Design for a Scalable Graph Database - on Just a PC (with C. Guestrin). Submitted (arXiv).
Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin). ICML 2011.

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external memory graph computation: GraphChi
• Extended PSW to design Partitioned Adjacency Lists for building a scalable graph database: GraphChi-DB
• Proposed DrunkardMob for simulating billions of random walks in parallel
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms; a new approach for external-memory graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                   GraphChi (40 Mac Minis)   PowerGraph (64 EC2 cc1.4xlarge)
Investments        67,320 $                  -
Operating costs:
  Per node, hour   0.03 $                    1.30 $
  Cluster, hour    1.19 $                    52.00 $
  Daily            28.56 $                   1,248.00 $

It takes about 56 days to recoup the Mac Mini investments.

Assumptions:
- Mac Mini: 85 W (typical servers: 500-1000 W)
- Most expensive US energy: 35c / kWh
- Equal-throughput configurations (based on OSDI'12)

89
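The 56-day figure follows from the investment divided by the difference in daily operating cost (a quick check of the table's numbers):

\[
\frac{\$67{,}320}{\$1{,}248.00/\text{day} - \$28.56/\text{day}} \approx 55.2\ \text{days} \approx 56\ \text{days}.
\]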

PSW for In-memory Computation

• External memory setting
  – Slow memory = hard disk / SSD
  – Fast memory = RAM
• In-memory
  – Slow = RAM
  – Fast = CPU caches

[Figure: a CPU with a small (fast) memory of size M and a large (slow) memory of 'unlimited' size, connected by block transfers of size B.]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra*, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

* Julian Shun, Guy Blelloch: PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
  – But needs twice the amount of space
• Async / Gauss-Seidel helps with high-diameter graphs
• Some algorithms converge much better asynchronously
  – Loopy BP: see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)
• Asynchronous is sometimes difficult to reason about and debug
• Asynchronous can be used to implement BSP

IO Complexity

• See the paper for a theoretical analysis in the Aggarwal-Vitter I/O model
  – Worst case only 2x the best case
• Intuition:

[Figure: vertex range 1..|V| split into interval(1), interval(2), ..., interval(P) with shard(1), shard(2), ..., shard(P). An edge whose endpoints lie in the same interval is loaded from disk only once per iteration; an edge spanning two intervals is loaded twice per iteration.]

93
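In symbols, the intuition above bounds the number of edge reads per iteration: with block size B, every edge is read either once or twice, so

\[
\frac{|E|}{B} \;\le\; \text{(blocks of edge data read per iteration)} \;\le\; \frac{2|E|}{B},
\]

which is why the worst case is only a factor of two from the best case.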

Impact of Graph Structure

• Algorithms with long-range information propagation need graphs of relatively small diameter; otherwise too many iterations would be required
• Per-iteration cost is not much affected by structure
• Can we optimize partitioning?
  – Could help thanks to Gauss-Seidel (faster convergence inside "groups"); topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., the WebGraph framework) can compress edges to 3-4 bits/edge (web graphs), ~10 bits/edge (social graphs)
  – But they require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)
• Compression is problematic for evolving graphs and associated data
• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases
  – GOOD, GraphDB, HyperGraphDB…
  – Focus was on modeling; graph storage was built on top of a relational DB or key-value store
• RDF databases
  – Most do not use graph storage, but store triples as relations + use indexing
• Modern solutions have proposed graph-specific storage
  – Neo4j: doubly linked list
  – TurboGraph: adjacency list chopped into pages
  – DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi-DB load time: 9 hours; FB's: 12 hours
• GraphChi-DB database: about 250 GB; FB: > 1.4 terabytes
  – However, about 100 GB is explained by a different variable-data (payload) size
• Facebook: MySQL via JDBC; GraphChi-DB: embedded
  – But MySQL is native code, GraphChi-DB Scala (JVM)
• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance / size of DB

[Chart: requests/sec (0 to 10,000) versus number of nodes (1.0E7 to 1.0E9).]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory-extension [SSDAlloc, NSDI'11]
   - Too many small objects; need millions / sec
2. Compress the graph structure to fit into RAM [WebGraph framework]
   - Associated values do not compress well and are mutated
3. Cluster the graph and handle each cluster separately in RAM
   - Expensive: the number of inter-cluster edges is big
4. Caching of hot nodes
   - Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance.

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks: parallel I/O.

[Chart: seconds per iteration (0 to 3000) for Pagerank and Connected components with 1, 2, and 3 hard drives. Experiment on a 16-core AMD server (from year 2007).]

Bottlenecks

[Chart: Connected Components on a Mac Mini (SSD); runtime (0 to 2500 s) broken into Disk IO, Graph construction, and Exec updates, for 1, 2, and 4 threads.]

• The cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
  – Graph construction requires a lot of random access in RAM; memory bandwidth becomes a bottleneck

Bottlenecks / Multicore

Experiment on a MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution.
• GraphChi saturates SSD I/O with 2 threads.

[Charts: runtime in seconds versus number of threads (1, 2, 4), broken into Loading and Computation, for Matrix Factorization (ALS) and Connected Components.]

104

In-memory vs Disk

105

Experiment: Query latency

See thesis for I/O cost analysis of in/out queries.

106

Example: Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S
• Very fast in GraphChi-DB
  – Sufficient to query for out-edges
  – Parallelizes well: multi-out-edge-query
• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods from the graph
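A sketch of the induced-subgraph query expressed on top of an out-edge query, as described above. outEdges is a stand-in for the database's (multi-)out-edge query, not an actual GraphChi-DB method name.

// All edges of the graph with both endpoints in S, obtained with out-edge queries only.
def inducedSubgraph(s: Set[Long], outEdges: Long => Seq[Long]): Seq[(Long, Long)] =
  s.toSeq.flatMap(v => outEdges(v).filter(s.contains).map(dst => (v, dst)))

// In GraphChi-DB the per-vertex lookups would be issued as one parallel
// multi-out-edge query; here they are just sequential calls.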

Vertices / Nodes

• Vertices are partitioned similarly to edges
  – Similar "data shards" for columns
• Lookup/update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
  – A sparse structure could be supported

[Figure: vertex id range split into intervals 1..a, a+1..2a, ..., up to Max-id.]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance the shards
  – Interval length is a constant a

[Table: original IDs 0, 1, 2, ..., 255 map to internal IDs 0, a, 2a, ...; original IDs 256, 257, 258, ..., 511 map to 1, a+1, 2a+1, ...]

109
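A sketch of one mapping consistent with the table above: original ids are dealt round-robin over the intervals (interval length a), so consecutive original ids land in different shards and the shards stay balanced. The exact formula GraphChi-DB uses may differ; this is only my reading of the figure.

// numIntervals * a is the capacity of the id space (e.g. numIntervals = 256 above).
def toInternal(origId: Long, numIntervals: Long, a: Long): Long =
  (origId % numIntervals) * a + (origId / numIntervals)

def toOriginal(internalId: Long, numIntervals: Long, a: Long): Long =
  (internalId % a) * numIntervals + (internalId / a)

// With numIntervals = 256: originals 0, 1, 2, ... map to 0, a, 2a, ...,
// and originals 256, 257, 258, ... map to 1, a + 1, 2a + 1, ..., as in the table.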

What if we have a Cluster?

[Figure: the graph and PSW working memory stay on disk; computational state (Computation 1 state, Computation 2 state, ...) is kept in RAM.]

Trade latency for throughput!

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications
2. … which is caused by the lack of good data available to academics: big graphs with metadata
   – Industry co-operation? But there is a problem with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just the structure)
3. Too much focus on performance; more important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())
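A runnable toy version of the baseline above, with the graph as a plain adjacency array. The names follow the pseudocode; since the walks are independent, the outer loop parallelizes trivially (the 'parfor').

import scala.util.Random

def simulateWalks(adj: Array[Array[Int]], starts: Array[Int], numSteps: Int,
                  rnd: Random = new Random(42)): Array[Int] = {
  val position = starts.clone()                       // current vertex of each walk
  for (w <- position.indices; _ <- 0 until numSteps) {
    val nbrs = adj(position(w))
    if (nbrs.nonEmpty)
      position(w) = nbrs(rnd.nextInt(nbrs.length))    // takeStep(randomNeighbor())
  }
  position                                            // final vertex of each walk
}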

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

[Image: Twitter network visualization by Akshay Java, 2009.]

Distributed graph systems: each hop across a partition boundary is costly.
Disk-based "single-machine" graph systems: "paging" from disk is costly.
(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
  – Reverse thinking:

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk.

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi



Page 39: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

40

PowerGraph Comparisonbull PowerGraph GraphLab

2 outperforms previous systems by a wide margin on natural graphs

bull With 64 more machines 512 more CPUsndash Pagerank 40x faster than

GraphChindash Triangle counting 30x

faster than GraphChi

OSDIrsquo12

GraphChi has good performance CPU

2

vs

GraphChi

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
– But not enough to store the whole graph

[Diagram: the graph and the PSW working memory stay on disk, while RAM holds the computational state of Computation 1, Computation 2, …]

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
– Store billions of random-walk states in RAM

• Multiple Breadth-First Searches
– Analyze neighborhood sizes by starting hundreds of random BFSes (see the sketch below)

• Compute in parallel many different recommender algorithms (or the same one with different parameterizations)
– See Mayank Mohta and Shu-Hao Yu's Master's project
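
A small sketch of the multi-BFS idea under the same semi-external assumption (illustrative names, not an actual GraphChi API): each vertex keeps a bit mask in RAM where bit i means "reached by BFS number i"; one pass per BFS level ORs the neighbors' masks into the vertex, and neighborhood sizes are read off by counting, per level, how many vertices gained each bit.

  object MultiBfsSketch {
    // Up to 64 simultaneous BFSes fit in one Long per vertex; hundreds of BFSes
    // simply need a few Longs per vertex.
    def makeUpdater(numVertices: Int, sources: Seq[Int]): (Int, Array[Int]) => Unit = {
      val reached = new Array[Long](numVertices)         // O(V) state in RAM
      sources.zipWithIndex.foreach { case (src, i) =>
        reached(src) = reached(src) | (1L << i)          // seed each BFS at its source
      }
      (vertexId: Int, inNeighbors: Array[Int]) => {
        var mask = reached(vertexId)
        for (nb <- inNeighbors) mask |= reached(nb)      // inherit every BFS that reached a neighbor
        reached(vertexId) = mask                         // iterate until no mask changes
      }
    }
  }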

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin) – UAI 2010

Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors) – VLDB 2012

GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch) – OSDI 2012

DrunkardMob: Billions of Random Walks on Just a PC – ACM RecSys 2013

Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch) – SEA 2014

GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin) – submitted (arXiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin) – ICML 2011

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation: GraphChi

• Extended PSW to design Partitioned Adjacency Lists, used to build a scalable graph database: GraphChi-DB

• Proposed DrunkardMob for simulating billions of random walks in parallel

• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms: a new approach for external-memory graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                        GraphChi (40 Mac Minis)   PowerGraph (64 EC2 cc1.4xlarge)
Investment              $67,320                   -
Operating costs:
  Per node / hour       $0.03                     $1.30
  Cluster / hour        $1.19                     $52.00
  Daily                 $28.56                    $1,248.00

It takes about 56 days to recoup the Mac Mini investment (a quick check of this figure follows below).

Assumptions: Mac Mini 85 W (typical servers 500-1000 W); most expensive US energy, 35 c / kWh.

Equal-throughput configurations (based on OSDI '12).
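
Using the numbers above, the daily operating-cost difference is 1,248.00 − 28.56 = 1,219.44 $/day, so the hardware investment is recovered in roughly

  67,320 / 1,219.44 ≈ 55.2 days ≈ 56 days

of continuous operation.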

89

PSW for In-memory Computation

• External-memory setting:
– Slow memory = hard disk / SSD
– Fast memory = RAM

• In-memory:
– Slow = RAM
– Fast = CPU caches

[Diagram: a small (fast) memory of size M, a large (slow) memory of 'unlimited' size, and the CPU, connected by block transfers of size B]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra*, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

* Ligra: Julian Shun, Guy Blelloch; PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
– But needs twice the amount of space

• Async / Gauss-Seidel helps with high-diameter graphs
• Some algorithms converge much better asynchronously
– Loopy BP: see Gonzalez et al. (2009)
– Also Bertsekas & Tsitsiklis, Parallel and Distributed Computation (1989)

• Asynchronous execution is sometimes difficult to reason about and debug

• Asynchronous can be used to implement BSP

IO Complexity

• See the paper for theoretical analysis in Aggarwal and Vitter's I/O model
– Worst case is only 2x the best case

• Intuition:

[Diagram: vertex intervals interval(1) … interval(P) over vertex ids 1 … |V|, with the corresponding shards shard(1) … shard(P)]

An edge whose endpoints fall in the same interval is loaded from disk only once per iteration; an edge spanning two intervals is loaded twice per iteration (a rough accounting follows below).
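
A back-of-the-envelope version of that intuition (a sketch only; the formal bounds are in the paper): let e_same be the number of edges whose endpoints are in the same interval and e_span the number of edges spanning two intervals. Then

  edge loads per iteration = e_same + 2·e_span,  with  |E| ≤ e_same + 2·e_span ≤ 2·|E|,

i.e. roughly between |E|/B and 2·|E|/B block transfers for block size B – which is where "worst case only 2x best case" comes from.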

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise they would require too many iterations

• Per-iteration cost is not much affected

• Can we optimize partitioning?
– Could help thanks to Gauss-Seidel (faster convergence inside "groups"); topological sort
– Likely too expensive to do on a single PC

94

Graph Compression Would it help

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
– But they require graph partitioning, which requires a lot of memory
– Compression of large graphs can take days (personal communication)

• Compression is problematic for evolving graphs and associated data

• GraphChi can be used to compress graphs
– Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases
– GOOD, GraphDB, HyperGraphDB…
– Focus was on modeling; graph storage was built on top of a relational DB or key-value store

• RDF databases
– Most do not use graph storage, but store triples as relations + use indexing

• Modern solutions have proposed graph-specific storage
– Neo4j: doubly linked list
– TurboGraph: adjacency list chopped into pages
– DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time 9 hours; FB's 12 hours
• GraphChi database about 250 GB; FB > 1.4 terabytes
– However, about 100 GB is explained by different variable-data (payload) size

• Facebook: MySQL via JDBC; GraphChi-DB embedded
– But MySQL is native code, GraphChi-DB is Scala (JVM)

• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance vs. size of DB

[Figure: LinkBench requests/sec as a function of the number of nodes (1e7, 1e8, 2e8, 1e9); y-axis 0-10,000 requests/sec]

Why is it not even faster when everything is in RAM? The requests are computationally bound.

Possible Solutions

1. Use SSD as a memory extension [SSDAlloc, NSDI '11]
   – Too many small objects; need millions / sec

2. Compress the graph structure to fit into RAM [WebGraph framework]
   – Associated values do not compress well, and are mutated

3. Cluster the graph and handle each cluster separately in RAM
   – Expensive: the number of inter-cluster edges is big

4. Caching of hot nodes
   – Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks: parallel I/O

[Figure: seconds / iteration (0-3000) for Pagerank and Connected Components with 1, 2, and 3 hard drives]

Experiment on a 16-core AMD server (from year 2007).

Bottlenecks

[Figure: Connected Components on Mac Mini with SSD; runtime (0-2500 s) with 1, 2, and 4 threads, broken into Disk IO, Graph construction, and Exec. updates]

• Cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
– Graph construction requires a lot of random access in RAM; memory bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution

• GraphChi saturates SSD I/O with 2 threads

[Figure: runtime (seconds) vs. number of threads (1, 2, 4), split into Loading and Computation, for Matrix Factorization (ALS, 0-160 s) and Connected Components (0-1000 s)]

104

In-memory vs Disk

105

Experiment Query latency

See thesis for an I/O cost analysis of in/out queries.

106

Example Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S

• Very fast in GraphChi-DB
– Sufficient to query for out-edges
– Parallelizes well: multi-out-edge-query (see the sketch below)

• Can be used for statistical graph analysis
– Sample induced neighborhoods, induced FoF neighborhoods, from the graph
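
A small sketch of why out-edge queries suffice, with a hypothetical outEdges lookup standing in for the real query API: for every vertex of S, fetch its out-edges and keep those whose destination is also in S; every edge with both endpoints in S is then reported exactly once, at its source.

  object InducedSubgraphSketch {
    // Returns the induced subgraph of S as (src, dst) pairs.
    // outEdges is assumed to return the out-neighbors of a vertex (one query each;
    // these lookups can also be issued as one batched multi-out-edge-query).
    def induced(s: Set[Long], outEdges: Long => Seq[Long]): Seq[(Long, Long)] =
      s.toSeq.flatMap { src =>
        outEdges(src).collect { case dst if s.contains(dst) => (src, dst) }
      }
  }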

Vertices / Nodes
• Vertices are partitioned similarly as edges
– Similar "data shards" for columns
• Lookup / update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
– Sparse structure could be supported

[Diagram: the vertex-id range 1 … max-id split into dense intervals of length a: 1…a, a+1…2a, …]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance the shards
– Interval length is a constant a

[Table: original IDs 0, 1, 2, …, 255 map to internal IDs 0, 1, 2, …; original IDs 256, 257, 258, …, 511 map to a, a+1, a+2, …; the next block maps to 2a, 2a+1, …; a sketch of this mapping follows below]
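
A hypothetical reconstruction of the mapping pictured above (the 256-id block size and the formula are read off the slide and are an assumption, not the exact GraphChi-DB code): consecutive blocks of original ids are dealt round-robin to the P intervals of length a, so that low – and often popular – ids spread over all shards.

  object IdMappingSketch {
    def toInternal(origId: Long, p: Int, a: Long): Long = {
      val block    = origId / 256                 // which 256-id block
      val offset   = origId % 256                 // position inside the block
      val interval = block % p                    // round-robin target interval
      interval * a + (block / p) * 256 + offset   // dense position inside that interval
    }
  }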

109

What if we have a Cluster

[Diagram: on each machine the graph and the PSW working memory stay on disk, while RAM holds the state of Computation 1, Computation 2, …]

Trade latency for throughput!

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications

2. … which is caused by a lack of good data available to academics: big graphs with metadata
– Industry co-operation? But problem with reproducibility
– Also, it is hard to ask good questions about graphs (especially with just the structure)

3. Too much focus on performance – more important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

  parfor walk in walks:
    for i = 1 to numsteps:
      vertex = walk.atVertex()
      walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems: "paging" from disk is costly.

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm – reverse thinking:

  ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
      mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
      ForEach walk in mywalks:
        walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk (see the packing sketch below)!
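
One way billions of walk states can be kept in RAM, as a hedged illustration (this particular packing is an assumption for the sketch, not necessarily DrunkardMob's exact encoding): each walk is a single Long holding its source index and its current vertex, so hundreds of millions of walks fit in a few gigabytes.

  object WalkStateSketch {
    // Pack (sourceIndex, currentVertex) into one Long: high 24 bits = source index,
    // low 40 bits = current vertex id. Purely illustrative field widths.
    def pack(sourceIndex: Int, currentVertex: Long): Long =
      (sourceIndex.toLong << 40) | (currentVertex & ((1L << 40) - 1))

    def sourceOf(walk: Long): Int  = (walk >>> 40).toInt
    def vertexOf(walk: Long): Long = walk & ((1L << 40) - 1)

    // Advance a walk by one hop.
    def hop(walk: Long, nextVertex: Long): Long = pack(sourceOf(walk), nextVertex)
  }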

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • "Gauss-Seidel" / Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW & External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM), Ref: O'Neil, Cheng et al. (1996)
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact: "Big Data" Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 40: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

41

In-memory Comparison

bull Total runtime comparison to 1-shard GraphChi with initial load + output write taken into account

GraphChi Mac Mini ndash SSD 790 secs

Ligra (J Shun Blelloch) 40-core Intel E7-8870 15 secs

bull Comparison to state-of-the-artbull - 5 iterations of Pageranks Twitter (15B edges)

However sometimes better algorithm available for in-memory than external memory distributed

Ligra (J Shun Blelloch) 8-core Xeon 5550 230 s + preproc 144 s

PSW ndash inmem version 700 shards (see Appendix)

8-core Xeon 5550 100 s + preproc 210 s

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 41: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

42

Scalability Input Size [SSD]

bull Throughput number of edges processed second

Conclusion the throughput remains roughly constant when graph size is increased

GraphChi with hard-drive is ~ 2x slower than SSD (if computational cost low)

Graph size

000E+00 100E+09 200E+09 300E+09 400E+09 500E+09 600E+09 700E+09000E+00

500E+06

100E+07

150E+07

200E+07

250E+07

domain

twitter-2010

uk-2007-05 uk-union

yahoo-web

PageRank -- throughput (Mac Mini SSD)

Perf

orm

an

ce

Paper scalability of other applications

43

GRAPHCHI-DBNew work

44

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

[Diagram: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state, …) in RAM]

Trade latency for throughput

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications
2. … which is caused by lack of good data available to academics: big graphs with metadata
   – Industry co-operation? But problem with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just structure)
3. Too much focus on performance. More important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

  parfor walk in walks:
      for i = 1 to numsteps:
          vertex = walk.atVertex()
          walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems: "paging" from disk is costly.

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm – reverse thinking:

  ForEach interval p:
      walkSnapshot = getWalksForInterval(p)
      ForEach vertex in interval(p):
          mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
          ForEach walk in mywalks:
              walkManager.addHop(walk, vertex.randomNeighbor())

Note: need to store only the current position of each walk.
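A minimal in-memory sketch of the same idea (illustrative, not the actual GraphChi implementation): only the current position of each walk is stored, and walks are advanced interval by interval. In GraphChi the marked step is where the interval's subgraph is loaded from disk with PSW instead of indexing an in-memory array.

  import scala.util.Random

  object DrunkardMobSketch {
    /** Advance numWalksPerVertex walks per vertex for numSteps hops.
      * outEdges is a plain in-memory adjacency here only to keep the sketch
      * self-contained. Returns the current vertex of every walk. */
    def simulate(outEdges: Array[Array[Int]],
                 numWalksPerVertex: Int,
                 numSteps: Int,
                 intervalSize: Int,
                 rng: Random = new Random(42)): Array[Int] = {
      val n = outEdges.length
      // One Int per walk: the vertex the walk currently sits on.
      val walkPos = Array.tabulate(n * numWalksPerVertex)(w => w % n)

      for (_ <- 0 until numSteps) {
        val next = walkPos.clone()
        for (intervalStart <- 0 until n by intervalSize) {
          val intervalEnd = math.min(intervalStart + intervalSize, n)
          // In GraphChi, interval [intervalStart, intervalEnd) is loaded from disk here (PSW).
          // A real implementation buckets walks by interval instead of scanning all of them.
          for (w <- walkPos.indices) {
            val v = walkPos(w)
            if (v >= intervalStart && v < intervalEnd && outEdges(v).nonEmpty)
              next(w) = outEdges(v)(rng.nextInt(outEdges(v).length))
          }
        }
        Array.copy(next, 0, walkPos, 0, next.length)
      }
      walkPos
    }
  }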

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • "Gauss-Seidel" Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW & External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data & Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref O'Neil, Cheng et al. (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact "Big Data" Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 42: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

43

GRAPHCHI-DB
New work

44

[Diagram: the spectrum from Graph Computation to Graph Queries, covered by GraphChi (Parallel Sliding Windows) and GraphChi-DB (Partitioned Adjacency Lists): online graph updates, batch computation, evolving graphs, incremental computation, DrunkardMob (parallel random walk simulation), GraphChi^2. Applications shown: Triangle Counting, Item-Item Similarity, PageRank, SALSA, HITS, Graph Contraction, Minimum Spanning Forest, k-Core, Shortest Path, Friends-of-Friends, Induced Subgraphs, Neighborhood query, Edge and vertex properties, Link prediction, Weakly Connected Components, Strongly Connected Components, Community Detection, Label Propagation, Multi-BFS, Graph sampling, Matrix Factorization, Loopy Belief Propagation, Co-EM, Graph traversal]

45

Research Questions

• What if there is a lot of metadata associated with edges and vertices?
• How to do graph queries efficiently while retaining computational capabilities?
• How to add edges efficiently to the graph?

Can we design a graph database based on GraphChi?

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problems:
• Poor performance with data >> memory
• No/weak support for analytical computation

2) Relational / key-value databases as graph storage

Problems:
• Large indices
• In-edge / out-edge dilemma
• No/weak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

[Diagram: the vertex-id range 0 … 100, 101 … 700, …, N is split across Shard 1, Shard 2, Shard 3, …, Shard P; each shard is sorted by source_id]

Edge = (src, dst). Partition by dst.

49

Shard Structure (Basic)

Source | Destination
1      | 8
1      | 193
1      | 76420
3      | 12
3      | 872
7      | 193
7      | 212
7      | 89139
…      | …

(edges (src, dst), sorted by src)

50

Shard Structure (Basic)

Edge-array (Destination): 8, 193, 76420, 12, 872, 193, 212, 89139, …

Pointer-array (Source → file offset): 1 → 0, 3 → 3, 7 → 5, …

Compressed Sparse Row (CSR)

Problem 1: How to find the in-edges of a vertex quickly?
Note: we know the shard, but the edges are in random order.
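As a concrete illustration, a minimal CSR-style lookup over the two arrays above (a sketch assuming the pointer-array holds (source id, offset) pairs sorted by source id; this is not the actual GraphChi-DB code).

  object CsrShard {
    /** pointerArray(i) = (sourceId, offset of its first edge in edgeArray), sorted by sourceId.
      * edgeArray holds destination ids, grouped by source. */
    final case class Shard(pointerArray: Array[(Long, Int)], edgeArray: Array[Long]) {
      /** Out-edges of `src` stored in this shard (empty if none). */
      def outEdges(src: Long): Array[Long] = {
        val i = pointerArray.indexWhere(_._1 == src)   // binary search in practice
        if (i < 0) Array.empty[Long]
        else {
          val start = pointerArray(i)._2
          val end   = if (i + 1 < pointerArray.length) pointerArray(i + 1)._2 else edgeArray.length
          edgeArray.slice(start, end)
        }
      }
    }
  }

With the example data above, Shard(Array((1L, 0), (3L, 3), (7L, 5)), Array(8L, 193L, 76420L, 12L, 872L, 193L, 212L, 89139L)).outEdges(3L) returns Array(12, 872).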

51

PAL In-edge Linkage

Edge-array (Destination): 8, 193, 76420, 12, 872, 193, 212, 89139, …
Pointer-array (Source → file offset): 1 → 0, 3 → 3, 7 → 5, …

52

PAL In-edge Linkage

Edge-array (Destination, Link to next in-edge): (8, 3339), (193, 3), (76420, 1092), (12, 289), (872, 40), (193, 2002), (212, 12), (89139, 22), …

Pointer-array (Source → file offset): 1 → 0, 3 → 3, 7 → 5, …

+ Index to the first in-edge for each vertex in the interval

Augmented linked list for in-edges.

Problem 2: How to find out-edges quickly?
Note: sorted inside a shard, but partitioned across all shards.
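A minimal sketch of following such an in-edge chain (the firstInEdge index and the -1 sentinel are illustrative assumptions):

  object InEdgeChain {
    /** edgeSrc and nextLink are parallel arrays with one entry per edge in the shard:
      * nextLink(e) is the position of the next in-edge of the same destination, or -1.
      * firstInEdge maps a vertex to the position of its first in-edge in this shard. */
    def inNeighbors(v: Long,
                    firstInEdge: Map[Long, Int],
                    edgeSrc: Array[Long],
                    nextLink: Array[Int]): List[Long] = {
      var e   = firstInEdge.getOrElse(v, -1)
      var acc = List.empty[Long]
      while (e != -1) {          // follow the per-vertex linked list of in-edges
        acc = edgeSrc(e) :: acc
        e = nextLink(e)
      }
      acc
    }
  }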

53

Pointer-array (Source → file offset): 1 → 0, 3 → 3, 7 → 5, …

PAL Out-edge Queries

Problem: the pointer-array can be big -- O(V). Binary search on disk is slow.

[Diagram: edge-array of (Destination, Next-in-offset) pairs as above; sparse pointer-array samples at offsets 256, 512, 1024, …]

Option 1: Sparse index (in memory)
Option 2: Delta-coding with unary code (Elias-Gamma). Completely in-memory.
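For reference, a small self-contained sketch of Elias-Gamma coding of positive integers (a generic implementation of the classic code; the deltas between successive pointer-array offsets are what would be encoded, but this is not GraphChi-DB's code):

  import scala.collection.mutable.ArrayBuffer

  object EliasGamma {
    /** Encode each n >= 1 as floor(log2 n) zero bits followed by n in binary. */
    def encode(values: Seq[Int]): Vector[Boolean] = {
      val bits = ArrayBuffer.empty[Boolean]
      for (n <- values) {
        require(n >= 1, "Elias-Gamma encodes positive integers")
        val width = 31 - Integer.numberOfLeadingZeros(n)   // floor(log2 n)
        for (_ <- 0 until width) bits += false
        for (i <- width to 0 by -1) bits += (((n >> i) & 1) == 1)
      }
      bits.toVector
    }

    def decode(bits: IndexedSeq[Boolean]): Vector[Int] = {
      val out = ArrayBuffer.empty[Int]
      var i = 0
      while (i < bits.length) {
        var width = 0
        while (!bits(i)) { width += 1; i += 1 }             // count leading zeros
        var n = 0
        for (_ <- 0 to width) { n = (n << 1) | (if (bits(i)) 1 else 0); i += 1 }
        out += n
      }
      out.toVector
    }
  }

For example, encode(Seq(5)) yields the bits 0,0,1,0,1 and decode recovers 5.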

54

Experiment Indices

Median latency, Twitter graph, 1.5B edges

55

Queries: IO costs

In-edge query: only one shard.
Out-edge query: each shard that has edges.

Trade-off: more shards → better locality for in-edge queries, worse for out-edge queries.

Edge Data & Searches

Shard X: adjacency; 'weight' [float]; 'timestamp' [long]; 'belief' (factor) -- one column per edge property, addressed by the edge's position (i).

No foreign key required to find edge data (fixed size).
Note: vertex values are stored similarly.
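Because each column stores fixed-size values in edge order, the byte offset of edge i's value is simply i times the value size. A minimal sketch (illustrative file layout, not the actual GraphChi-DB format):

  import java.io.RandomAccessFile

  object EdgeColumn {
    /** Read the float value of edge i from a fixed-size column file.
      * The edge's position in the shard is the implicit key: no foreign key needed. */
    def readWeight(columnFile: RandomAccessFile, i: Long): Float = {
      columnFile.seek(i * 4L)      // 4 bytes per float value
      columnFile.readFloat()
    }
  }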

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading the existing shard from disk. Each edge is rewritten on every merge.

Does not scale: number of rewrites ≈ data size / buffer size = O(E).

60

Log-Structured Merge-tree

(LSM) Ref: O'Neil, Cheng et al. (1996)

[Diagram: LEVEL 1 (youngest): top shards with in-memory buffers, one per interval (interval 1, interval 2, …, interval P), receiving new edges; LEVEL 2: on-disk shards covering intervals 1-4, …, (P-4)-P; LEVEL 3 (oldest): shards covering intervals 1-16, …; merges flow downstream]

In-edge query: one shard on each level.
Out-edge query: all shards.
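A minimal sketch of the downstream-merge idea (level sizes and the merge trigger are illustrative assumptions, and real shards live on disk rather than in vectors): new edges go to the youngest level; when a level fills up, its edges are merged into the next, older level.

  object LsmSketch {
    type Edge = (Long, Long)                        // (src, dst)

    /** levels(0) is the youngest (the in-memory buffer); older levels are kept
      * sorted by destination, like shards. maxSize grows with the level so most
      * edges are rewritten only a few times. */
    final class Lsm(maxSize: Int => Int, numLevels: Int) {
      private val levels = Array.fill(numLevels)(Vector.empty[Edge])

      def insert(e: Edge): Unit = {
        levels(0) = levels(0) :+ e
        var k = 0
        while (k < numLevels - 1 && levels(k).length > maxSize(k)) {
          // Downstream merge: fold the full level into the next (older) one.
          levels(k + 1) = (levels(k + 1) ++ levels(k)).sortBy(x => (x._2, x._1))
          levels(k) = Vector.empty
          k += 1
        }
      }

      /** An in-edge query touches one "shard" per level; an out-edge query scans all. */
      def inEdges(v: Long): Seq[Edge] = levels.toSeq.flatMap(_.filter(_._2 == v))
    }
  }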

61

Experiment Ingest

[Chart: edges ingested (0 to 1.5E+09) vs. time in seconds (0-16,000), for GraphChi-DB with LSM and GraphChi-No-LSM]

62

Advantages of PAL

• Only sparse and implicit indices
  – Pointer-array usually fits in RAM with Elias-Gamma → small database size
• Columnar data model
  – Load only the data you need
  – Graph structure is separate from data
  – Property graph model
• Great insertion throughput with LSM
  – Tree can be adjusted to match the workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

• Written in Scala
• Queries & Computation
• Online database

All experiments shown in this talk done on a Mac Mini (8 GB RAM, SSD).

Source code and examples: http://github.com/graphchi

65

Comparison Database Size

Baseline: 4 + 4 bytes / edge

[Bar chart: database file size (0-70) for MySQL (data + indices), Neo4j, GraphChi-DB, and the BASELINE; twitter-2010 graph, 1.5B edges]

66

Comparison Ingest

System               | Time to ingest 1.5B edges
GraphChi-DB (ONLINE) | 1 hour 45 mins
Neo4j (batch)        | 45 hours
MySQL (batch)        | 3 hours 30 minutes (including index creation)

If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes.

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

[Bar charts, friends-of-friends query latency in milliseconds:
  Small graph (68M edges), 50th percentile: GraphChi-DB 0.379, Neo4j 0.127
  Small graph (68M edges), 99th percentile: GraphChi-DB 6.653, Neo4j 8.078
  Big graph (1.5B edges), 50th percentile: GraphChi-DB 22.4, Neo4j 75.98, MySQL 59
  Big graph (1.5B edges), 99th percentile: GraphChi-DB 1264, Neo4j 1631, MySQL 4776]

Latency percentiles over 100K random queries.

68

LinkBench Online Graph DB Benchmark by Facebook

• Concurrent read/write workload
  – But only single-hop queries ("friends")
  – 8 different operations, mixed workload
  – Best performance with 64 parallel threads
• Each edge and vertex has
  – Version, timestamp, type, random string payload

                         | GraphChi-DB (Mac Mini) | MySQL + FB patch, server, SSD-array, 144 GB RAM
Edge-update (95p)        | 22 ms                  | 25 ms
Edge-get-neighbors (95p) | 18 ms                  | 9 ms
Avg throughput           | 2,487 req / s          | 11,029 req / s
Database size            | 350 GB                 | 1.4 TB

See full results in the thesis.

69

Summary of Experiments

• Efficient for mixed read/write workload
  – See Facebook LinkBench experiments in thesis
  – LSM-tree trades off read performance (but can adjust)
• State-of-the-art performance for graphs that are much larger than RAM
  – Neo4J's linked-list data structure is good for RAM
  – DEX performs poorly in practice

More experiments in the thesis.

70

Discussion

Greater Impact / Hindsight

Future Research Questions

71

GREATER IMPACT

72

Impact: "Big Data" Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)
  – Two major direct descendant papers in top conferences:
    • X-Stream: SOSP 2013
    • TurboGraph: KDD 2013
• Challenging the mainstream
  – You can do a lot on just a PC: focus on the right data structures, computational models

73

Impact Users

• GraphChi's variants (C++, Java, -DB) have gained a lot of users
  – Currently ~50 unique visitors / day
• Enables 'everyone' to tackle big graph problems
  – Especially the recommender toolkit (by Danny Bickson) has been very popular
  – Typical users: students, non-systems researchers, small companies…

74

Impact Users (cont)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research - it is too expensive.

Using GraphChi I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research. ()"

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models

1. Changing value of an edge (both in- and out-)
2. Computation processes the whole or most of the graph on each iteration
3. Random access to all of a vertex's edges
   – Vertex-centric vs. edge-centric
4. Async / Gauss-Seidel execution

[Diagram: edge between vertices u and v]

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
  – Or dynamic scheduling (such as Splash BP)
• High-diameter graphs, such as planar graphs
  – Unless the computation itself has short-range interactions
• Very large number of iterations
  – Neural networks
  – LDA with Collapsed Gibbs sampling
• No support for implicit graph structure

+ Single PC performance is limited

78

[Diagram, repeated from slide 44: the Graph Computation ↔ Graph Queries spectrum covered by GraphChi (Parallel Sliding Windows) and GraphChi-DB (Partitioned Adjacency Lists), spanning batch, evolving-graph and incremental computation, online graph updates, DrunkardMob and GraphChi^2, with the same applications listed there]

Versatility of PSW and PAL

79

Future Research Directions

• Distributed setting
  1. Distributed PSW (one shard / node)?
     1. PSW is inherently sequential
     2. Low bandwidth in the Cloud
  2. Co-operating GraphChi(-DB)'s connected with a Parameter Server
• New graph programming models and tools
  – Vertex-centric programming sometimes too local: for example, two-hop interactions and many traversals are cumbersome
  – Abstractions for learning graph structure? Implicit graphs?
  – Hard to debug, especially async. Better tools needed
• Graph-aware optimizations to GraphChi-DB
  – Buffer management
  – Smart caching
  – Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The IO performance of PSW is only weakly affected by the amount of RAM
  – Good: works with very little memory
  – Bad: does not benefit from more memory
• Simple trick: cache some data
• Many graph algorithms have O(V) state
  – Update function accesses neighbor vertex state
    • Standard PSW: 'broadcast' vertex value via edges
    • Semi-external: store vertex values in memory
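A minimal sketch of the semi-external idea (illustrative, not the GraphChi API): keep one O(V) array of vertex values in RAM and stream the edges, so neighbor values are read directly from the array instead of being broadcast along edges.

  object SemiExternalPagerank {
    /** One Pagerank iteration with O(V) state in RAM; edges arrive as a stream,
      * e.g. read sequentially from disk shards. Each edge (src, dst) contributes
      * src's rank share to dst. */
    def iterate(numVertices: Int,
                outDegree: Array[Int],
                ranks: Array[Double],
                edgeStream: Iterator[(Int, Int)],
                d: Double = 0.85): Array[Double] = {
      val next = Array.fill(numVertices)((1.0 - d) / numVertices)
      for ((src, dst) <- edgeStream)
        if (outDegree(src) > 0)
          next(dst) += d * ranks(src) / outDegree(src)
      next
    }
  }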

82

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
  – But not enough to store the whole graph

[Diagram: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state, …) in RAM]

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
  – Store billions of random walk states in RAM
• Multiple Breadth-First-Searches
  – Analyze neighborhood sizes by starting hundreds of random BFSes
• Compute in parallel many different recommender algorithms (or with different parameterizations)
  – See Mayank Mohta, Shu-Hao Yu's Master's project

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin). UAI 2010.
Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors). VLDB 2012.
GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch). OSDI 2012.
DrunkardMob: Billions of Random Walks on Just a PC. ACM RecSys 2013.
Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch). SEA 2014.
GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin). (submitted, arXiv)
Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin). ICML 2011.

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external memory graph computation: GraphChi
• Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database: GraphChi-DB
• Proposed DrunkardMob for simulating billions of random walks in parallel
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms: new approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                  | GraphChi (40 Mac Minis) | PowerGraph (64 EC2 cc1.4xlarge)
Investments       | 67,320 $                | -
Operating costs:
  Per node / hour | 0.03 $                  | 1.30 $
  Cluster / hour  | 1.19 $                  | 52.00 $
  Daily           | 28.56 $                 | 1,248.00 $

It takes about 56 days to recoup the Mac Mini investments.

Assumptions:
- Mac Mini: 85 W (typical servers 500-1000 W)
- Most expensive US energy: 35c / KWh
- Equal-throughput configurations (based on OSDI'12)

89

PSW for In-memory Computation

• External memory setting
  – Slow memory = hard disk / SSD
  – Fast memory = RAM
• In-memory
  – Slow = RAM
  – Fast = CPU caches

[Diagram: CPU, small (fast) memory of size M, large (slow) memory of size 'unlimited', block transfers of size B]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra*, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

* Julian Shun, Guy Blelloch, PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
  – But needs twice the amount of space
• Async / G-S helps with high-diameter graphs
• Some algorithms converge much better asynchronously
  – Loopy BP: see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)
• Asynchronous sometimes difficult to reason about and debug
• Asynchronous can be used to implement BSP

IO Complexity

• See the paper for theoretical analysis in the Aggarwal-Vitter IO model
  – Worst-case only 2x best-case
• Intuition:

[Diagram: vertex range 1 … |V| split into interval(1), interval(2), …, interval(P) with shard(1), shard(2), …, shard(P); example vertices v1, v2]

An edge whose endpoints lie in the same interval is loaded from disk only once per iteration; an edge spanning two intervals is loaded twice per iteration.
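Restating the slide's claim in symbols, counting only edge loads and ignoring seek costs (B is the block size):

  \[
    \frac{|E|}{B} \;\le\; \text{(block reads per iteration)} \;\le\; \frac{2|E|}{B}
  \]

since each edge is read either once (endpoints in the same interval) or twice (endpoints in different intervals), which is why the worst case is only 2x the best case.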

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter: otherwise they would require too many iterations
• Per-iteration cost not much affected
• Can we optimize partitioning?
  – Could help thanks to Gauss-Seidel (faster convergence inside "groups"), topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression Would it help

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
  – But they require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)
• Compression problematic for evolving graphs and associated data
• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al. 2011)

Page 44: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

45

Research Questions

bull What if there is lot of metadata associated with edges and vertices

bull How to do graph queries efficiently while retaining computational capabilities

bull How to add edges efficiently to the graph

Can we design a graph database based on GraphChi

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison: Friends-of-Friends Query

See thesis for shortest-path comparison.

[Figure: four bar charts of friends-of-friends query latency in milliseconds, 50th and 99th percentile, on a small graph (68M edges; GraphChi-DB vs. Neo4j) and on a big graph (1.5B edges; GraphChi-DB vs. Neo4j vs. MySQL). Latency percentiles are over 100K random queries; median latencies on the small graph are fractions of a millisecond, and 99th-percentile latencies on the big graph reach several seconds.]

68

LinkBench: Online Graph DB Benchmark by Facebook

• Concurrent read/write workload
  – But only single-hop queries ("friends")
  – 8 different operations, mixed workload
  – Best performance with 64 parallel threads
• Each edge and vertex has
  – Version, timestamp, type, random string payload

                             GraphChi-DB (Mac Mini)   MySQL + FB patch, server, SSD-array, 144 GB RAM
  Edge-update (95p)          22 ms                    25 ms
  Edge-get-neighbors (95p)   18 ms                    9 ms
  Avg throughput             2,487 req/s              11,029 req/s
  Database size              350 GB                   1.4 TB

See full results in the thesis.

69

Summary of Experiments

• Efficient for mixed read/write workload
  – See Facebook LinkBench experiments in thesis
  – LSM-tree trade-off: read performance (but can adjust)
• State-of-the-art performance for graphs that are much larger than RAM
  – Neo4J's linked-list data structure is good for RAM
  – DEX performs poorly in practice

More experiments in the thesis.

70

Discussion

Greater Impact

Hindsight

Future Research Questions

71

GREATER IMPACT

72

Impact: "Big Data" Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)
  – Two major direct descendant papers in top conferences:
    • X-Stream: SOSP 2013
    • TurboGraph: KDD 2013
• Challenging the mainstream
  – You can do a lot on just a PC: focus on the right data structures and computational models

73

Impact Users

• GraphChi's (C++, Java, -DB) have gained a lot of users
  – Currently ~50 unique visitors / day
• Enables 'everyone' to tackle big graph problems
  – Especially the recommender toolkit (by Danny Bickson) has been very popular
  – Typical users: students, non-systems researchers, small companies…

74

Impact Users (cont)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research - it is too expensive.

Using GraphChi, I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research."

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for?

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models

1. Changing the value of an edge (both in- and out-edges)
2. Computation: process the whole graph, or most of it, on each iteration
3. Random access to all of a vertex's edges
   – Vertex-centric vs. edge-centric
4. Async / Gauss-Seidel execution

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
  – Or dynamic scheduling (such as Splash BP)
• High diameter graphs, such as planar graphs
  – Unless the computation itself has short-range interactions
• Very large number of iterations
  – Neural networks
  – LDA with Collapsed Gibbs sampling
• No support for implicit graph structure

+ Single PC performance is limited

78

Versatility of PSW and PAL

[Diagram: the two systems and the applications implemented on top of them]

GraphChi (Parallel Sliding Windows)
GraphChi-DB (Partitioned Adjacency Lists)
Online graph updates
Batch comp.
Evolving graph
Incremental comp.
DrunkardMob: Parallel Random walk simulation
GraphChi^2
Triangle Counting / Item-Item Similarity
PageRank / SALSA / HITS
Graph Contraction / Minimum Spanning Forest
k-Core
Graph Computation
Shortest Path
Friends-of-Friends
Induced Subgraphs
Neighborhood query
Edge and vertex properties
Link prediction
Weakly Connected Components / Strongly Connected Components
Community Detection / Label Propagation
Multi-BFS
Graph sampling
Graph Queries
Matrix Factorization
Loopy Belief Propagation / Co-EM
Graph traversal

79

Future Research Directions

• Distributed setting
  1. Distributed PSW (one shard / node)
     1. PSW is inherently sequential
     2. Low bandwidth in the Cloud
  2. Co-operating GraphChi(-DB)'s connected with a Parameter Server
• New graph programming models and tools
  – Vertex-centric programming is sometimes too local: for example, two-hop interactions and many traversals are cumbersome
  – Abstractions for learning graph structure; implicit graphs
  – Hard to debug, especially async; better tools needed
• Graph-aware optimizations to GraphChi-DB
  – Buffer management
  – Smart caching
  – Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM
  – Good: works with very little memory
  – Bad: does not benefit from more memory
• Simple trick: cache some data
• Many graph algorithms have O(V) state
  – Update function accesses neighbor vertex state
    • Standard PSW: 'broadcast' vertex value via edges
    • Semi-external: store vertex values in memory

82
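As an illustration of the semi-external trick (assumed names and types, not the GraphChi API), the O(V) state can live in a plain in-memory array while edges are streamed from disk on every pass, so the update never has to 'broadcast' values through edges:

  // Illustrative semi-external sketch: label propagation for connected
  // components, with the O(V) state in RAM and edges streamed per pass.
  object SemiExternalSketch {
    final case class Edge(src: Int, dst: Int)

    // `edgeScan()` returns a fresh streaming pass over the on-disk edges.
    def connectedComponents(numVertices: Int, edgeScan: () => Iterator[Edge]): Array[Int] = {
      val label = Array.tabulate(numVertices)(identity) // in-memory vertex values
      var changed = true
      while (changed) {
        changed = false
        for (e <- edgeScan()) {                          // stream edges; no data travels via edges
          val m = math.min(label(e.src), label(e.dst))
          if (label(e.src) != m) { label(e.src) = m; changed = true }
          if (label(e.dst) != m) { label(e.dst) = m; changed = true }
        }
      }
      label
    }
  }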

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
  – But not enough to store the whole graph

[Figure: the graph working memory (PSW) stays on disk, while the computational state (Computation 1 state, Computation 2 state, ...) is kept in RAM]

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
  – Store billions of random walk states in RAM
• Multiple Breadth-First-Searches
  – Analyze neighborhood sizes by starting hundreds of random BFSes
• Compute in parallel many different recommender algorithms (or with different parameterizations)
  – See Mayank Mohta, Shu-Hao Yu's Master's project

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin) – UAI 2010
Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (-- same --) – VLDB 2012
GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch) – OSDI 2012
DrunkardMob: Billions of Random Walks on Just a PC – ACM RecSys 2013
Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch) – SEA 2014
GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin) – (submitted, arXiv)
Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin) – ICML 2011

First author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external memory graph computation → GraphChi
• Extended PSW to design Partitioned Adjacency Lists, used to build a scalable graph database → GraphChi-DB
• Proposed DrunkardMob for simulating billions of random walks in parallel
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms → a new approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                     GraphChi (40 Mac Minis)   PowerGraph (64 EC2 cc1.4xlarge)
  Investments        67,320 $                  -
  Operating costs:
  Per node / hour    0.03 $                    1.30 $
  Cluster / hour     1.19 $                    52.00 $
  Daily              28.56 $                   1,248.00 $

It takes about 56 days to recoup the Mac Mini investments.

Assumptions:
- Mac Mini: 85 W (typical servers: 500-1000 W)
- Most expensive US energy: 35c / kWh

Equal throughput configurations (based on OSDI'12).

89
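The payback estimate follows directly from the table, since the daily operating-cost difference has to amortize the hardware investment:

  \frac{67{,}320\ \$}{(1{,}248.00 - 28.56)\ \$/\text{day}} \approx \frac{67{,}320}{1{,}219.44} \approx 55.2\ \text{days} \approx 56\ \text{days}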

PSW for In-memory Computation

• External memory setting
  – Slow memory = hard disk / SSD
  – Fast memory = RAM
• In-memory
  – Slow = RAM
  – Fast = CPU caches

[Figure: the classic external-memory model: a small (fast) memory of size M, a large (slow) memory of 'unlimited' size, and block transfers of size B between them, driven by the CPU]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW slightly outperforms Ligra (Julian Shun, Guy Blelloch, PPoPP 2013), the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
  – But needs twice the amount of space
• Async / G-S helps with high diameter graphs
• Some algorithms converge much better asynchronously
  – Loopy BP: see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)
• Asynchronous is sometimes difficult to reason about and debug
• Asynchronous can be used to implement BSP

IO Complexity

• See the paper for theoretical analysis in Aggarwal-Vitter's I/O model
  – Worst case is only 2x the best case
• Intuition:

[Figure: the vertex range 1..|V| split into interval(1), interval(2), ..., interval(P), each with its shard(1), shard(2), ..., shard(P); example vertices v1 and v2]

An edge whose endpoints are in the same interval is loaded from disk only once per iteration; an edge spanning two intervals is loaded twice per iteration.

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise too many iterations would be required
• Per-iteration cost is not much affected
• Can we optimize partitioning?
  – Could help, thanks to Gauss-Seidel (faster convergence inside "groups"); topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
  – But they require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)
• Compression is problematic for evolving graphs and associated data
• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• 1990s and 2000s saw interest in object-oriented and graph databases
  – GOOD, GraphDB, HyperGraphDB…
  – Focus was on modeling graph storage on top of a relational DB or key-value store
• RDF databases
  – Most do not use graph storage but store triples as relations + use indexing
• Modern solutions have proposed graph-specific storage
  – Neo4j: doubly linked list
  – TurboGraph: adjacency list chopped into pages
  – DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont.)

• GraphChi load time 9 hours; FB's 12 hours
• GraphChi database about 250 GB; FB > 1.4 terabytes
  – However, about 100 GB is explained by different variable data (payload) size
• Facebook: MySQL via JDBC; GraphChi: embedded
  – But MySQL is native code, GraphChi-DB is Scala (JVM)
• Important: CPU bound; bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance / size of DB

[Figure: requests / sec (0 to 10,000) vs. number of nodes (1.0e7, 1.0e8, 2.0e8, 1.0e9)]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory-extension [SSDAlloc, NSDI'11]
   → Too many small objects; need millions / sec
2. Compress the graph structure to fit into RAM [WebGraph framework]
   → Associated values do not compress well, and are mutated
3. Cluster the graph and handle each cluster separately in RAM
   → Expensive; the number of inter-cluster edges is big
4. Caching of hot nodes
   → Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks → parallel I/O

[Figure: seconds / iteration (0 to 3000) for Pagerank and Connected components with 1, 2, and 3 hard drives]

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

[Figure: Connected Components on Mac Mini / SSD: runtime (0 to 2500 seconds) with 1, 2, and 4 threads, broken into Disk IO, Graph construction, and Exec. updates]

• Cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
  – Graph construction requires a lot of random access in RAM → memory bandwidth becomes a bottleneck

Bottlenecks: Multicore

Experiment on MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD I/O with 2 threads

[Figure: runtime in seconds vs. number of threads (1, 2, 4), split into Loading and Computation, for Matrix Factorization (ALS) (0 to 160 s) and Connected Components (0 to 1000 s)]

104

In-memory vs Disk

105

Experiment: Query latency

See thesis for I/O cost analysis of in/out queries.

106

Example: Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S
• Very fast in GraphChi-DB
  – Sufficient to query for out-edges
  – Parallelizes well: multi-out-edge-query
• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods from the graph
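Since out-edge queries suffice, the query routine itself is tiny. In the sketch below, queryOutEdges is an assumed stand-in for the database's (parallelizable) out-edge query:

  // Illustrative induced-subgraph query built only from out-edge queries.
  object InducedSubgraphSketch {
    def inducedSubgraph(s: Set[Long], queryOutEdges: Long => Seq[Long]): Seq[(Long, Long)] =
      s.toSeq.flatMap { v =>
        // Keep only the out-edges whose other endpoint is also in S.
        queryOutEdges(v).collect { case dst if s.contains(dst) => (v, dst) }
      }
  }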

Vertices / Nodes

• Vertices are partitioned similarly to edges
  – Similar "data shards" for columns
• Lookup/update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
  – Sparse structure could be supported

[Figure: internal vertex-id ranges 1..a, a+1..2a, ..., up to Max-id]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance shards
  – Interval length is a constant a

[Figure: example mapping from original vertex IDs (0, 1, 2, ..., 255; 256, 257, 258, ..., 511; ...) to internal IDs (0, 1, 2, ...; a, a+1, a+2, ...; 2a, 2a+1, ...)]

109

What if we have a Cluster

[Figure: each machine keeps the graph working memory (PSW) on disk and the computational state (Computation 1 state, Computation 2 state, ...) in RAM]

Trade latency for throughput.

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications
2. … which is caused by the lack of good data available for academics: big graphs with metadata
   – Industry co-operation? But problems with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just the structure)
3. Too much focus on performance. More important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

  parfor walk in walks:
    for i = 1 to numsteps:
      vertex = walk.atVertex()
      walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

[Twitter network visualization by Akshay Java, 2009]

Distributed graph systems: each hop across a partition boundary is costly.
Disk-based "single-machine" graph systems: "paging" from disk is costly.
(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
  – Reverse thinking:

  ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
      mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
      ForEach walk in mywalks:
        walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk.
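A compact walk representation is what makes "only the current position" cheap to store for billions of walks. The sketch below is an illustrative assumption (not the DrunkardMob implementation): each walk is packed into one Long holding its source and current vertex, and walks are bucketed by the interval of their current vertex so they can be advanced while that interval's subgraph is loaded.

  // Illustrative compact walk encoding, bucketed by interval (assumed design).
  object WalkEncodingSketch {
    // Pack (sourceId, currentVertexId) into one Long: 32 bits each.
    def encode(source: Int, current: Int): Long =
      (source.toLong << 32) | (current.toLong & 0xffffffffL)

    def sourceOf(walk: Long): Int  = (walk >>> 32).toInt
    def currentOf(walk: Long): Int = walk.toInt

    // Bucket walks by the interval of their current vertex, so one interval's
    // walks can be processed while that interval's subgraph is in memory.
    def bucketByInterval(walks: Array[Long], intervalLength: Int): Map[Int, Array[Long]] =
      walks.groupBy(w => currentOf(w) / intervalLength)
  }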

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • "Gauss-Seidel" Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW & External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM), Ref: O'Neil, Cheng et al. (1996)
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact: "Big Data" Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 45: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

46

Existing Graph Database Solutions

1) Specialized single-machine graph databases

Problemsbull Poor performance with data gtgt memorybull Noweak support for analytical computation

2) Relational key-value databases as graph storage

Problemsbull Large indicesbull In-edge out-edge dilemmabull Noweak support for analytical computation

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 46: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

47

PARTITIONED ADJACENCY LISTS (PAL) DATA STRUCTURE

Our solution

48

Review Edges in Shards

Shard 1

sort

ed b

y so

urce

_id

Shard 2 Shard 3 Shard PShard 1

0 100 101 700 N

Edge = (src dst)Partition by dst

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

[Charts: runtime (seconds) vs. number of threads (1, 2, 4), split into Loading and Computation — Matrix Factorization (ALS), y-axis 0–160, and Connected Components, y-axis 0–1000.]

104

In-memory vs Disk

105

Experiment: Query latency

See thesis for I/O cost analysis of in/out queries

106

Example: Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S

• Very fast in GraphChi-DB
  – Sufficient to query for out-edges
  – Parallelizes well: multi-out-edge-query

• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods from the graph
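A minimal sketch of the query idea (assuming a hypothetical out-edge query function outEdges(v); this is not the GraphChi-DB API): issue one out-edge query per vertex in S and keep only edges whose destination is also in S.

  // Hypothetical sketch in Scala: induced subgraph from out-edge queries only.
  def inducedSubgraph(s: Set[Long], outEdges: Long => Seq[Long]): Seq[(Long, Long)] =
    s.toSeq.flatMap { src =>
      outEdges(src)                          // one out-edge query per vertex in S
        .filter(dst => s.contains(dst))      // keep edges with both endpoints in S
        .map(dst => (src, dst))
    }

Because the per-vertex queries are independent, they can be issued in parallel (the multi-out-edge-query above).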

Vertices / Nodes
• Vertices are partitioned similarly as edges
  – Similar "data shards" for columns
• Lookup/update of vertex data is O(1)
• No merge tree here. Vertex files are "dense"
  – Sparse structure could be supported
[Figure: vertex IDs 1 … a, a+1 … 2a, …, Max-id split into equal-length intervals.]
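A minimal sketch of why the lookup is O(1) (fixed-size values in a dense per-column file; the file layout below is an assumption for illustration, not the actual GraphChi-DB format):

  import java.io.RandomAccessFile

  // Hypothetical dense vertex column: the value of internal vertex id v
  // lives at byte offset v * recordSize, so lookup/update needs no index.
  class DenseVertexColumn(path: String, recordSize: Int) {
    private val file = new RandomAccessFile(path, "rw")

    def read(v: Long): Array[Byte] = {
      val buf = new Array[Byte](recordSize)
      file.seek(v * recordSize)
      file.readFully(buf)
      buf
    }

    def write(v: Long, value: Array[Byte]): Unit = {
      require(value.length == recordSize)
      file.seek(v * recordSize)
      file.write(value)
    }
  }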

ID-mapping

• Vertex IDs mapped to internal IDs to balance shards
  – Interval length constant a

[Figure: ID-mapping table — internal IDs 0, 1, 2, …; a, a+1, a+2, …; 2a, 2a+1, …; shown against original IDs 0, 1, 2, …, 255 and 256, 257, 258, …, 511.]
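One way to read the mapping (a sketch assuming consecutive original IDs are striped round-robin over P intervals of length a; P and a below are example values, and the thesis defines the exact bijection):

  // Hypothetical ID-mapping sketch: spread consecutive original IDs evenly over
  // the intervals so that every shard gets an equal share of the vertices.
  object IdMapping {
    val P: Long = 256        // number of intervals (example value)
    val a: Long = 1 << 20    // interval length constant (example value)

    def toInternal(orig: Long): Long   = (orig % P) * a + (orig / P)
    def toOriginal(intern: Long): Long = (intern % a) * P + (intern / a)
  }

With this reading, original IDs 0, 1, 2, … map to internal IDs 0, a, 2a, …, and original IDs 256, 257, 258, … map to 1, a+1, 2a+1, …, matching the table above.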

109

What if we have a Cluster

[Diagram: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state) in RAM.]

Trade latency for throughput

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications

2. … which is caused by lack of good data available for the academics: big graphs with metadata
  – Industry co-operation? But problem with reproducibility
  – Also, it is hard to ask good questions about graphs (especially with just structure)

3. Too much focus on performance. More important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

  parfor walk in walks:
      for i = 1 to numsteps:
          vertex = walk.atVertex()
          walk.takeStep(vertex.randomNeighbor())
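A runnable sketch of the same loop (assuming an in-memory adjacency-list graph; hypothetical code, not the DrunkardMob implementation):

  import scala.util.Random

  // graph(v) is the neighbor array of vertex v; returns the final vertex of each walk.
  def simulateWalks(graph: Array[Array[Int]], sources: Array[Int], numSteps: Int): Array[Int] =
    sources.map { source =>          // walks are independent, so this map can be parallelized (parfor)
      var vertex = source
      for (_ <- 1 to numSteps) {
        val neighbors = graph(vertex)
        if (neighbors.nonEmpty)
          vertex = neighbors(Random.nextInt(neighbors.length))   // one random hop
      }
      vertex
    }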

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java 2009

Distributed graph systems:
- Each hop across a partition boundary is costly

Disk-based "single-machine" graph systems:
- "Paging" from disk is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
  – Reverse thinking

  ForEach interval p:
      walkSnapshot = getWalksForInterval(p)
      ForEach vertex in interval(p):
          mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
          ForEach walk in mywalks:
              walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk
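A sketch of the same loop in code (hypothetical types and method names; the real DrunkardMob uses a much more compact walk encoding):

  import scala.util.Random
  import scala.collection.mutable.ArrayBuffer

  case class Walk(id: Long, atVertex: Int)

  // Walks are bucketed by the interval of their current vertex; a PSW pass over an
  // interval advances the walks currently in that interval while its sub-graph is in memory.
  class WalkManager(numIntervals: Int, intervalOf: Int => Int) {
    private val buckets = Array.fill(numIntervals)(ArrayBuffer.empty[Walk])

    def addHop(walk: Walk, nextVertex: Int): Unit =
      buckets(intervalOf(nextVertex)) += walk.copy(atVertex = nextVertex)

    def walksForInterval(p: Int): Seq[Walk] = {
      val snapshot = buckets(p).toVector   // snapshot and clear; the bucket re-fills with new hops
      buckets(p).clear()
      snapshot
    }
  }

  def advanceAllWalks(manager: WalkManager, numIntervals: Int,
                      neighborsOf: Int => Array[Int]): Unit =
    for (p <- 0 until numIntervals; walk <- manager.walksForInterval(p)) {
      val nbrs = neighborsOf(walk.atVertex)
      if (nbrs.nonEmpty)
        manager.addHop(walk, nbrs(Random.nextInt(nbrs.length)))   // register the random hop; a walk
      // landing in a later interval may advance again within this same pass
    }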

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • "Gauss-Seidel" Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW & External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data & Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref: O'Neil, Cheng et al. (1996)
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact: "Big Data" Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 48: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

49

Shard Structure (Basic)

Source Destination

1 8

1 193

1 76420

3 12

3 872

7 193

7 212

7 89139

hellip hellip src dst

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 49: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

50

Shard Structure (Basic)

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Compressed Sparse Row (CSR)

Problem 1

How to find in-edges of a vertex quickly

Note We know the shard but edges in random orderPointer-array

Edge-array

51

PAL In-edge Linkage

Destination

8

193

76420

12

872

193

212

89139

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

src dst

Pointer-array

Edge-array

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
  – Or dynamic scheduling (such as Splash BP)

• High-diameter graphs, such as planar graphs
  – Unless the computation itself has short-range interactions

• Very large number of iterations
  – Neural networks
  – LDA with Collapsed Gibbs sampling

• No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB (Partitioned Adjacency Lists)

Online graph updates

Batch comp.

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle Counting, Item-Item Similarity

PageRank, SALSA, HITS

Graph Contraction, Minimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected Components, Strongly Connected Components

Community Detection, Label Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief Propagation, Co-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

• Distributed Setting
  1. Distributed PSW (one shard / node)
     1. PSW is inherently sequential
     2. Low bandwidth in the Cloud
  2. Co-operating GraphChi(-DB)'s connected with a Parameter Server

• New graph programming models and tools
  – Vertex-centric programming sometimes too local: for example, two-hop interactions and many traversals cumbersome
  – Abstractions for learning graph structure; implicit graphs
  – Hard to debug, especially async. Better tools needed.

• Graph-aware optimizations to GraphChi-DB
  – Buffer management
  – Smart caching
  – Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY?

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM
  – Good: works with very little memory
  – Bad: does not benefit from more memory

• Simple trick: cache some data

• Many graph algorithms have O(V) state
  – Update function accesses neighbor vertex state
  • Standard PSW: 'broadcast' vertex value via edges
  • Semi-external: store vertex values in memory (see the sketch below)
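As a concrete illustration of the semi-external trick, here is a minimal, self-contained sketch (the layout and names are assumed for illustration, not GraphChi's actual code): the only O(V) state, a label per vertex, lives in an in-memory array, while the edge list, which here stands in for the on-disk edge stream, never has to carry neighbor values.

public class SemiExternalComponents {
    public static void main(String[] args) {
        int numVertices = 5;
        // The O(V) algorithm state kept in RAM: one component label per vertex.
        int[] label = new int[numVertices];
        for (int v = 0; v < numVertices; v++) label[v] = v;

        // Stand-in for the on-disk edge file: (src, dst) pairs streamed sequentially.
        int[][] edges = { {0, 1}, {1, 2}, {3, 4} };

        boolean changed = true;
        while (changed) {
            changed = false;
            // One sequential pass over the edges; neighbor labels are read directly
            // from the in-memory array, so no vertex value has to be 'broadcast' via edges.
            for (int[] e : edges) {
                int m = Math.min(label[e[0]], label[e[1]]);
                if (label[e[0]] != m) { label[e[0]] = m; changed = true; }
                if (label[e[1]] != m) { label[e[1]] = m; changed = true; }
            }
        }
        System.out.println(java.util.Arrays.toString(label)); // prints [0, 0, 0, 3, 3]
    }
}

With |V| in the hundreds of millions this per-vertex array is only a few gigabytes, which fits in RAM even when the edges do not.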

82

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
  – But not enough to store the whole graph

[Diagram: the graph (PSW working memory) stays on disk, while the computational state (Computation 1 state, Computation 2 state, …) is kept in RAM.]

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
  – Store billions of random walk states in RAM

• Multiple Breadth-First-Searches
  – Analyze neighborhood sizes by starting hundreds of random BFSes (see the sketch below)

• Compute in parallel many different recommender algorithms (or with different parameterizations)
  – See Mayank Mohta and Shu-Hao Yu's Master's project
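For the multiple-BFS case, a common way to keep the per-computation state small is one bit per (vertex, source) pair. The sketch below is illustrative (not GraphChi's implementation): it runs up to 64 traversals at once with a single long per vertex and repeated passes over an edge stream; labels may propagate more than one hop per pass, in the Gauss-Seidel spirit of PSW.

public class MultiBFS {
    public static void main(String[] args) {
        int numVertices = 6;
        int[][] edges = { {0, 1}, {1, 2}, {2, 3}, {4, 5} };   // undirected; stands in for the on-disk graph
        int[] sources = { 0, 4 };                              // up to 64 simultaneous BFS roots

        // O(V) state in RAM: bit i of reached[v] is set when v is reached from sources[i].
        long[] reached = new long[numVertices];
        for (int i = 0; i < sources.length; i++) reached[sources[i]] |= 1L << i;

        boolean changed = true;
        while (changed) {                                      // sequential passes until no mask changes
            changed = false;
            for (int[] e : edges) {
                long merged = reached[e[0]] | reached[e[1]];
                if (merged != reached[e[0]]) { reached[e[0]] = merged; changed = true; }
                if (merged != reached[e[1]]) { reached[e[1]] = merged; changed = true; }
            }
        }

        // Neighborhood (reachable-set) size for each BFS source.
        for (int i = 0; i < sources.length; i++) {
            int count = 0;
            for (long mask : reached) if ((mask & (1L << i)) != 0) count++;
            System.out.println("BFS from vertex " + sources[i] + " reaches " + count + " vertices");
        }
    }
}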

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin). UAI 2010.

Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same co-authors). VLDB 2012.

GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch). OSDI 2012.

DrunkardMob: Billions of Random Walks on Just a PC. ACM RecSys 2013.

Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch). SEA 2014.

GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin). (submitted, arXiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin). ICML 2011.

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external memory graph computation: GraphChi

• Extended PSW to design Partitioned Adjacency Lists, to build a scalable graph database: GraphChi-DB

• Proposed DrunkardMob for simulating billions of random walks in parallel

• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms. New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                   GraphChi (40 Mac Minis)   PowerGraph (64 EC2 cc1.4xlarge)
Investments        67,320 $                  -

Operating costs:
Per node, hour     0.03 $                    1.30 $
Cluster, hour      1.19 $                    52.00 $
Daily              28.56 $                   1,248.00 $

It takes about 56 days to recoup the Mac Mini investments.

Assumptions:
- Mac Mini: 85 W (typical servers: 500-1000 W)
- Most expensive US energy: 35c / kWh

Equal-throughput configurations (based on OSDI'12)
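A quick check of the arithmetic from the table above: the cluster costs roughly 1,248.00 $ - 28.56 $ ≈ 1,219 $ more per day to operate, so the 67,320 $ hardware investment is recovered in about 67,320 / 1,219 ≈ 55 days, consistent with the roughly 56 days stated above.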

89

PSW for In-memory Computation

• External memory setting
  – Slow memory = hard disk / SSD
  – Fast memory = RAM

• In-memory
  – Slow = RAM
  – Fast = CPU caches

[Diagram: small (fast) memory of size M, large (slow) memory of 'unlimited' size, CPU, block transfers of size B]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including preprocessing steps of both systems.

Julian Shun, Guy Blelloch, PPoPP 2013

91

Remarks (sync vs. async)

• Bulk-Synchronous is embarrassingly parallel
  – But needs twice the amount of space

• Async / G-S helps with high-diameter graphs
• Some algorithms converge much better asynchronously
  – Loopy BP: see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)

• Asynchronous sometimes difficult to reason about and debug

• Asynchronous can be used to implement BSP

IO Complexity

• See the paper for theoretical analysis in the Aggarwal-Vitter I/O model
  – Worst-case only 2x best-case

• Intuition:

[Diagram: vertex intervals interval(1), interval(2), …, interval(P) over vertex ids 1..|V|, with the corresponding shard(1), shard(2), …, shard(P)]

An edge inside one interval is loaded from disk only once per iteration.

An edge spanning two intervals is loaded twice per iteration.
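A rough back-of-the-envelope, assuming about 8 bytes of stored data per edge: a graph with 1.5 billion edges occupies on the order of 12 GB of edge data, so by the argument above one iteration streams roughly 12-24 GB of reads (plus a comparable volume of writes when edge values are modified). At a sustained ~100 MB/s of sequential disk bandwidth that is only a few minutes of I/O per iteration.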

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter: otherwise too many iterations would be required

• Per-iteration cost not much affected

• Can we optimize partitioning?
  – Could help thanks to Gauss-Seidel (faster convergence inside "groups"), topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
  – But require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)

• Compression problematic for evolving graphs, and associated data

• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al. 2011)

95

Previous research on (single-computer) Graph Databases

• 1990s and 2000s saw interest in object-oriented and graph databases
  – GOOD, GraphDB, HyperGraphDB…
  – Focus was on modeling; graph storage on top of relational DB or key-value store

• RDF databases
  – Most do not use graph storage but store triples as relations + use indexing

• Modern solutions have proposed graph-specific storage
  – Neo4j: doubly linked list
  – TurboGraph: adjacency list chopped into pages
  – DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont.)

• GraphChi load time: 9 hours; FB's: 12 hours
• GraphChi database about 250 GB; FB > 1.4 terabytes
  – However, about 100 GB explained by different variable data (payload) size

• Facebook: MySQL via JDBC; GraphChi: embedded
  – But MySQL is native code, GraphChi-DB is Scala (JVM)

• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance vs. size of DB

[Chart: LinkBench throughput (requests / sec, up to ~10,000) as a function of the number of nodes in the database (1.0E+07, 1.0E+08, 2.0E+08, 1.0E+09).]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory-extension [SSDAlloc, NSDI'11]
   → Too many small objects: need millions / sec

2. Compress the graph structure to fit into RAM [WebGraph framework]
   → Associated values do not compress well, and are mutated

3. Cluster the graph and handle each cluster separately in RAM
   → Expensive: the number of inter-cluster edges is big

4. Caching of hot nodes
   → Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks: parallel I/O

[Chart: runtime (seconds / iteration) of Pagerank and Connected components with 1, 2 and 3 hard drives. Experiment on a 16-core AMD server (from year 2007).]

Bottlenecks

[Chart: runtime of Connected Components on Mac Mini (SSD) with 1, 2 and 4 threads, broken down into Disk IO, Graph construction and Exec. updates; y-axis 0-2,500 seconds.]

• Cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
  – Graph construction requires a lot of random access in RAM: memory bandwidth becomes a bottleneck

Bottlenecks / Multicore

Experiment on MacBook Pro with 4 cores, SSD

• Computationally intensive applications benefit substantially from parallel execution

• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime (seconds), split into Loading and Computation, for Matrix Factorization (ALS) and Connected Components with 1, 2 and 4 threads.]

104

In-memory vs Disk

105

Experiment: Query latency

See thesis for I/O cost analysis of in/out queries

106

Example: Induced Subgraph Queries

• Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

• Very fast in GraphChi-DB
  – Sufficient to query for out-edges
  – Parallelizes well: multi-out-edge-query (see the sketch below)

• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods from graph
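A minimal sketch of an induced-subgraph query built only on out-edge queries. It is illustrative: outEdges stands in for the database's out-edge query and is not the actual GraphChi-DB API.

import java.util.*;
import java.util.function.LongFunction;

public class InducedSubgraph {
    // Returns the edges (src, dst) of the subgraph induced by vertex set S.
    static List<long[]> induced(Set<Long> s, LongFunction<long[]> outEdges) {
        List<long[]> result = new ArrayList<>();
        // One out-edge query per vertex in S; these queries are independent
        // and could be issued in parallel (the "multi-out-edge-query").
        for (long v : s) {
            for (long target : outEdges.apply(v)) {
                if (s.contains(target)) result.add(new long[]{v, target});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tiny in-memory adjacency as a stand-in for the graph database.
        Map<Long, long[]> adj = Map.of(
                1L, new long[]{2, 3},
                2L, new long[]{3, 4},
                3L, new long[]{1});
        Set<Long> s = Set.of(1L, 2L, 3L);
        for (long[] e : induced(s, v -> adj.getOrDefault(v, new long[0]))) {
            System.out.println(e[0] + " -> " + e[1]);
        }
        // Prints the induced edges 1->2, 1->3, 2->3, 3->1 (order may vary); 2->4 is dropped since 4 is not in S.
    }
}

Because only out-edge queries are needed, no in-edge index has to be consulted, and the per-vertex queries parallelize trivially.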

Vertices / Nodes

• Vertices are partitioned similarly as edges
  – Similar "data shards" for columns
• Lookup/update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
  – Sparse structure could be supported

[Diagram: vertex-id space 1 .. Max-id divided into intervals of length a: 1..a, a+1..2a, …]

ID-mapping

• Vertex IDs mapped to internal IDs to balance shards
  – Interval length constant a

[Diagram: example of the mapping; original IDs 0, 1, 2, …, 255 and 256, 257, 258, …, 511 are spread over the internal-ID intervals 0, 1, 2, …; a, a+1, a+2, …; 2a, 2a+1, …]

109

What if we have a Cluster?

[Diagram: graph (PSW working memory) on disk; computational state (Computation 1 state, Computation 2 state, …) in RAM.]

Trade latency for throughput

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications

2. … which is caused by lack of good data available for the academics: big graphs with metadata
   – Industry co-operation? But problem with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just structure)

3. Too much focus on performance. More important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java, 2009

Distributed graph systems:
- Each hop across a partition boundary is costly

Disk-based "single-machine" graph systems:
- "Paging" from disk is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
  – Reverse thinking

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk!
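The following toy sketch illustrates the DrunkardMob idea with assumed, simplified data structures (the real implementation uses a far more compact walk encoding and GraphChi's shards; none of the names below are the actual API): walks are bucketed by the interval their current vertex belongs to, so advancing all walks of one bucket only needs that interval's sub-graph, while every walk state stays in RAM.

import java.util.*;

public class DrunkardMobSketch {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int numVertices = 1000, intervalLength = 250, numIntervals = 4;
        // Stand-in for the graph: a couple of random out-neighbors per vertex.
        int[][] outNeighbors = new int[numVertices][];
        for (int v = 0; v < numVertices; v++) {
            outNeighbors[v] = new int[]{rnd.nextInt(numVertices), rnd.nextInt(numVertices)};
        }

        // Walk state = current vertex only, grouped by the interval it falls into.
        List<List<Integer>> walksByInterval = new ArrayList<>();
        for (int p = 0; p < numIntervals; p++) walksByInterval.add(new ArrayList<>());
        for (int w = 0; w < 100_000; w++) {               // start 100K walks at random vertices
            int v = rnd.nextInt(numVertices);
            walksByInterval.get(v / intervalLength).add(v);
        }

        for (int step = 0; step < 10; step++) {           // advance every walk by one hop
            List<List<Integer>> next = new ArrayList<>();
            for (int p = 0; p < numIntervals; p++) next.add(new ArrayList<>());
            for (int p = 0; p < numIntervals; p++) {      // "ForEach interval p": only interval p's graph is needed here
                for (int vertex : walksByInterval.get(p)) {
                    int[] nbrs = outNeighbors[vertex];
                    int hop = nbrs[rnd.nextInt(nbrs.length)];
                    next.get(hop / intervalLength).add(hop);   // plays the role of walkManager.addHop(...)
                }
            }
            walksByInterval = next;
        }
        System.out.println("walks in interval 0 after 10 steps: " + walksByInterval.get(0).size());
    }
}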

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 51: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

52

PAL In-edge Linkage

Destination Link

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

Source File offset

1 0

3 3

7 5

hellip hellip

+ Index to the first in-edge for each vertex in interval

Pointer-array

Edge-array

Problem 2

How to find out-edges quickly

Note Sorted inside a shard but partitioned across all shards

Augmented linked list for in-edges

53

Source File offset

1 0

3 3

7 5

hellip hellip

PAL Out-edge Queries

Pointer-array

Edge-array

Problem Can be big -- O(V) Binary search on disk slow

Destination

Next-in-offset

8 3339

193 3

76420 1092

12 289

872 40

193 2002

212 12

89139 22

hellip

256

512

1024

hellip

Option 1Sparse index (inmem)

Option 2Delta-coding with unary code (Elias-Gamma) Completely in-memory

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Page 52: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

53

PAL: Out-edge Queries

Pointer-array:

Source | File offset
1      | 0
3      | 3
7      | 5
…      | …

Edge-array (Destination, Next-in-offset):

8     | 3339
193   | 3
76420 | 1092
12    | 289
872   | 40
193   | 2002
212   | 12
89139 | 22
…     | …

Problem: the pointer-array can be big, O(V), and binary search on disk is slow.

Option 1: Sparse index (in-memory); the figure shows sampled entries 256, 512, 1024, …
Option 2: Delta-coding with unary code (Elias-Gamma), completely in-memory (see the sketch below).
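As an illustration of Option 2, here is a small self-contained Java sketch of Elias-Gamma coding, applied to the gaps between successive source IDs of the pointer-array. This is not the GraphChi-DB code: the class and method names are invented, and whether the real system codes the IDs, the offsets, or both this way is not shown on this slide.

import java.util.ArrayList;
import java.util.List;

// Sketch: Elias-Gamma code the gaps between sorted source IDs, so a
// pointer-array-like structure can be held compactly in memory.
public class EliasGammaSketch {

    // Gamma code of n >= 1: (len - 1) zero bits, then the len-bit binary form of n.
    static void encode(int n, StringBuilder bits) {
        int len = 32 - Integer.numberOfLeadingZeros(n);
        for (int i = 0; i < len - 1; i++) bits.append('0');
        for (int i = len - 1; i >= 0; i--) bits.append((char) ('0' + ((n >> i) & 1)));
    }

    // Decode a whole bit string back into the coded integers.
    static List<Integer> decodeAll(String bits) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < bits.length()) {
            int zeros = 0;
            while (bits.charAt(i) == '0') { zeros++; i++; }
            int n = 0;
            for (int j = 0; j <= zeros; j++) n = (n << 1) | (bits.charAt(i++) - '0');
            out.add(n);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] sources = {1, 3, 7, 12};   // sorted source IDs (1, 3, 7 from the slide, 12 added for illustration)
        StringBuilder bits = new StringBuilder();
        int prev = 0;
        for (int s : sources) { encode(s - prev, bits); prev = s; }
        // Prefix-summing the decoded gaps recovers the original source IDs.
        System.out.println(bits + "  ->  gaps " + decodeAll(bits.toString()));
    }
}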

54

Experiment Indices

Median latency, Twitter graph (1.5B edges).

55

Queries: I/O costs

In-edge query: touches only one shard.

Out-edge query: touches each shard that has edges for the vertex.

Trade-off: more shards means better locality for in-edge queries, but worse performance for out-edge queries.

Edge Data & Searches

Shard X: adjacency; 'weight' [float]; 'timestamp' [long]; 'belief' (factor).

Edge (i): no foreign key is required to find the edge's data, because the column entries are fixed-size, so the data of edge i is found by offset arithmetic.

Note: vertex values are stored similarly.
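A minimal sketch of that fixed-size column idea (illustrative file name and layout, not the actual GraphChi-DB format): the i-th edge's 'weight' lives at byte offset i * 4 of the weight column file, so no index or foreign-key lookup is needed.

import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: fixed-size columnar edge data; edge i's value is located by
// plain offset arithmetic instead of a foreign-key lookup.
public class EdgeColumnSketch {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile weights = new RandomAccessFile("shardX.weight.col", "rw")) {
            // Write weights for edges 0..9 (4 bytes each).
            for (int i = 0; i < 10; i++) {
                weights.seek(4L * i);
                weights.writeFloat(0.5f * i);
            }
            // Read back edge 7's weight directly by offset.
            weights.seek(4L * 7);
            System.out.println("weight(edge 7) = " + weights.readFloat());
        }
    }
}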

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading the existing shard from disk, and every edge is rewritten on each merge.

Does not scale: number of rewrites = data size / buffer size = O(E).

60

Log-Structured Merge-tree (LSM)
Ref: O'Neil, Cheng et al. (1996)

[Figure: new edges go into the top shards with in-memory buffers (LEVEL 1, youngest; one shard per interval 1, 2, 3, 4, …, P-1, P). Full shards are merged downstream into larger on-disk shards: LEVEL 2 shards each cover several intervals (intervals 1--4, …, (P-4)--P), and LEVEL 3 (oldest) shards cover even wider ranges (intervals 1--16, …).]

In-edge query: one shard on each level.
Out-edge query: all shards.
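To show the leveling idea in isolation, here is a small self-contained Java sketch of log-structured merging over sorted long keys (think of a source/destination pair packed into one long). The buffer and level capacities, and the use of in-memory lists instead of shard files, are simplifications made up for this example.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: log-structured merging. New keys land in a small buffer; when a level
// overflows, it is merged downstream into the next (larger) level, so each key
// is rewritten only O(number of levels) times instead of O(data size / buffer size).
public class LsmSketch {
    static final int BUFFER_CAPACITY = 4;                 // tiny, so merges actually happen
    final List<Long> buffer = new ArrayList<>();
    final List<List<Long>> levels = new ArrayList<>();    // level i holds up to BUFFER_CAPACITY * 4^(i+1) keys

    void insert(long key) {
        buffer.add(key);
        if (buffer.size() >= BUFFER_CAPACITY) {
            List<Long> run = new ArrayList<>(buffer);
            Collections.sort(run);
            buffer.clear();
            mergeInto(0, run);                            // flush the sorted run into the first level
        }
    }

    // Merge a sorted run into level i; cascade downstream if the level overflows.
    void mergeInto(int i, List<Long> run) {
        while (levels.size() <= i) levels.add(new ArrayList<>());
        List<Long> merged = mergeSorted(levels.get(i), run);
        long capacity = BUFFER_CAPACITY * Math.round(Math.pow(4, i + 1));
        if (merged.size() > capacity) {
            levels.set(i, new ArrayList<>());
            mergeInto(i + 1, merged);                     // downstream merge
        } else {
            levels.set(i, merged);
        }
    }

    static List<Long> mergeSorted(List<Long> a, List<Long> b) {
        List<Long> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) out.add(a.get(i) <= b.get(j) ? a.get(i++) : b.get(j++));
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    public static void main(String[] args) {
        LsmSketch lsm = new LsmSketch();
        for (long k = 1000; k > 0; k--) lsm.insert(k);    // inserts arrive in arbitrary (here reverse) order
        for (int i = 0; i < lsm.levels.size(); i++)
            System.out.println("level " + (i + 1) + ": " + lsm.levels.get(i).size() + " keys");
    }
}

A lookup in such a structure has to consult the buffer plus one run per level, which corresponds to the "in-edge query: one shard on each level" cost noted above.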

61

Experiment: Ingest

[Figure: number of edges ingested (0 to 1.5e9) as a function of time in seconds (0 to 16,000), comparing GraphChi-DB with LSM against GraphChi-No-LSM.]

62

Advantages of PAL

• Only sparse and implicit indices
  – The pointer-array usually fits in RAM with Elias-Gamma coding
  – Small database size
• Columnar data model
  – Load only the data you need
  – Graph structure is separate from the data
  – Property graph model
• Great insertion throughput with LSM
  – The tree can be adjusted to match the workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

• Written in Scala
• Queries & Computation
• Online database

All experiments shown in this talk were done on a Mac Mini (8 GB, SSD).

Source code and examples: http://github.com/graphchi

65

Comparison: Database Size

[Figure: database file size for the twitter-2010 graph (1.5B edges), on a 0--70 scale, for GraphChi-DB, Neo4j, MySQL (data + indices), and a baseline of 4 + 4 bytes per edge.]

66

Comparison: Ingest

System                 | Time to ingest 1.5B edges
GraphChi-DB (online)   | 1 hour 45 mins
Neo4j (batch)          | 45 hours
MySQL (batch)          | 3 hours 30 minutes (including index creation)

If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes.

Comparison: Friends-of-Friends Query

See thesis for the shortest-path comparison.

[Figure: friends-of-friends query latency percentiles (milliseconds) over 100K random queries; panels show 50th- and 99th-percentile latencies for GraphChi-DB and Neo4j on a small graph (68M edges), and for GraphChi-DB, Neo4j and MySQL on a big graph (1.5B edges).]

68

LinkBench: Online Graph DB Benchmark by Facebook

• Concurrent read/write workload
  – But only single-hop queries ("friends")
  – 8 different operations, mixed workload
  – Best performance with 64 parallel threads
• Each edge and vertex has: version, timestamp, type, random string payload

                          | GraphChi-DB (Mac Mini) | MySQL + FB patch (server, SSD-array, 144 GB RAM)
Edge-update (95p)         | 22 ms                  | 25 ms
Edge-get-neighbors (95p)  | 18 ms                  | 9 ms
Avg throughput            | 2,487 req/s            | 11,029 req/s
Database size             | 350 GB                 | 1.4 TB

See full results in the thesis

69

Summary of Experiments

• Efficient for a mixed read/write workload
  – See the Facebook LinkBench experiments in the thesis
  – The LSM-tree trades off read performance (but can be adjusted)
• State-of-the-art performance for graphs that are much larger than RAM
  – Neo4j's linked-list data structure is good for RAM
  – DEX performs poorly in practice

More experiments in the thesis.

70

Discussion

Greater Impact
Hindsight

Future Research Questions

71

GREATER IMPACT

72

Impact: "Big Data" Research

• GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)
  – Two major direct descendant papers in top conferences:
    • X-Stream: SOSP 2013
    • TurboGraph: KDD 2013
• Challenging the mainstream: you can do a lot on just a PC if you focus on the right data structures and computational models

73

Impact Users

• GraphChi's implementations (C++, Java, -DB) have gained a lot of users
  – Currently ~50 unique visitors / day
• Enables 'everyone' to tackle big graph problems
  – Especially the recommender toolkit (by Danny Bickson) has been very popular
  – Typical users: students, non-systems researchers, small companies…

74

Impact Users (cont)

"I work in a [EU country] public university. I can't use a distributed computing cluster for my research: it is too expensive.

Using GraphChi, I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research."

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

• Original target algorithm: Belief Propagation on Probabilistic Graphical Models

1. Changing the value of an edge (both in- and out-edges)
2. Computation processes the whole graph, or most of it, on each iteration
3. Random access to all of a vertex's edges
   – Vertex-centric vs. edge-centric
4. Async / Gauss-Seidel execution
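As an illustration of the vertex-centric model (a generic sketch with invented types, not GraphChi's actual API), a PageRank-style update function reads the values on a vertex's in-edges and writes its new value to its out-edges:

import java.util.ArrayList;
import java.util.List;

// Sketch of a vertex-centric update in the spirit of GraphChi's model:
// the update function sees only one vertex and its in-/out-edges.
public class VertexCentricSketch {
    static class Edge { double value; }

    static class Vertex {
        final List<Edge> in = new ArrayList<>(), out = new ArrayList<>();
        double value = 1.0;

        // PageRank-style update: gather from in-edges, scatter to out-edges.
        void update() {
            double sum = 0;
            for (Edge e : in) sum += e.value;
            value = 0.15 + 0.85 * sum;
            double share = value / Math.max(1, out.size());
            for (Edge e : out) e.value = share;
        }
    }

    public static void main(String[] args) {
        // Two vertices connected by a single directed edge a -> b.
        Vertex a = new Vertex(), b = new Vertex();
        Edge ab = new Edge();
        a.out.add(ab);
        b.in.add(ab);

        for (int iter = 0; iter < 3; iter++) {  // Gauss-Seidel order: b sees a's fresh value in the same sweep
            a.update();
            b.update();
        }
        System.out.printf("a = %.3f, b = %.3f%n", a.value, b.value);
    }
}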

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
  – Or dynamic scheduling (such as Splash BP)
• High-diameter graphs, such as planar graphs
  – Unless the computation itself has only short-range interactions
• Very large numbers of iterations
  – Neural networks
  – LDA with Collapsed Gibbs sampling
• No support for implicit graph structure

+ Single-PC performance is limited.

78

Versatility of PSW and PAL

[Figure: GraphChi (Parallel Sliding Windows) and GraphChi-DB (Partitioned Adjacency Lists) together support online graph updates, batch computation, evolving graphs and incremental computation. Applications shown span Graph Computation and Graph Queries / traversal: DrunkardMob (parallel random walk simulation), GraphChi^2, Triangle Counting, Item-Item Similarity, PageRank, SALSA, HITS, Graph Contraction, Minimum Spanning Forest, k-Core, Shortest Path, Friends-of-Friends, Induced Subgraphs, Neighborhood queries, Edge and vertex properties, Link prediction, Weakly / Strongly Connected Components, Community Detection, Label Propagation, Multi-BFS, Graph sampling, Matrix Factorization, Loopy Belief Propagation, Co-EM.]

79

Future Research Directions

• Distributed setting
  1. Distributed PSW (one shard / node)?
     – PSW is inherently sequential
     – Low bandwidth in the Cloud
  2. Co-operating GraphChi(-DB)'s connected with a Parameter Server
• New graph programming models and tools
  – Vertex-centric programming is sometimes too local: for example, two-hop interactions and many traversals are cumbersome
  – Abstractions for learning graph structure? Implicit graphs?
  – Hard to debug, especially async; better tools are needed
• Graph-aware optimizations to GraphChi-DB
  – Buffer management
  – Smart caching
  – Learning the configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM
  – Good: works with very little memory
  – Bad: does not benefit from more memory
• Simple trick: cache some data
• Many graph algorithms have O(V) state
  – The update function accesses neighbor vertex state
    • Standard PSW: 'broadcast' the vertex value via the edges
    • Semi-external: store the vertex values in memory (see the sketch below)
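A minimal sketch of the semi-external trick, under simplifying assumptions (the names are invented and the edge list is simulated by in-memory arrays, standing in for edges streamed from disk with PSW): the O(V) state, one float per vertex, stays in RAM, so the update never has to write values onto the edges.

import java.util.Arrays;
import java.util.Random;

// Sketch: semi-external computation. O(V) state (one float per vertex) is kept
// in RAM; the O(E) edge list is only streamed sequentially (here simulated by arrays).
public class SemiExternalSketch {
    public static void main(String[] args) {
        int numVertices = 1_000_000, numEdges = 10_000_000;
        Random rng = new Random(1);

        int[] src = new int[numEdges], dst = new int[numEdges];
        for (int e = 0; e < numEdges; e++) {
            src[e] = rng.nextInt(numVertices);
            dst[e] = rng.nextInt(numVertices);
        }
        int[] outDegree = new int[numVertices];
        for (int e = 0; e < numEdges; e++) outDegree[src[e]]++;

        float[] rank = new float[numVertices];            // in-RAM vertex state
        Arrays.fill(rank, 1.0f);

        for (int iter = 0; iter < 5; iter++) {
            float[] next = new float[numVertices];
            for (int e = 0; e < numEdges; e++)            // one sequential pass over the edges
                next[dst[e]] += rank[src[e]] / Math.max(1, outDegree[src[e]]);
            for (int v = 0; v < numVertices; v++)
                rank[v] = 0.15f + 0.85f * next[v];
        }
        System.out.println("rank[0] = " + rank[0]);
    }
}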

82

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
  – But not enough to store the whole graph

[Figure: the graph working memory (PSW) stays on disk; the computational state (Computation 1 state, Computation 2 state, …) lives in RAM.]

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
  – Store billions of random-walk states in RAM
• Multiple Breadth-First Searches
  – Analyze neighborhood sizes by starting hundreds of random BFSes (see the bit-parallel sketch below)
• Compute many different recommender algorithms in parallel (or one with different parameterizations)
  – See Mayank Mohta and Shu-Hao Yu's Master's project
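For the multiple-BFS case, one common trick, shown here as a generic sketch rather than GraphChi code, is to pack up to 64 BFS frontiers into one long per vertex, so all searches advance together in a single pass over the edges:

// Sketch: up to 64 breadth-first searches advanced simultaneously with bit-parallel
// frontiers. visited[v] has bit s set iff search number s has reached vertex v.
public class MultiBfsSketch {
    public static void main(String[] args) {
        int n = 8;
        int[][] adj = { {1, 2}, {3}, {3, 4}, {5}, {5}, {6}, {7}, {} };   // toy directed graph

        int[] sources = {0, 2, 4};                      // three searches; up to 64 fit in one long
        long[] visited = new long[n];
        long[] frontier = new long[n];
        for (int s = 0; s < sources.length; s++) {
            frontier[sources[s]] |= 1L << s;
            visited[sources[s]] |= 1L << s;
        }

        boolean active = true;
        while (active) {
            long[] next = new long[n];
            for (int v = 0; v < n; v++)
                if (frontier[v] != 0)
                    for (int u : adj[v])
                        next[u] |= frontier[v] & ~visited[u];   // only searches that are new at u
            active = false;
            for (int v = 0; v < n; v++) {
                visited[v] |= next[v];
                frontier[v] = next[v];
                if (next[v] != 0) active = true;
            }
        }
        for (int v = 0; v < n; v++)
            System.out.println("vertex " + v + " reached by searches (bit mask): " + Long.toBinaryString(visited[v]));
    }
}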

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin). UAI 2010.

Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors). VLDB 2012.

GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch). OSDI 2012.

DrunkardMob: Billions of Random Walks on Just a PC. ACM RecSys 2013.

Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch). SEA 2014.

GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin). Submitted (arXiv).

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin). ICML 2011.

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation: GraphChi
• Extended PSW to design Partitioned Adjacency Lists, used to build a scalable graph database: GraphChi-DB
• Proposed DrunkardMob for simulating billions of random walks in parallel
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms; a new approach for external-memory graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                   GraphChi (40 Mac Minis) | PowerGraph (64 EC2 cc1.4xlarge)
Investments        67,320 $                | -
Operating costs:
  Per node, hour   0.03 $                  | 1.30 $
  Cluster, hour    1.19 $                  | 52.00 $
  Daily            28.56 $                 | 1,248.00 $

It takes about 56 days to recoup the Mac Mini investment.

Assumptions: Mac Mini 85 W (typical servers: 500-1000 W); most expensive US energy, 35c / kWh.
Equal-throughput configurations (based on OSDI'12).
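As a rough sanity check on the payback claim (using the dollar figures as reconstructed above, so treat the exact numbers as approximate): the daily operating-cost difference is about 1,248.00 $ - 28.56 $ ≈ 1,219 $, and 67,320 $ / 1,219 $ per day ≈ 55 days, which is consistent with the "about 56 days" estimate.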

89

PSW for In-memory Computation

• External-memory setting
  – Slow memory = hard disk / SSD
  – Fast memory = RAM
• In-memory setting
  – Slow = RAM
  – Fast = CPU caches

[Figure: the standard external-memory model: a small (fast) memory of size M, a large (slow) memory of 'unlimited' size, and block transfers of size B between them, driven by the CPU.]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW (slightly) outperforms Ligra, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

Ligra: Julian Shun, Guy Blelloch, PPoPP 2013.

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
  – But needs twice the amount of space
• Async / Gauss-Seidel helps with high-diameter graphs
• Some algorithms converge much better asynchronously
  – Loopy BP: see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)
• Asynchronous execution is sometimes difficult to reason about and debug
• Asynchronous execution can be used to implement BSP

IO Complexity

• See the paper for a theoretical analysis in Aggarwal and Vitter's I/O model
  – The worst case is only 2x the best case
• Intuition:

[Figure: intervals interval(1)..interval(P) over the vertex range 1..|V|, with shards shard(1)..shard(P). An edge whose endpoints v1, v2 fall in the same interval is loaded from disk only once per iteration; an edge spanning two intervals is loaded twice per iteration.]

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise too many iterations would be required
• The per-iteration cost is not much affected
• Can we optimize the partitioning?
  – It could help, thanks to Gauss-Seidel (faster convergence inside "groups"); topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., the WebGraph framework) can compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
  – But they require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)
• Compression is problematic for evolving graphs and their associated data
• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases
  – GOOD, GraphDB, HyperGraphDB…
  – The focus was on modeling; graph storage was built on top of a relational DB or a key-value store
• RDF databases
  – Most do not use graph storage, but store triples as relations and use indexing
• Modern solutions have proposed graph-specific storage
  – Neo4j: doubly linked lists
  – TurboGraph: adjacency lists chopped into pages
  – DEX: compressed bitmaps (details not clear)

96

LinkBench: Comparison to FB (cont.)

• GraphChi load time: 9 hours; FB's: 12 hours
• GraphChi database: about 250 GB; FB: > 1.4 terabytes
  – However, about 100 GB is explained by a different variable-data (payload) size
• Facebook accesses MySQL via JDBC; GraphChi-DB is embedded
  – But MySQL is native code, GraphChi-DB Scala (JVM)
• Important: a CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance / Size of DB

[Figure: LinkBench throughput (requests / sec, scale 0 to 10,000) as a function of the number of nodes (1.0e7, 1.0e8, 2.0e8, 1.0e9).]

Why not even faster when everything is in RAM? The requests are computationally bound.

Possible Solutions

1. Use SSD as a memory extension? [SSDAlloc, NSDI'11]
   – Too many small objects; millions / sec would be needed
2. Compress the graph structure to fit into RAM? [WebGraph framework]
   – The associated values do not compress well, and are mutated
3. Cluster the graph and handle each cluster separately in RAM?
   – Expensive; the number of inter-cluster edges is big
4. Caching of hot nodes?
   – Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance.

Multiple hard drives (RAID-ish)

• GraphChi supports striping shards to multiple disks: parallel I/O.

[Figure: seconds / iteration for Pagerank and Connected Components with 1, 2 and 3 hard drives; experiment on a 16-core AMD server (from year 2007).]

Bottlenecks

[Figure: Connected Components on a Mac Mini (SSD); runtime with 1, 2 and 4 threads, split into disk I/O, graph construction and executing updates.]

• The cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
  – Graph construction requires a lot of random access in RAM; memory bandwidth becomes a bottleneck

Bottlenecks: Multicore

Experiment on a MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD I/O with 2 threads

[Figure: runtime (seconds), split into loading and computation, for 1, 2 and 4 threads; left panel Matrix Factorization (ALS), right panel Connected Components.]

104

In-memory vs Disk

105

Experiment Query latency

See thesis for an I/O cost analysis of in- and out-edge queries.

106

Example Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges of the graph that have both endpoints in S
• Very fast in GraphChi-DB
  – It is sufficient to query for out-edges (see the sketch below)
  – Parallelizes well: multi-out-edge-query
• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods, from the graph
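A small sketch of the induced-subgraph idea in generic Java, with an invented in-memory outEdges map standing in for GraphChi-DB's out-edge query: one out-edge query per vertex of S is enough, and an edge is kept when both endpoints lie in S.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: induced subgraph of a vertex set S, using only out-edge queries.
public class InducedSubgraphSketch {
    public static void main(String[] args) {
        // Stand-in for the database's out-edge query: vertex -> out-neighbors.
        Map<Integer, List<Integer>> outEdges = Map.of(
                1, List.of(2, 3, 9),
                2, List.of(3),
                3, List.of(1, 7),
                7, List.of(2));

        Set<Integer> s = new HashSet<>(List.of(1, 2, 3));
        List<int[]> induced = new ArrayList<>();
        for (int v : s)                                            // one out-edge query per vertex in S
            for (int u : outEdges.getOrDefault(v, List.of()))
                if (s.contains(u)) induced.add(new int[]{v, u});   // keep edges with both endpoints in S

        for (int[] e : induced) System.out.println(e[0] + " -> " + e[1]);
    }
}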

Vertices / Nodes

• Vertices are partitioned similarly to edges
  – Similar "data shards" for the columns
• Lookup / update of vertex data is O(1)
• No merge tree here: the vertex files are "dense"
  – A sparse structure could be supported

[Figure: vertex ID ranges 1..a, a+1..2a, …, up to Max-id.]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance the shards
  – Interval length constant a

[Figure: mapping between original IDs (0, 1, 2, …, 255; 256, 257, 258, …, 511; …) and internal IDs (0, 1, 2, …; a, a+1, a+2, …; 2a, 2a+1, …).]

109

What if we have a Cluster?

[Figure: as in the semi-external setting, each machine keeps the computational state (Computation 1 state, Computation 2 state, …) in RAM and the graph working memory (PSW) on disk.]

Trade latency for throughput.

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications
2. … which is caused by a lack of good data available to academics: big graphs with metadata
   – Industry co-operation? But there are problems with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just the structure)
3. Too much focus on performance. More important to enable "extracting value".

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
  for i = 1 to numsteps:
    vertex = walk.atVertex()
    walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems: "paging" from disk is costly.

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm – reverse thinking:

ForEach interval p:
  walkSnapshot = getWalksForInterval(p)
  ForEach vertex in interval(p):
    mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
    ForEach walk in mywalks:
      walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk.
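To make the bookkeeping concrete, here is a minimal, self-contained Java sketch of the DrunkardMob idea under simplifying assumptions: the graph is a small in-memory adjacency array (in GraphChi the subgraph of interval p would be loaded from disk with PSW), the walk state is just the walk's current vertex, and all names are invented for the example. A real implementation would also bucket the walks by interval instead of scanning them all.

import java.util.Random;

// Sketch: advance many random walks one interval at a time ("reverse thinking").
// Only interval p's vertices (and, in GraphChi, their out-edges) need to be
// resident while its walks take their next hop.
public class DrunkardMobSketch {
    public static void main(String[] args) {
        int numVertices = 1000, numWalks = 100_000, numIntervals = 4, hops = 10;
        int intervalLen = numVertices / numIntervals;
        Random rng = new Random(42);

        // Toy graph: every vertex gets three random out-neighbors.
        int[][] outNeighbors = new int[numVertices][3];
        for (int v = 0; v < numVertices; v++)
            for (int j = 0; j < 3; j++) outNeighbors[v][j] = rng.nextInt(numVertices);

        // Walk state: only the current vertex of each walk.
        int[] walkAt = new int[numWalks];
        for (int w = 0; w < numWalks; w++) walkAt[w] = rng.nextInt(numVertices);

        for (int hop = 0; hop < hops; hop++) {
            int[] next = new int[numWalks];                    // buffered hops (the walkManager's role)
            for (int p = 0; p < numIntervals; p++) {
                int lo = p * intervalLen, hi = lo + intervalLen;
                for (int w = 0; w < numWalks; w++) {
                    int v = walkAt[w];
                    if (v >= lo && v < hi) {                   // walk is currently in interval p
                        int[] nbrs = outNeighbors[v];
                        next[w] = nbrs[rng.nextInt(nbrs.length)];
                    }
                }
            }
            walkAt = next;                                     // apply all hops of this iteration
        }
        System.out.println("First walk ended at vertex " + walkAt[0]);
    }
}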

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • "Gauss-Seidel" / Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data & Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM), Ref: O'Neil, Cheng et al. (1996)
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact: "Big Data" Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 53: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

54

Experiment Indices

Median latency Twitter-graph 15B edges

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 54: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

55

Queries IO costsIn-edge query only one shard

Out-edge query each shard that has edges

Trade-offMore shards Better locality for in-edge queries worse for out-edge queries

Edge Data amp Searches

Shard X- adjacency

lsquoweightlsquo [float]

lsquotimestamprsquo [long] lsquobeliefrsquo (factor)

Edge (i)

No foreign key required to find edge data(fixed size)

Note vertex values stored similarly

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

• Cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
– Graph construction requires a lot of random access in RAM; memory bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

• Computationally intensive applications benefit substantially from parallel execution

• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime (seconds) vs. number of threads (1, 2, 4), split into Loading and Computation, for Matrix Factorization (ALS) and for Connected Components]

104

In-memory vs Disk

105

Experiment Query latency

See thesis for I/O cost analysis of in/out queries

106

Example Induced Subgraph Queries

• Induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S

• Very fast in GraphChi-DB
– Sufficient to query for out-edges
– Parallelizes well: multi-out-edge-query

• Can be used for statistical graph analysis
– Sample induced neighborhoods, induced FoF neighborhoods from the graph
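As a concrete illustration of the out-edge-only strategy, a minimal sketch in Scala; Edge and the queryOut function are assumptions standing in for the GraphChi-DB query API, not its actual interface:

object InducedSubgraphSketch {
  // 'queryOut' stands in for GraphChi-DB's out-edge query (assumed signature).
  case class Edge(src: Long, dst: Long)

  def inducedSubgraph(s: Set[Long], queryOut: Long => Seq[Edge]): Seq[Edge] =
    s.toSeq
      .flatMap(queryOut)               // one out-edge query per vertex of S
      .filter(e => s.contains(e.dst))  // keep only edges whose other endpoint is also in S
}

Because only out-edges are needed, the per-vertex queries are independent and can be issued as one parallel multi-out-edge query, which is what the slide refers to.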

Vertices / Nodes

• Vertices are partitioned similarly as edges
– Similar "data shards" for columns
• Lookup/update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
– Sparse structure could be supported

[Figure: dense vertex file split at internal IDs 1, a, a+1, 2a, …, Max-id]
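Since the vertex files are dense, the O(1) lookup is just a seek to a fixed offset. A hedged sketch; the file layout, the 4-byte values and the RandomAccessFile choice are illustrative assumptions:

object VertexValueFileSketch {
  import java.io.RandomAccessFile

  val ValueSize = 4L  // assumed: one 4-byte float per vertex

  // Dense layout: the value of internal vertex id v lives at offset v * ValueSize.
  def readValue(file: RandomAccessFile, internalId: Long): Float = {
    file.seek(internalId * ValueSize)
    file.readFloat()
  }

  def writeValue(file: RandomAccessFile, internalId: Long, value: Float): Unit = {
    file.seek(internalId * ValueSize)
    file.writeFloat(value)
  }
}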

ID-mapping

• Vertex IDs mapped to internal IDs to balance shards
– Interval length: constant a

[Figure: internal IDs 0, 1, 2, …; a, a+1, a+2, …; 2a, 2a+1, …; … — original IDs 0, 1, 2, …, 255 and 256, 257, 258, …, 511 interleaved across the intervals]

109
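One way to read the figure is as a round-robin assignment that spreads consecutive original IDs evenly over the intervals. The sketch below encodes that reading; the formula and the constants P and a are illustrative assumptions, not necessarily the exact scheme used by GraphChi-DB:

object IdMappingSketch {
  val P: Long = 256        // assumed number of intervals
  val a: Long = 1L << 20   // assumed interval length (the constant 'a' from the slide)

  // Consecutive original IDs land in different intervals, keeping shards balanced.
  def toInternal(orig: Long): Long = (orig % P) * a + (orig / P)

  def toOriginal(internal: Long): Long = (internal % a) * P + (internal / a)
}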

What if we have a Cluster

[Figure: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state, …) in RAM]

Trade latency for throughput!

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications

2. … which is caused by lack of good data available for academics: big graphs with metadata
– Industry co-operation? But problems with reproducibility
– Also, it is hard to ask good questions about graphs (especially with just structure)

3. Too much focus on performance. More important to enable "extracting value"!

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk a time (multiple in parallel, of course):

parfor walk in walks:
  for i = 1 to numsteps:
    vertex = walk.atVertex()
    walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java 2009

Distributed graph systems:
– Each hop across a partition boundary is costly

Disk-based "single-machine" graph systems:
– "Paging" from disk is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
– Reverse thinking!

ForEach interval p:
  walkSnapshot = getWalksForInterval(p)
  ForEach vertex in interval(p):
    mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
    ForEach walk in mywalks:
      walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk!
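A sketch of what "only the current position" can look like in practice, under assumptions of my own; the single-Long encoding and the bucket-per-interval layout are illustrative, not the exact DrunkardMob data structures:

object WalkManagerSketch {
  import scala.collection.mutable.ArrayBuffer

  // A walk is packed into one Long: upper bits = walk id, lower 28 bits = current vertex.
  private val VertexBits = 28
  private val VertexMask = (1L << VertexBits) - 1

  def encode(walkId: Long, vertex: Long): Long = (walkId << VertexBits) | (vertex & VertexMask)
  def walkIdOf(walk: Long): Long = walk >>> VertexBits
  def vertexOf(walk: Long): Long = walk & VertexMask

  class WalkManager(numIntervals: Int, intervalLength: Long) {
    // One bucket of pending walks per vertex interval.
    private val buckets = Array.fill(numIntervals)(new ArrayBuffer[Long]())

    def addHop(walk: Long, nextVertex: Long): Unit = {
      val p = (nextVertex / intervalLength).toInt
      buckets(p) += encode(walkIdOf(walk), nextVertex)
    }

    // Snapshot the walks currently waiting in interval p; each walk is
    // re-added (via addHop) at whatever vertex it moves to next.
    def getWalksForInterval(p: Int): Seq[Long] = {
      val snapshot = buckets(p).toSeq
      buckets(p).clear()
      snapshot
    }
  }
}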

57

Efficient Ingest

Shards on Disk

Buffers in RAM

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 57: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

58

Merging Buffers to Disk

Shards on Disk

Buffers in RAM

59

Merging Buffers to Disk (2)

Shards on Disk

Buffers in RAM

Merge requires loading existing shard from disk Each edge will be rewritten always on a merge

Does not scale number of rewrites data size buffer size = O(E)

60

Log-Structured Merge-tree

(LSM)Ref OrsquoNeil Cheng et al (1996)

Top shards with in-memory buffers

On-disk shards

LEVEL 1 (youngest)

Downstream merge

interval 1 interval 2 interval 3 interval 4 interval PInterval P-1

Intervals 1--4 Intervals (P-4) -- P

Intervals 1--16

New edges

LEVEL 2

LEVEL 3(oldest)

In-edge query One shard on each level

Out-edge queryAll shards

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
– But needs twice the amount of space

• Async / Gauss-Seidel helps with high-diameter graphs
• Some algorithms converge much better asynchronously
– Loopy BP: see Gonzalez et al. (2009)
– Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)

• Asynchronous is sometimes difficult to reason about and debug

• Asynchronous can be used to implement BSP
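A toy PageRank-style update in both styles makes the space point concrete (a sketch, not GraphChi code; inNbrs and outDeg are assumed accessors, and every vertex is assumed to have out-degree >= 1):

  // Jacobi / bulk-synchronous: reads come from the previous iteration's copy,
  // so two O(V) arrays are needed ("twice the amount of space").
  def jacobiStep(prev: Array[Double], inNbrs: Int => Seq[Int], outDeg: Int => Int): Array[Double] =
    Array.tabulate(prev.length) { v =>
      0.15 / prev.length + 0.85 * inNbrs(v).map(u => prev(u) / outDeg(u)).sum
    }

  // Gauss-Seidel / asynchronous: updates in place, so vertices processed later
  // in the same iteration already see the new values of earlier ones,
  // which often converges in fewer iterations.
  def gaussSeidelStep(x: Array[Double], inNbrs: Int => Seq[Int], outDeg: Int => Int): Unit =
    for (v <- x.indices)
      x(v) = 0.15 / x.length + 0.85 * inNbrs(v).map(u => x(u) / outDeg(u)).sum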

IO Complexity

• See the paper for theoretical analysis in the Aggarwal–Vitter I/O model
– Worst case is only 2x the best case

• Intuition:

[Figure: vertex intervals interval(1), interval(2), …, interval(P) over the range 1…|V|, with shards shard(1), shard(2), …, shard(P); example edge (v1, v2)]

– An edge whose endpoints lie in the same interval is loaded from disk only once per iteration
– An edge spanning two intervals is loaded twice per iteration
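As a rough reading of this intuition (a back-of-the-envelope sketch, not the paper's exact theorem statement), the per-iteration cost in the Aggarwal–Vitter model is on the order of

\[
\Theta\!\left(\frac{|E|}{B}\right)\ \text{sequential block transfers}
\;+\;
\Theta(P^2)\ \text{non-sequential reads},
\]

since every edge (and its value) is read once or twice and written back per iteration, and each of the P intervals reads one sliding window from each shard; the worst case (every edge spanning two intervals) is therefore at most about twice the best case.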

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; a large diameter would require too many iterations

• Per-iteration cost is not much affected

• Can we optimize partitioning?
– Could help thanks to Gauss-Seidel (faster convergence inside "groups"); topological sort
– Likely too expensive to do on a single PC

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., the WebGraph framework) can be used to compress edges to 3–4 bits/edge (web), ~10 bits/edge (social)
– But they require graph partitioning, which requires a lot of memory
– Compression of large graphs can take days (personal communication)

• Compression is problematic for evolving graphs and their associated data

• GraphChi can be used to compress graphs
– Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases
– GOOD, GraphDB, HyperGraphDB, …
– Focus was on modeling; graph storage was built on top of a relational DB or a key-value store

• RDF databases
– Most do not use graph storage but store triples as relations + use indexing

• Modern solutions have proposed graph-specific storage
– Neo4j: doubly linked lists
– TurboGraph: adjacency lists chopped into pages
– DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time: 9 hours; FB's: 12 hours
• GraphChi database: about 250 GB; FB's: > 1.4 terabytes
– However, about 100 GB of the difference is explained by different variable-data (payload) sizes

• Facebook accesses MySQL via JDBC; GraphChi-DB is embedded
– But MySQL is native code, GraphChi-DB is Scala (JVM)

• Important: a CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance / size of DB

[Chart: requests/sec (0–10,000) vs. number of nodes (1.00E+07, 1.00E+08, 2.00E+08, 1.00E+09)]

Why not even faster when everything is in RAM? The requests are computationally bound.

Possible Solutions

1. Use SSD as a memory extension [SSDAlloc, NSDI'11]
   → Too many small objects; would need millions of accesses per second

2. Compress the graph structure to fit into RAM [WebGraph framework]
   → Associated values do not compress well, and are mutated

3. Cluster the graph and handle each cluster separately in RAM
   → Expensive; the number of inter-cluster edges is big

4. Caching of hot nodes
   → Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks: parallel I/O

[Chart: seconds per iteration (0–3000) for Pagerank and Connected components with 1, 2, and 3 hard drives]

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

[Chart: Connected Components on Mac Mini (SSD) – runtime (0–2500 s) for 1, 2, and 4 threads, broken down into Disk IO, Graph construction, and Exec. updates]

• The cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
– Graph construction requires a lot of random access in RAM; memory bandwidth becomes a bottleneck

Bottlenecks: Multicore

Experiment on a MacBook Pro with 4 cores and an SSD.

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime in seconds (loading vs. computation) as a function of the number of threads (1, 2, 4), for Matrix Factorization (ALS, 0–160 s) and Connected Components (0–1000 s)]

104

In-memory vs Disk

105

Experiment: Query latency

See thesis for I/O cost analysis of in/out queries

106

Example: Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S

• Very fast in GraphChi-DB
– Sufficient to query for out-edges
– Parallelizes well: multi-out-edge-query

• Can be used for statistical graph analysis
– Sample induced neighborhoods, induced FoF neighborhoods from the graph
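A sketch of why out-edge queries suffice (outEdges is a hypothetical helper, not the actual GraphChi-DB API):

  // Induced subgraph of S: query the out-edges of each v in S and keep only the
  // edges whose other endpoint is also in S. Every qualifying edge (v, w) is
  // found exactly once, at its source v; the per-vertex queries are independent,
  // which is what makes the multi-out-edge-query parallelize well.
  def inducedSubgraph(s: Set[Long], outEdges: Long => Seq[Long]): Seq[(Long, Long)] =
    s.toSeq.flatMap(v => outEdges(v).filter(s.contains).map(w => (v, w)))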

Vertices / Nodes

• Vertices are partitioned similarly to edges
– Similar "data shards" for the columns

• Lookup/update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
– Sparse structure could be supported

[Figure: vertex data partitioned into intervals 1…a, a+1…2a, …, up to Max-id]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance the shards
– Interval length constant a

[Table: original IDs 0, 1, 2, …, 255 map to internal IDs 0, 1, 2, …; original IDs 256, 257, 258, …, 511 map to a, a+1, a+2, …; the next block of 256 maps to 2a, 2a+1, …, and so on]
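One concrete way to realize such a balancing map, consistent with the rows shown above (the exact scheme is an assumption; block size 256, P intervals of length a):

  // Blocks of 256 consecutive original IDs are dealt round-robin to the P intervals,
  // so every interval receives an (almost) equal share of the vertex ID space.
  def toInternalId(orig: Long, a: Long, numIntervals: Int): Long = {
    val block    = orig / 256             // which 256-ID block the original ID is in
    val offset   = orig % 256             // position inside that block
    val interval = block % numIntervals   // round-robin choice of interval
    val slot     = block / numIntervals   // how many blocks this interval already holds
    interval * a + slot * 256 + offset
  }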

109

What if we have a Cluster?

[Figure: as before, the graph and PSW working memory on each machine's disk; computational states (Computation 1 state, Computation 2 state, …) in RAM]

Trade latency for throughput.

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications

2. … which is caused by the lack of good data available to academics: big graphs with metadata
– Industry co-operation? But there is a problem with reproducibility
– Also, it is hard to ask good questions about graphs (especially with just the structure)

3. Too much focus on performance. It is more important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems: "paging" from disk is costly. (This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
– Reverse thinking:

foreach interval p:
    walkSnapshot = getWalksForInterval(p)
    foreach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        foreach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: only the current position of each walk needs to be stored.
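For example (an illustration of this note, not necessarily DrunkardMob's exact encoding), a walk's whole state can be packed into one primitive, so billions of walks fit in a few gigabytes of RAM:

  // Pack (walk/source id, current vertex) into a single Long.
  def encodeWalk(sourceId: Int, currentVertex: Int): Long =
    (sourceId.toLong << 32) | (currentVertex.toLong & 0xffffffffL)

  def walkSource(w: Long): Int = (w >>> 32).toInt
  def walkVertex(w: Long): Int = (w & 0xffffffffL).toInt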

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • "Gauss-Seidel" / Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW & External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM), Ref: O'Neil, Cheng et al. (1996)
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact: "Big Data" Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 60: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

61

Experiment Ingest

0 2000 4000 6000 8000 10000 12000 14000 1600000E+00

50E+08

10E+09

15E+09

GraphChi-DB with LSM GraphChi-No-LSM

Time (seconds)

Edge

s

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
– Store billions of random-walk states in RAM

• Multiple Breadth-First Searches
– Analyze neighborhood sizes by starting hundreds of random BFSes (see the sketch below)

• Compute many different recommender algorithms in parallel (or the same one with different parameterizations)
– See Mayank Mohta and Shu-Hao Yu's Master's project
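The multi-BFS example could look like the following sketch (illustrative, not thesis code): each vertex keeps a bit set of the searches that have reached it, the whole array stays in RAM, and one sweep over the edges (one PSW iteration over disk) advances all BFSes at once. The 64-search limit is only an artifact of using a single Long per vertex; hundreds of searches would use a wider bit set.

    // Sketch: up to 64 BFSes advanced together; O(V) state in RAM.
    object MultiBFS {
      def run(numVertices: Int, edges: Seq[(Int, Int)], sources: Seq[Int], hops: Int): Array[Long] = {
        require(sources.length <= 64)
        val reached = new Array[Long](numVertices)   // bit i set at v: BFS i has reached v
        sources.zipWithIndex.foreach { case (s, i) => reached(s) |= 1L << i }
        for (_ <- 1 to hops)
          edges.foreach { case (src, dst) => reached(dst) |= reached(src) }  // one edge sweep
        reached
      }
    }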

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin). UAI 2010

Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors). VLDB 2012

GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch). OSDI 2012

DrunkardMob: Billions of Random Walks on Just a PC. ACM RecSys 2013

Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch). SEA 2014

GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin). (submitted, arXiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin). ICML 2011

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation: GraphChi

• Extended PSW to design Partitioned Adjacency Lists for building a scalable graph database: GraphChi-DB

• Proposed DrunkardMob for simulating billions of random walks in parallel

• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms: a new approach for external-memory graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                     GraphChi (40 Mac Minis)    PowerGraph (64 EC2 cc1.4xlarge)
Investments          67,320 $                   -
Operating costs
  Per node / hour    0.03 $                     1.30 $
  Cluster / hour     1.19 $                     52.00 $
  Daily              28.56 $                    1,248.00 $

It takes about 56 days to recoup the Mac Mini investments.

Assumptions:
- Mac Mini: 85 W (typical servers: 500-1000 W)
- Most expensive US energy: 35c / kWh

Equal-throughput configurations (based on OSDI'12).
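As a back-of-the-envelope check using only the numbers above: the cluster costs about 1,248.00 $ - 28.56 $ ≈ 1,219 $ more per day than the Mac Minis, and 67,320 $ / 1,219 $ per day ≈ 55 days, i.e. roughly the quoted 56 days to recoup the hardware investment.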

89

PSW for In-memory Computation

• External-memory setting
– Slow memory = hard disk / SSD
– Fast memory = RAM

• In-memory
– Slow = RAM
– Fast = CPU caches

[Diagram: CPU connected to a small (fast) memory of size M, which exchanges block transfers of size B with a large (slow) memory of 'unlimited' size]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW slightly outperforms Ligra*, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

* Julian Shun, Guy Blelloch, PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
– But needs twice the amount of space

• Async / G-S helps with high-diameter graphs
• Some algorithms converge much better asynchronously
– Loopy BP, see Gonzalez et al. (2009)
– Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)

• Asynchronous is sometimes difficult to reason about and debug

• Asynchronous can be used to implement BSP

I/O Complexity

• See the paper for theoretical analysis in the Aggarwal-Vitter I/O model
– Worst case is only 2x the best case

• Intuition:

[Diagram: vertex intervals interval(1), interval(2), …, interval(P) over vertex IDs 1 … |V|, each with its shard(1), shard(2), …, shard(P)]

An edge inside one interval is loaded from disk only once per iteration; an edge spanning two intervals is loaded twice per iteration.

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise they would require too many iterations
• Per-iteration cost is not much affected
• Can we optimize partitioning?
– Could help thanks to Gauss-Seidel (faster convergence inside "groups"); topological sort
– Likely too expensive to do on a single PC

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., the WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
– But they require graph partitioning, which requires a lot of memory
– Compression of large graphs can take days (personal communication)

• Compression is problematic for evolving graphs and associated data

• GraphChi can be used to compress graphs
– Layered label propagation (Boldi et al., 2011)

95

Previous research on (single-computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases
– GOOD, GraphDB, HyperGraphDB…
– Focus was on modeling graph storage on top of a relational DB or key-value store

• RDF databases
– Most do not use graph storage but store triples as relations + use indexing

• Modern solutions have proposed graph-specific storage
– Neo4j: doubly linked list
– TurboGraph: adjacency list chopped into pages
– DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont.)

• GraphChi load time 9 hours, FB's 12 hours
• GraphChi database about 250 GB, FB > 1.4 terabytes
– However, about 100 GB of the difference is explained by different variable-data (payload) sizes

• Facebook: MySQL via JDBC; GraphChi embedded
– But MySQL is native code, GraphChi-DB Scala (JVM)

• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance / Size of DB

[Chart: requests / sec (0 - 10,000) as a function of the number of nodes (1.00E+07 to 1.00E+09)]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory-extension [SSDAlloc, NSDI'11]
   – Too many small objects; need millions / sec

2. Compress the graph structure to fit into RAM [WebGraph framework]
   – Associated values do not compress well and are mutated

3. Cluster the graph and handle each cluster separately in RAM
   – Expensive; the number of inter-cluster edges is big

4. Caching of hot nodes
   – Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance

Multiple hard drives (RAID-ish)

• GraphChi supports striping shards to multiple disks: parallel I/O

[Chart: seconds / iteration (0 - 3000) for Pagerank and Connected Components with 1, 2 and 3 hard drives]

Experiment on a 16-core AMD server (from year 2007).

Bottlenecks

[Chart: runtime (0 - 2500 s) broken into Disk IO / Graph construction / Exec updates for 1, 2 and 4 threads; Connected Components on Mac Mini, SSD]

• Cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
– Graph construction requires a lot of random access in RAM; memory bandwidth becomes a bottleneck

Bottlenecks: Multicore

Experiment on MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime in seconds, split into Loading and Computation, vs. number of threads (1, 2, 4); Matrix Factorization (ALS), 0 - 160 s, and Connected Components, 0 - 1000 s]

104

In-memory vs Disk

105

Experiment: Query latency

See the thesis for an I/O cost analysis of in/out queries.

106

Example: Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S

• Very fast in GraphChi-DB
– Sufficient to query for out-edges (see the sketch below)
– Parallelizes well: multi-out-edge query

• Can be used for statistical graph analysis
– Sample induced neighborhoods, induced FoF neighborhoods from the graph
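A minimal sketch of why out-edges suffice (illustrative names, not the GraphChi-DB query API): ask each vertex in S for its out-edges and keep only the targets that are also in S; in GraphChi-DB this would be issued as one parallel multi-out-edge query.

    // Sketch: induced subgraph of a vertex set S from out-edge queries only.
    object InducedSubgraph {
      def query(s: Set[Long], outEdges: Long => Seq[Long]): Seq[(Long, Long)] =
        s.toSeq.flatMap(src => outEdges(src).filter(s.contains).map(dst => (src, dst)))
    }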

Vertices / Nodes

• Vertices are partitioned similarly to edges
– Similar "data shards" for columns

• Lookup / update of vertex data is O(1) (see the sketch below)
• No merge tree here; vertex files are "dense"
– Sparse structure could be supported

[Diagram: vertex ID ranges 1 … a, a+1 … 2a, …, up to Max-id]
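The O(1) lookup follows from the dense, fixed-width layout. A sketch (illustrative, not the actual storage code) of looking a record up by pure offset arithmetic:

    import java.io.RandomAccessFile

    // Sketch: dense fixed-width vertex column file => O(1) lookup/update by offset.
    class VertexColumn(path: String, recordBytes: Int) {
      private val file = new RandomAccessFile(path, "rw")
      def read(internalId: Long): Array[Byte] = {
        val buf = new Array[Byte](recordBytes)
        file.seek(internalId * recordBytes)      // no index or merge tree needed
        file.readFully(buf)
        buf
      }
      def write(internalId: Long, record: Array[Byte]): Unit = {
        require(record.length == recordBytes)
        file.seek(internalId * recordBytes)
        file.write(record)
      }
    }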

ID-mapping

• Vertex IDs are mapped to internal IDs to balance the shards
– Interval length constant a (see the sketch below)

[Diagram: original IDs 0, 1, 2, …, 255 map to internal IDs 0, a, 2a, …; original IDs 256, 257, 258, …, 511 map to 1, a+1, 2a+1, …; and so on]
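One plausible way to realize the striping in the figure (an assumption about the exact formula, not taken from the GraphChi-DB code): with P intervals of length a, consecutive original IDs are dealt round-robin to the intervals, so heavy low-ID vertices spread evenly over the shards.

    // Sketch: round-robin ID mapping over P intervals of length a (assumed formula).
    object IdMapping {
      def toInternal(orig: Long, p: Long, a: Long): Long = (orig % p) * a + orig / p
      def toOriginal(internal: Long, p: Long, a: Long): Long = (internal % a) * p + internal / a
    }

For example, with P = 256 this sends original IDs 0, 1, 2, … to internal IDs 0, a, 2a, … and 256, 257, 258, … to 1, a+1, 2a+1, …, matching the figure.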

109

What if we have a Cluster?

[Diagram: each machine keeps the graph working memory (PSW) on disk and the computational state (Computation 1 state, Computation 2 state, …) in RAM]

Trade latency for throughput

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications

2. … which is caused by the lack of good data available to academics: big graphs with metadata
– Industry co-operation? But there is a problem with reproducibility
– Also, it is hard to ask good questions about graphs (especially with just the structure)

3. Too much focus on performance. More important to enable "extracting value"

DrunkardMob - RecSys '13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys '13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java, 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems: "paging" from disk is costly.

(This talk)

DrunkardMob - RecSys '13

Random walks in GraphChi

• DrunkardMob algorithm
– Reverse thinking

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: only the current position of each walk needs to be stored (see the sketch below).
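A compact sketch of the bookkeeping behind the note above (illustrative, not the DrunkardMob source): each walk is kept only as (walk id, current vertex), bucketed by the interval of its current vertex, so the loop over intervals touches exactly the walks that are currently 'at home' there.

    import scala.collection.mutable.ArrayBuffer

    // Sketch: walks bucketed by the interval of their current position.
    class WalkManager(numIntervals: Int, intervalOf: Int => Int) {
      private val buckets = Array.fill(numIntervals)(ArrayBuffer.empty[(Long, Int)])

      def addHop(walkId: Long, toVertex: Int): Unit =
        buckets(intervalOf(toVertex)) += ((walkId, toVertex))

      // Snapshot and clear an interval's walks before processing that interval.
      def snapshot(interval: Int): Seq[(Long, Int)] = {
        val walks = buckets(interval).toList
        buckets(interval).clear()
        walks
      }
    }

The outer loop over intervals would call snapshot(p) and then, for each vertex in the interval, advance the walks positioned at that vertex with addHop, mirroring the pseudocode above (the real system additionally records the visits, e.g. for computing visit counts).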

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • "Gauss-Seidel" / Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW & External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref: O'Neil, Cheng et al. (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact: "Big Data" Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 61: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

62

Advantages of PAL

bull Only sparse and implicit indicesndash Pointer-array usually fits in RAM with Elias-

Gamma Small database size

bull Columnar data modelndash Load only data you needndash Graph structure is separate from datandash Property graph model

bull Great insertion throughput with LSMTree can be adjusted to match workload

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 62: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

63

EXPERIMENTAL COMPARISONS

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 63: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

64

GraphChi-DB Implementation

bull Written in Scalabull Queries amp Computationbull Online database

All experiments shown in this talk done on Mac Mini (8 GB SSD)

Source code and exampleshttpgithubcomgraphchi

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

(Diagram: the RAM holds the computational state — Computation 1 state, Computation 2 state, … — while the graph working memory (PSW) stays on disk.)

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
  – Store billions of random-walk states in RAM
• Multiple Breadth-First Searches
  – Analyze neighborhood sizes by starting hundreds of random BFSes (sketched below)
• Compute in parallel many different recommender algorithms (or with different parameterizations)
  – See Mayank Mohta, Shu-Hao Yu's Master's project
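The multi-BFS item above can be sketched as follows (an illustrative assumption, not the thesis code): up to 64 searches share one O(V) array by giving each BFS a single bit of a per-vertex mask, so hundreds of BFSes amount to a handful of such arrays in RAM.

    import java.util.Random;

    public class MultiBFS {
        public static void main(String[] args) {
            // Tiny in-memory example graph (adjacency lists).
            int[][] adj = {{1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3, 5}, {4}};
            int n = adj.length, numBfs = 3, maxHops = 3;

            long[] reached = new long[n];               // bit b set => BFS b has reached this vertex
            Random rnd = new Random(42);
            for (int b = 0; b < numBfs; b++)
                reached[rnd.nextInt(n)] |= 1L << b;     // random start vertex for each BFS

            for (int hop = 1; hop <= maxHops; hop++) {
                long[] next = reached.clone();
                for (int v = 0; v < n; v++)
                    for (int nb : adj[v]) next[nb] |= reached[v];   // expand all BFSes by one hop
                reached = next;
                for (int b = 0; b < numBfs; b++) {
                    int size = 0;
                    for (long mask : reached) if ((mask >>> b & 1L) == 1L) size++;
                    System.out.println("BFS " + b + ": " + size + " vertices within " + hop + " hops");
                }
            }
        }
    }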

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin) — UAI 2010
Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors) — VLDB 2012
GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch) — OSDI 2012
DrunkardMob: Billions of Random Walks on Just a PC — ACM RecSys 2013
Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch) — SEA 2014
GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin) — (submitted, arXiv)
Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin) — ICML 2011

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation: GraphChi
• Extended PSW to design Partitioned Adjacency Lists, to build a scalable graph database: GraphChi-DB
• Proposed DrunkardMob for simulating billions of random walks in parallel
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms: new approach for external-memory graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) vs. PowerGraph (64 EC2 cc1.4xlarge):
– Investment: $67,320 vs. –
– Operating costs, per node / hour: $0.03 vs. $1.30
– Cluster / hour: $1.19 vs. $52.00
– Daily: $28.56 vs. $1,248.00

It takes about 56 days to recoup the Mac Mini investment.

Assumptions: Mac Mini 85 W (typical servers 500–1000 W); most expensive US energy 35 c / kWh; equal-throughput configurations (based on OSDI '12).
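As a back-of-the-envelope check of the 56-day figure (my arithmetic, not a slide from the talk): the daily operating-cost difference is $1,248.00 − $28.56 ≈ $1,219, so the $67,320 hardware investment is recouped after roughly 67,320 / 1,219 ≈ 55 days of continuous operation.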

89

PSW for In-memory Computation

• External-memory setting
  – Slow memory = hard disk / SSD
  – Fast memory = RAM
• In-memory
  – Slow = RAM
  – Fast = CPU caches

(Diagram: the external-memory model — a small, fast memory of size M, a large, slow memory of 'unlimited' size, and block transfers of size B between them, driven by the CPU.)

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

(Ligra: Julian Shun, Guy Blelloch, PPoPP 2013)

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
  – But needs twice the amount of space (sketched below)
• Async / Gauss-Seidel helps with high-diameter graphs
• Some algorithms converge much better asynchronously
  – Loopy BP, see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)
• Asynchronous sometimes difficult to reason about and debug
• Asynchronous can be used to implement BSP
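A minimal sketch of the space point above, in plain Java (illustrative, not GraphChi code): the bulk-synchronous (Jacobi) step reads old values and writes new ones, so it needs two arrays, while the asynchronous (Gauss-Seidel) step updates a single array in place and immediately sees fresh values.

    import java.util.Arrays;

    public class SyncVsAsync {
        // Toy graph: propagate the minimum label among neighbors (connected components style).
        static int[][] adj = {{1}, {0, 2}, {1, 3}, {2}, {5}, {4}};

        // Bulk-synchronous (Jacobi): reads old values, writes new ones -> two arrays.
        static void syncStep(int[] oldLabel, int[] newLabel) {
            for (int v = 0; v < adj.length; v++) {
                int m = oldLabel[v];
                for (int nb : adj[v]) m = Math.min(m, oldLabel[nb]);
                newLabel[v] = m;
            }
        }

        // Asynchronous (Gauss-Seidel): updates one array in place, sees fresh values at once.
        static void asyncStep(int[] label) {
            for (int v = 0; v < adj.length; v++) {
                int m = label[v];
                for (int nb : adj[v]) m = Math.min(m, label[nb]);
                label[v] = m;
            }
        }

        public static void main(String[] args) {
            int[] label = {0, 1, 2, 3, 4, 5};
            asyncStep(label);                            // a single in-place sweep
            System.out.println(Arrays.toString(label));  // labels already propagated through the chain
        }
    }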

IO Complexity

• See the paper for a theoretical analysis in Aggarwal–Vitter's I/O model
  – Worst case is only 2x the best case
• Intuition:

(Diagram: the vertex range 1…|V| is split into interval(1)…interval(P) with shard(1)…shard(P).)

An edge whose endpoints fall in the same interval is loaded from disk only once per iteration; an edge spanning two intervals is loaded twice per iteration — so every edge costs at most two loads per iteration, which is where the "worst case only 2x best case" comes from.

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter: otherwise too many iterations would be required
• Per-iteration cost not much affected
• Can we optimize partitioning?
  – Could help thanks to Gauss-Seidel (faster convergence inside 'groups'): topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression Would it help

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3–4 bits / edge (web), ~10 bits / edge (social)
  – But they require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)
• Compression problematic for evolving graphs and associated data
• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• 1990s, 2000s saw interest in object-oriented and graph databases
  – GOOD, GraphDB, HyperGraphDB…
  – Focus was on modeling; graph storage on top of a relational DB or key-value store
• RDF databases
  – Most do not use graph storage, but store triples as relations + use indexing
• Modern solutions have proposed graph-specific storage
  – Neo4j: doubly linked list
  – TurboGraph: adjacency list chopped into pages
  – DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time: 9 hours; FB's: 12 hours
• GraphChi database about 250 GB; FB > 1.4 terabytes
  – However, about 100 GB is explained by different variable-data (payload) size
• Facebook: MySQL via JDBC; GraphChi: embedded
  – But MySQL is native code, GraphChi-DB Scala (JVM)
• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance / size of DB

(Chart: requests / sec (0–10,000) as a function of the number of nodes: 1.00E+07, 1.00E+08, 2.00E+08, 1.00E+09.)

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory extension [SSDAlloc, NSDI '11]
   → Too many small objects; need millions / sec
2. Compress the graph structure to fit into RAM [WebGraph framework]
   → Associated values do not compress well, and are mutated
3. Cluster the graph and handle each cluster separately in RAM
   → Expensive: the number of inter-cluster edges is big
4. Caching of hot nodes
   → Unpredictable performance

Number of Shards

• If P is in the 'dozens', there is not much effect on performance

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks: parallel I/O

(Chart: seconds / iteration for Pagerank and Connected components with 1, 2, and 3 hard drives; y-axis 0–3000. Experiment on a 16-core AMD server from year 2007.)

Bottlenecks

(Chart: Connected Components on Mac Mini, SSD; runtime with 1, 2, and 4 threads, split into Disk IO / Graph construction / Exec. updates; y-axis 0–2500 seconds.)

• Cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
  – Graph construction requires a lot of random access in RAM; memory bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores / SSD.

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD I/O with 2 threads

(Charts: runtime in seconds, split into Loading and Computation, for 1, 2, and 4 threads; Matrix Factorization (ALS), y-axis 0–160, and Connected Components, y-axis 0–1000.)

104

In-memory vs Disk

105

Experiment: Query latency

See thesis for I/O cost analysis of in/out queries.

106

Example Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S
• Very fast in GraphChi-DB
  – Sufficient to query for out-edges (see the sketch below)
  – Parallelizes well: multi-out-edge-query
• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods from the graph
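A small sketch of why out-edge queries suffice (helper names are hypothetical, not GraphChi-DB's real API): for every vertex u in S, fetch u's out-edges and keep exactly those whose destination is also in S.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    public class InducedSubgraph {
        // Stand-in for GraphChi-DB's out-edge query (hypothetical signature).
        static int[] outEdges(int vertexId) {
            return new int[0];   // ... would query the database ...
        }

        // Induced subgraph of vertex set S: out-edge queries are enough, because every
        // edge (u, w) with both endpoints in S is seen when u's out-edges are scanned.
        static List<int[]> inducedEdges(Set<Integer> S) {
            List<int[]> edges = new ArrayList<>();
            for (int u : S)
                for (int w : outEdges(u))        // could be issued as one multi-out-edge query
                    if (S.contains(w)) edges.add(new int[]{u, w});
            return edges;
        }
    }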

Vertices / Nodes

• Vertices are partitioned similarly as edges
  – Similar 'data shards' for columns
• Lookup/update of vertex data is O(1)
• No merge tree here: vertex files are 'dense'
  – Sparse structure could be supported

(Diagram: vertex ID ranges 1…a, a+1…2a, …, up to Max-id.)
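A sketch of the O(1) lookup above (illustrative; the record size and file layout are assumptions): because the vertex file is dense and records have a fixed size, the byte offset of a vertex's data is simply internalId × recordSize.

    import java.io.RandomAccessFile;

    public class VertexFile {
        static final int RECORD_BYTES = 4;   // assumption: one float value per vertex

        static float read(RandomAccessFile f, long internalId) throws Exception {
            f.seek(internalId * RECORD_BYTES);    // dense layout -> direct offset, O(1)
            return f.readFloat();
        }

        static void write(RandomAccessFile f, long internalId, float v) throws Exception {
            f.seek(internalId * RECORD_BYTES);
            f.writeFloat(v);
        }

        public static void main(String[] args) throws Exception {
            try (RandomAccessFile f = new RandomAccessFile("vertexdata.bin", "rw")) {
                write(f, 42, 3.14f);
                System.out.println(read(f, 42));
            }
        }
    }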

ID-mapping

• Vertex IDs mapped to internal IDs to balance shards
  – Interval length constant a

(Diagram: original IDs 0, 1, 2, …, 255; 256, 257, 258, …, 511; … shown against internal IDs 0, 1, 2, …; a, a+1, a+2, …; 2a, 2a+1, …)
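For illustration, one simple mapping with the stated property — original IDs spread evenly over P intervals of constant length a — could look like the following; this is an assumption for exposition, not necessarily the exact scheme GraphChi-DB uses.

    public class IdMapping {
        // interval = origId mod P; offset inside the interval = origId div P
        static long toInternal(long origId, long a, int P) {
            return (origId % P) * a + (origId / P);
        }

        static long toOriginal(long internalId, long a, int P) {
            return (internalId % a) * P + (internalId / a);
        }

        public static void main(String[] args) {
            long a = 1000; int P = 4;                    // 4 intervals of length 1000
            for (long id : new long[]{0, 1, 2, 3, 4, 5})
                System.out.println(id + " -> " + toInternal(id, a, P)
                        + " -> back to " + toOriginal(toInternal(id, a, P), a, P));
        }
    }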

109

What if we have a Cluster

(Diagram: as in the semi-external setting, each machine keeps the computational state — Computation 1 state, Computation 2 state, … — in RAM, and the graph working memory (PSW) on disk.)

Trade latency for throughput

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications
2. … which is caused by lack of good data available for academics: big graphs with metadata
   – Industry co-operation? But problem with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just the structure)
3. Too much focus on performance; more important to enable 'extracting value'

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk a time (multiple in parallel, of course):

    parfor walk in walks:
        for i = 1 to numsteps:
            vertex = walk.atVertex()
            walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

(Image: Twitter network visualization by Akshay Java, 2009.)

• Distributed graph systems: each hop across a partition boundary is costly
• Disk-based 'single-machine' graph systems (this talk): 'paging' from disk is costly

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
  – Reverse thinking:

    ForEach interval p:
        walkSnapshot = getWalksForInterval(p)
        ForEach vertex in interval(p):
            mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
            ForEach walk in mywalks:
                walkManager.addHop(walk, vertex.randomNeighbor())

  Note: need to store only the current position of each walk!
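A compact, runnable sketch of the 'reverse thinking' above, in plain Java with illustrative names (the real DrunkardMob implementation differs): each walk is represented only by its current vertex, and one pass advances, interval by interval, exactly the walks whose snapshot position falls in that interval.

    import java.util.Random;

    public class DrunkardMobSketch {
        public static void main(String[] args) {
            int[][] adj = {{1, 2}, {2}, {0, 3}, {1}};    // toy in-memory graph, standing in for shards
            int numVertices = adj.length, numWalks = 1000, numSteps = 5, intervalLen = 2;
            Random rnd = new Random(1);

            int[] walkAt = new int[numWalks];            // only the current vertex of each walk
            for (int w = 0; w < numWalks; w++) walkAt[w] = rnd.nextInt(numVertices);

            for (int step = 0; step < numSteps; step++) {
                int[] snapshot = walkAt.clone();         // like getWalksForInterval: fix positions for this pass
                for (int start = 0; start < numVertices; start += intervalLen) {
                    int end = Math.min(start + intervalLen, numVertices);
                    for (int w = 0; w < numWalks; w++) {
                        int v = snapshot[w];
                        if (v >= start && v < end) {     // walk currently sits in this interval
                            int[] nbrs = adj[v];
                            walkAt[w] = nbrs[rnd.nextInt(nbrs.length)];   // take one hop
                        }
                    }
                }
            }
            System.out.println("Walk 0 ended at vertex " + walkAt[0]);
        }
    }

In the actual system the walks would be bucketed by interval rather than scanned linearly, and the graph interval processed is the sub-graph PSW currently holds in memory.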

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • "Gauss-Seidel" / Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW & External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref: O'Neil, Cheng et al. (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact: "Big Data" Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 64: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

65

Comparison Database Size

Baseline 4 + 4 bytes edge

MySQL (data + indices)

Neo4j

GraphChi-DB

BASELINE

0 10 20 30 40 50 60 70

Database file size (twitter-2010 graph 15B edges)

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 65: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

66

Comparison Ingest

System Time to ingest 15B edges

GraphChi-DB (ONLINE) 1 hour 45 mins

Neo4j (batch) 45 hours

MySQL (batch) 3 hour 30 minutes (including index creation)

If running Pagerank simultaneously GraphChi-DB takes 3 hour 45 minutes

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 66: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

Comparison Friends-of-Friends Query

See thesis for shortest-path comparison

GraphChi-DB Neo4j0

00501

01502

02503

03504 0379

0127

Small graph - 50-percentile

mill

iseco

nds

GraphChi-DB Neo4j0123456789

6653

8078

Small graph - 99-percentile

mill

iseco

nds

GraphChi-DB

Neo4j MySQL0

40

80

120

160

200

224

7598

59

Big graph - 50-percentile

mill

iseco

nds 0

2000

4000

6000

1264 1631

4776

Big graph - 99-percentile

mill

isec

onds

Latency percentiles over 100K random queries

68M edges

15B edges

68

LinkBench Online Graph DB Benchmark by Facebook

bull Concurrent readwrite workloadndash But only single-hop queries (ldquofriendsrdquo)ndash 8 different operations mixed workloadndash Best performance with 64 parallel threads

bull Each edge and vertex hasndash Version timestamp type random string payload

GraphChi-DB (Mac Mini) MySQL+FB patch server SSD-array 144 GB RAM

Edge-update (95p) 22 ms 25 ms

Edge-get-neighbors (95p)

18 ms 9 ms

Avg throughput 2487 reqs 11029 reqs

Database size 350 GB 14 TB

See full results in the thesis

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation: GraphChi

• Extended PSW to design Partitioned Adjacency Lists, used to build a scalable graph database: GraphChi-DB

• Proposed DrunkardMob for simulating billions of random walks in parallel

• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms: a new approach for external-memory graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                      GraphChi (40 Mac Minis)    PowerGraph (64 EC2 cc1.4xlarge)
  Investments         67,320 $                   -
  Operating costs
    Per node, hour    0.03 $                     1.30 $
    Cluster, hour     1.19 $                     52.00 $
    Daily             28.56 $                    1,248.00 $

It takes about 56 days to recoup the Mac Mini investment.

Assumptions: Mac Mini 85 W (typical servers 500-1000 W); most expensive US energy, 35 c / kWh.

Equal-throughput configurations (based on OSDI'12).

89
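A back-of-the-envelope check of the payback period (a note added here; it assumes the daily operating costs above are the only recurring difference between the two configurations):

  67,320 / (1,248.00 - 28.56) ≈ 55 days,

which is consistent with the slide's "about 56 days".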

PSW for In-memory Computation

• External memory setting
  – Slow memory = hard disk / SSD
  – Fast memory = RAM

• In-memory
  – Slow = RAM
  – Fast = CPU caches

[Diagram: a CPU attached to a small (fast) memory of size M and a large (slow) memory of 'unlimited' size; data moves between them in block transfers of size B.]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra*, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including the preprocessing steps of both systems.

* Julian Shun, Guy Blelloch, PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
  – But needs twice the amount of space

• Async / Gauss-Seidel helps with high-diameter graphs
• Some algorithms converge much better asynchronously
  – Loopy BP: see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)

• Asynchronous is sometimes difficult to reason about and debug

• Asynchronous can be used to implement BSP

IO Complexity

• See the paper for theoretical analysis in Aggarwal and Vitter's I/O model
  – Worst-case only 2x best-case

• Intuition:

[Diagram: the vertex range 1 … |V| split into interval(1), interval(2), …, interval(P), with the corresponding shard(1), shard(2), …, shard(P).]

An edge inside a single interval is loaded from disk only once per iteration.
An edge spanning two intervals is loaded twice per iteration.

93
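One way to read this intuition quantitatively (a rough paraphrase, not the exact theorem; see the thesis for the precise statement and constants): with block size B, a full PSW iteration costs on the order of |E|/B block transfers, roughly between 2|E|/B (each edge read once and written once) and 4|E|/B (each edge read twice and written twice), plus on the order of P^2 non-sequential seeks for the sliding windows. This matches the "worst-case only 2x best-case" remark above.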

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise too many iterations would be required

• Per-iteration cost is not much affected

• Can we optimize partitioning?
  – Could help thanks to Gauss-Seidel (faster convergence inside "groups"); topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., the WebGraph Framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
  – But they require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)

• Compression is problematic for evolving graphs and associated data

• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases
  – GOOD, GraphDB, HyperGraphDB…
  – Focus was on modeling graph storage on top of a relational DB or a key-value store

• RDF databases
  – Most do not use graph storage, but store triples as relations + use indexing

• Modern solutions have proposed graph-specific storage
  – Neo4j: doubly linked list
  – TurboGraph: adjacency list chopped into pages
  – DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time 9 hours; FB's 12 hours
• GraphChi database about 250 GB; FB > 1.4 terabytes
  – However, about 100 GB is explained by different variable data (payload) size

• Facebook: MySQL via JDBC; GraphChi: embedded
  – But MySQL is native code, GraphChi-DB is Scala (JVM)

• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance / Size of DB

[Chart: requests / sec (0 to 10,000) as a function of the number of nodes (1.0E+07, 1.0E+08, 2.0E+08, 1.0E+09).]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory extension [SSDAlloc, NSDI'11]
   – Too many small objects; need millions / sec

2. Compress the graph structure to fit into RAM [WebGraph framework]
   – Associated values do not compress well, and are mutated

3. Cluster the graph and handle each cluster separately in RAM
   – Expensive; the number of inter-cluster edges is big

4. Caching of hot nodes
   – Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks: parallel I/O

[Chart: seconds / iteration (0 to 3000) for Pagerank and Connected Components with 1, 2 and 3 hard drives.]

Experiment on a 16-core AMD server (from year 2007).

Bottlenecks

[Chart: runtime (0 to 2500 s) with 1, 2 and 4 threads, broken down into Disk IO, Graph construction and Exec updates.]

Connected Components on Mac Mini / SSD.

• The cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
  – Graph construction requires a lot of random access in RAM; memory bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

• Computationally intensive applications benefit substantially from parallel execution

• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime in seconds vs. number of threads (1, 2, 4), split into Loading and Computation, for Matrix Factorization (ALS, 0 to 160 s) and Connected Components (0 to 1000 s).]

104

In-memory vs Disk

105

Experiment Query latency

See the thesis for the I/O cost analysis of in/out queries.

106

Example: Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S

• Very fast in GraphChi-DB
  – Sufficient to query for out-edges (a minimal sketch follows below)
  – Parallelizes well: multi-out-edge query

• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods from the graph
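A minimal sketch of the idea (the query interface and names are illustrative assumptions, not the GraphChi-DB API): query the out-edges of every vertex in S and keep the edges whose target is also in S.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class InducedSubgraphSketch {
    static final class Edge { final long src, dst; Edge(long s, long d) { src = s; dst = d; } }

    interface OutEdgeQuery {                 // stand-in for the database's out-edge query
        List<Edge> outEdges(long vertexId);
    }

    static List<Edge> inducedSubgraph(Set<Long> s, OutEdgeQuery db) {
        List<Edge> result = new ArrayList<>();
        // The per-vertex queries are independent, so they parallelize well
        // (the slide's multi-out-edge query).
        for (long v : s)
            for (Edge e : db.outEdges(v))
                if (s.contains(e.dst)) result.add(e);    // keep edges with both endpoints in S
        return result;
    }

    public static void main(String[] args) {
        Map<Long, List<Edge>> adj = Map.of(
                1L, List.of(new Edge(1, 2), new Edge(1, 5)),
                2L, List.of(new Edge(2, 3)),
                3L, List.of());
        OutEdgeQuery db = v -> adj.getOrDefault(v, List.of());
        System.out.println(inducedSubgraph(Set.of(1L, 2L, 3L), db).size());  // prints 2 (1->2, 2->3)
    }
}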

Vertices / Nodes

• Vertices are partitioned similarly to edges
  – Similar "data shards" for columns

• Lookup/update of vertex data is O(1) (a minimal sketch follows below)
• No merge tree here: vertex files are "dense"
  – A sparse structure could be supported

[Diagram: internal vertex IDs 1 … a, a+1 … 2a, …, up to Max-id, one dense file per interval.]
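A sketch of why the lookup is O(1) (the file layout and names are assumptions for illustration, not GraphChi-DB code): a dense vertex column stores one fixed-size record per internal ID of its interval, so reading or writing a value is a single seek at a computed offset.

import java.io.IOException;
import java.io.RandomAccessFile;

public class DenseVertexColumnSketch {
    private static final int RECORD_BYTES = 4;   // one float value per vertex
    private final RandomAccessFile file;
    private final long intervalStart;            // first internal ID stored in this file

    DenseVertexColumnSketch(String path, long intervalStart) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
        this.intervalStart = intervalStart;
    }

    float get(long internalId) throws IOException {
        file.seek((internalId - intervalStart) * RECORD_BYTES);   // O(1): no index, no merge tree
        return file.readFloat();
    }

    void set(long internalId, float value) throws IOException {
        file.seek((internalId - intervalStart) * RECORD_BYTES);
        file.writeFloat(value);
    }
}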

ID-mapping

• Vertex IDs are mapped to internal IDs to balance the shards
  – Interval length is a constant a

[Figure: original IDs 0, 1, 2, …, 255 map to internal IDs 0, a, 2a, …, and original IDs 256, 257, 258, …, 511 map to 1, a+1, 2a+1, …, so consecutive original IDs are spread across intervals.]

109
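The figure suggests a simple stride mapping; the following is a hypothetical sketch of one mapping consistent with it (the exact scheme is not spelled out on the slide, so treat the formula as an assumption): consecutive original IDs are assigned round-robin to the internal-ID intervals of length a.

public class IdMappingSketch {
    // Assumed mapping: original ID v goes to interval (v mod numIntervals),
    // at offset (v / numIntervals) inside that interval.
    static long toInternal(long originalId, long a, long numIntervals) {
        return (originalId % numIntervals) * a + (originalId / numIntervals);
    }

    static long toOriginal(long internalId, long a, long numIntervals) {
        return (internalId % a) * numIntervals + (internalId / a);
    }

    public static void main(String[] args) {
        long a = 1_000, n = 256;                   // interval length and interval count (illustrative)
        System.out.println(toInternal(1, a, n));   // a       (cf. figure: 1 -> a)
        System.out.println(toInternal(257, a, n)); // a + 1   (cf. figure: 257 -> a+1)
        System.out.println(toOriginal(toInternal(257, a, n), a, n)); // 257, round-trips
    }
}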

What if we have a Cluster

[Diagram: on each machine, the graph working memory (PSW) stays on disk, while the computational state (Computation 1 state, Computation 2 state, …) is kept in RAM.]

Trade latency for throughput

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications

2. … which is caused by the lack of good data available to academics: big graphs with metadata
   – Industry co-operation? But problem with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just the structure)

3. Too much focus on performance. More important to enable "extracting value".

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

  parfor walk in walks:
      for i = 1 to numsteps:
          vertex = walk.atVertex()
          walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems (this talk): "paging" from disk is costly.

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm – reverse thinking:

  foreach interval p:
      walkSnapshot = getWalksForInterval(p)
      foreach vertex in interval(p):
          mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
          foreach walk in mywalks:
              walkManager.addHop(walk, vertex.randomNeighbor())

Note: only the current position of each walk needs to be stored (a hedged sketch of a compact walk encoding follows below).
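The note above is what makes billions of walks fit in RAM. The following is a hypothetical illustration (the actual DrunkardMob encoding differs in its details): a walk is packed into a single long holding a walk id and the id of the vertex it currently sits at, and walks are bucketed by the interval that owns that vertex.

public class WalkEncodingSketch {
    // Assumed packing: 24 bits of walk id, 40 bits of current vertex id.
    static long encode(int walkId, long currentVertex) { return ((long) walkId << 40) | currentVertex; }
    static int  walkId(long walk)        { return (int) (walk >>> 40); }
    static long currentVertex(long walk) { return walk & ((1L << 40) - 1); }

    public static void main(String[] args) {
        long w = encode(123, 4_000_000_000L);
        System.out.println(walkId(w));          // 123
        System.out.println(currentVertex(w));   // 4000000000
        // Taking a hop just re-encodes the walk at its new vertex and appends it to the
        // bucket of the interval owning that vertex (cf. walkManager.addHop above).
    }
}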


bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 68: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

69

Summary of Experiments

bull Efficient for mixed readwrite workloadndash See Facebook LinkBench experiments in

thesisndash LSM-tree trade-off read performance (but

can adjust)

bull State-of-the-art performance for graphs that are much larger than RAMndash Neo4Jrsquos linked-list data structure good for

RAMndash DEX performs poorly in practice

More experiments in the thesis

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 69: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

70

Discussion

Greater ImpactHindsight

Future Research Questions

71

GREATER IMPACT

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks: Multicore

Experiment on MacBook Pro with 4 cores / SSD

• Computationally intensive applications benefit substantially from parallel execution

• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime in seconds, split into loading vs. computation, for 1, 2, and 4 threads – Matrix Factorization (ALS) and Connected Components]

104

In-memory vs Disk

105

Experiment: Query latency

See thesis for I/O cost analysis of in/out queries.

106

Example: Induced Subgraph Queries

• Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

• Very fast in GraphChi-DB
– Sufficient to query for out-edges
– Parallelizes well: multi-out-edge-query

• Can be used for statistical graph analysis
– Sample induced neighborhoods, induced FoF neighborhoods from the graph
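To illustrate why out-edge queries suffice, here is a minimal Java sketch of an induced-subgraph query; the outEdges helper, the Edge record, and the in-memory adjacency map are assumptions for the example, not GraphChi-DB's actual API.

import java.util.*;

class InducedSubgraphSketch {

    record Edge(long src, long dst) {}

    // Hypothetical out-edge lookup; a real implementation would issue an out-edge query to the database.
    static List<Long> outEdges(Map<Long, List<Long>> graph, long v) {
        return graph.getOrDefault(v, List.of());
    }

    // Induced subgraph of S: for every v in S, keep only the out-edges whose target is also in S.
    static List<Edge> inducedSubgraph(Set<Long> s, Map<Long, List<Long>> graph) {
        List<Edge> edges = new ArrayList<>();
        for (long v : s) {                          // per-vertex queries are independent -> parallelizes well
            for (long w : outEdges(graph, v)) {
                if (s.contains(w)) edges.add(new Edge(v, w));
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        Map<Long, List<Long>> graph = Map.of(1L, List.of(2L, 3L), 2L, List.of(3L, 4L));
        System.out.println(inducedSubgraph(Set.of(1L, 2L, 3L), graph));  // keeps 1->2, 1->3, 2->3
    }
}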

Vertices / Nodes

• Vertices are partitioned similarly as edges
– Similar "data shards" for columns

• Lookup/update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
– Sparse structure could be supported

[Diagram: vertex file split at internal IDs 1 … a, a+1 … 2a, …, Max-id]
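The O(1) lookup follows from the dense layout: with fixed-size records, a vertex's byte offset is a direct function of its internal ID. A minimal Java sketch, assuming a fixed record size of one long per vertex (the actual on-disk format may differ):

import java.io.IOException;
import java.io.RandomAccessFile;

class VertexFileSketch {
    static final int RECORD_SIZE = 8;   // assumed fixed-width value per vertex

    // Dense layout: one seek to internalId * RECORD_SIZE, no index or merge tree involved.
    static long readValue(RandomAccessFile file, long internalId) throws IOException {
        file.seek(internalId * RECORD_SIZE);
        return file.readLong();
    }

    static void writeValue(RandomAccessFile file, long internalId, long value) throws IOException {
        file.seek(internalId * RECORD_SIZE);
        file.writeLong(value);
    }
}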

ID-mapping

• Vertex IDs mapped to internal IDs to balance shards
– Interval length constant a

[Table: original IDs 0, 1, 2, …, 255; 256, 257, 258, …, 511; … mapped to internal IDs 0, 1, 2, …; a, a+1, a+2, …; 2a, 2a+1, …]
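One simple scheme consistent with the slide's goal of balancing shards is round-robin interleaving of original IDs across internal intervals. The formula, the stride of 256, and the interval length below are assumptions for illustration only, not necessarily the exact GraphChi-DB mapping.

class IdMappingSketch {
    static final long STRIDE = 256;                  // assumed interleaving factor
    static final long INTERVAL_LENGTH = 1_000_000;   // the constant 'a'; value chosen only for illustration

    // Consecutive original IDs are spread across internal ranges 0.., a.., 2a.., ...,
    // so newly created vertices land in different intervals and shards stay balanced.
    // (Valid while originalId / STRIDE < INTERVAL_LENGTH.)
    static long toInternal(long originalId) {
        return (originalId % STRIDE) * INTERVAL_LENGTH + originalId / STRIDE;
    }

    static long toOriginal(long internalId) {
        return (internalId % INTERVAL_LENGTH) * STRIDE + internalId / INTERVAL_LENGTH;
    }
}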

109

What if we have a Cluster?

[Diagram: computational state (Computation 1 state, Computation 2 state, …) in RAM; graph working memory (PSW) on disk]

Trade latency for throughput

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications

2. … which is caused by lack of good data available to academics: big graphs with metadata
– Industry co-operation? But problems with reproducibility
– Also, it is hard to ask good questions about graphs (especially with just structure)

3. Too much focus on performance. More important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if Graph does not fit in memory?

Twitter network visualization by Akshay Java 2009

Distributed graph systems:
– Each hop across partition boundary is costly

Disk-based "single-machine" graph systems:
– "Paging" from disk is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
– Reverse thinking

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk.
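Because only the current position is needed, a walk's state can be packed into a single primitive and bucketed by the interval of its current vertex. The Java sketch below illustrates that idea under assumed encodings (source vertex in the high 32 bits, current vertex in the low 32 bits); it is not the actual DrunkardMob / WalkManager implementation.

import java.util.ArrayList;
import java.util.List;

class WalkManagerSketch {
    private final long intervalLength;           // assumed constant number of vertices per interval
    private final List<List<Long>> buckets;      // walks grouped by the interval of their current vertex

    WalkManagerSketch(int numIntervals, long intervalLength) {
        this.intervalLength = intervalLength;
        this.buckets = new ArrayList<>();
        for (int i = 0; i < numIntervals; i++) buckets.add(new ArrayList<>());
    }

    // Pack (source vertex, current vertex) into one long: the only state a walk needs.
    static long encode(long source, long current) { return (source << 32) | (current & 0xFFFFFFFFL); }
    static long source(long walk)  { return walk >>> 32; }
    static long current(long walk) { return walk & 0xFFFFFFFFL; }

    // addHop: advance the walk to nextVertex and file it under that vertex's interval,
    // so the next PSW pass over that interval will pick it up.
    void addHop(long walk, long nextVertex) {
        buckets.get((int) (nextVertex / intervalLength)).add(encode(source(walk), nextVertex));
    }

    List<Long> walksForInterval(int p) { return buckets.get(p); }
}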

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 71: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

72

Impact ldquoBig Datardquo Research

bull GraphChirsquos OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar)

ndash Two major direct descendant papers in top conferencesbull X-Stream SOSP 2013bull TurboGraph KDD 2013

bull Challenging the mainstreamndash You can do a lot on just a PC focus on right

data structures computational models

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 72: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

73

Impact Users

bull GraphChirsquos (C++ Java -DB) have gained a lot of usersndash Currently ~50 unique visitors day

bull Enables lsquoeveryonersquo to tackle big graph problemsndash Especially the recommender toolkit (by Danny

Bickson) has been very popularndash Typical users students non-systems researchers

small companieshellip

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 73: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

74

Impact Users (cont)

I work in a [EU country] public university I cant use a a distributed computing cluster for my research it is too expensive

Using GraphChi I was able to perform my experiments on my laptop I thus have to admit that GraphChi saved my research ()

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

• Very large vertex state
• Traversals and two-hop dependencies
– Or dynamic scheduling (such as Splash BP)

• High-diameter graphs, such as planar graphs
– Unless the computation itself has short-range interactions

• Very large number of iterations
– Neural networks
– LDA with collapsed Gibbs sampling

• No support for implicit graph structure

+ Single PC performance is limited

78

Versatility of PSW and PAL

[Diagram: the two systems and the workloads they support. GraphChi (Parallel Sliding Windows) handles Graph Computation (batch, evolving-graph and incremental); GraphChi-DB (Partitioned Adjacency Lists) adds online graph updates and Graph Queries; DrunkardMob provides parallel random walk simulation; GraphChi^2. Workloads shown: Triangle Counting / Item-Item Similarity, PageRank / SALSA / HITS, Graph Contraction / Minimum Spanning Forest, k-Core, Shortest Path, Friends-of-Friends, Induced Subgraphs, Neighborhood query, Edge and vertex properties, Link prediction, Weakly / Strongly Connected Components, Community Detection / Label Propagation, Multi-BFS, Graph sampling, Matrix Factorization, Loopy Belief Propagation / Co-EM, Graph traversal.]

79

Future Research Directions

• Distributed setting
1. Distributed PSW (one shard / node)
– PSW is inherently sequential
– Low bandwidth in the Cloud
2. Co-operating GraphChi(-DB)'s connected with a Parameter Server

• New graph programming models and tools
– Vertex-centric programming is sometimes too local: for example, two-hop interactions and many traversals are cumbersome
– Abstractions for learning graph structure; implicit graphs
– Hard to debug, especially async; better tools needed

• Graph-aware optimizations to GraphChi-DB
– Buffer management
– Smart caching
– Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM
– Good: works with very little memory
– Bad: does not benefit from more memory

• Simple trick: cache some data

• Many graph algorithms have O(V) state
– The update function accesses neighbor vertex state
• Standard PSW: "broadcast" vertex value via edges
• Semi-external: store vertex values in memory (see the sketch below)
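
To make the semi-external idea concrete, here is a minimal sketch (my own illustration with a toy in-memory graph; in GraphChi the adjacency lists would be streamed from disk with PSW, and the API names would differ): the only O(V) state is a float array of vertex values kept in RAM, and the update reads neighbor values directly instead of broadcasting them along edges.

import java.util.Arrays;

public class SemiExternalPageRankSketch {
    public static void main(String[] args) {
        // Toy graph: inNeighbors[v] lists the in-neighbors of v; outDegree[v] is v's out-degree.
        // In GraphChi these adjacency lists would be streamed from disk, interval by interval.
        int[][] inNeighbors = { {1, 2}, {2}, {0, 1} };
        int[] outDegree     = { 1, 2, 2 };
        int n = inNeighbors.length;

        float[] value = new float[n];                   // the O(V) state kept in RAM
        Arrays.fill(value, 1.0f / n);

        for (int iter = 0; iter < 20; iter++) {
            for (int v = 0; v < n; v++) {
                float sum = 0;
                for (int nb : inNeighbors[v]) {
                    sum += value[nb] / outDegree[nb];   // read neighbor state directly from RAM
                }
                value[v] = 0.15f / n + 0.85f * sum;     // no per-edge "broadcast" needed
            }
        }
        System.out.println(Arrays.toString(value));
    }
}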

82

Using RAM efficiently

• Assume there is enough RAM to store many O(V) algorithm states in memory
– But not enough to store the whole graph

[Diagram: the graph and PSW working memory reside on disk; the computational state (Computation 1 state, Computation 2 state, ...) resides in RAM]

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
– Store billions of random walk states in RAM

• Multiple Breadth-First Searches
– Analyze neighborhood sizes by starting hundreds of random BFSes (see the sketch after this list)

• Compute in parallel many different recommender algorithms (or with different parameterizations)
– See Mayank Mohta and Shu-Hao Yu's Master's project
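
As an illustration of the multi-BFS idea (a sketch under my own assumptions, not the thesis's implementation): with one 64-bit word per vertex, up to 64 searches can advance their frontiers simultaneously; the only algorithm state is the O(V) bit-vector array in RAM, while the edges could be streamed from disk with PSW.

import java.util.Arrays;

public class MultiBfsSketch {
    public static void main(String[] args) {
        // Toy directed graph as an edge list; with GraphChi the edges would be streamed from disk.
        int[][] edges = { {0, 1}, {1, 2}, {2, 3}, {3, 0}, {1, 3} };
        int numVertices = 4;
        int[] sources = { 0, 2 };                     // one search per bit, up to 64 per long

        long[] reached = new long[numVertices];       // bit k set: vertex reached by search k
        for (int k = 0; k < sources.length; k++) reached[sources[k]] |= (1L << k);

        boolean changed = true;
        while (changed) {                             // one pass over the edges advances all searches
            changed = false;
            for (int[] e : edges) {
                long propagate = reached[e[0]] & ~reached[e[1]];
                if (propagate != 0) { reached[e[1]] |= propagate; changed = true; }
            }
        }
        // reached[v] now tells, for every start vertex, whether v was reached from it.
        System.out.println(Arrays.toString(reached));
    }
}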

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin). UAI 2010.

Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors). VLDB 2012.

GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch). OSDI 2012.

DrunkardMob: Billions of Random Walks on Just a PC. ACM RecSys 2013.

Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch). SEA 2014.

GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin). Submitted (arXiv).

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin). ICML 2011.

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation: GraphChi

• Extended PSW to design Partitioned Adjacency Lists, the basis of a scalable graph database: GraphChi-DB

• Proposed DrunkardMob for simulating billions of random walks in parallel

• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms: a new approach for external-memory graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                     GraphChi (40 Mac Minis)    PowerGraph (64 EC2 cc1.4xlarge)
Investment           $67,320                    –
Operating costs:
  Per node / hour    $0.03                      $1.30
  Cluster / hour     $1.19                      $52.00
  Daily              $28.56                     $1,248.00

It takes about 56 days to recoup the Mac Mini investment.

Assumptions: Mac Mini 85 W (typical servers 500-1000 W); most expensive US energy 35c / kWh. Equal-throughput configurations (based on OSDI'12).
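
A quick back-of-the-envelope check of the 56-day figure, reading the table's amounts as dollars and cents:

  daily operating-cost difference ≈ $1,248.00 - $28.56 ≈ $1,219 per day
  $67,320 / ($1,219 per day) ≈ 55 days, i.e. roughly eight weeks

which is consistent with the "about 56 days" stated on the slide.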

89

PSW for In-memory Computation

• External-memory setting
– Slow memory = hard disk / SSD
– Fast memory = RAM

• In-memory
– Slow = RAM
– Fast = CPU caches

[Diagram: the external-memory model: a CPU attached to a small (fast) memory of size M and a large (slow) memory of "unlimited" size, with block transfers of size B between them]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW slightly outperforms Ligra, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, when the preprocessing steps of both systems are included.

(Ligra: Julian Shun, Guy Blelloch, PPoPP 2013)

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
– But needs twice the amount of space

• Async / G-S helps with high-diameter graphs
• Some algorithms converge much better asynchronously
– Loopy BP: see Gonzalez et al. (2009)
– Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)

• Asynchronous is sometimes difficult to reason about and debug

• Asynchronous can be used to implement BSP

IO Complexity

• See the paper for theoretical analysis in the Aggarwal-Vitter I/O model
– Worst case is only 2x the best case

• Intuition:

[Diagram: vertex range 1 ... |V| divided into interval(1), interval(2), ..., interval(P) with corresponding shard(1), shard(2), ..., shard(P); an example edge (v1, v2) is shown]

An edge whose endpoints lie in the same interval is loaded from disk only once per iteration; an edge spanning two intervals is loaded twice per iteration.
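
Putting that intuition into a rough formula (my paraphrase; see the thesis for the exact statement): each of the |E| edge values is read and written once if its endpoints share an interval, and read and written twice if it spans two intervals, so the block-transfer cost of one iteration lies roughly between

  2|E| / B   and   4|E| / B

plus on the order of P^2 non-sequential seeks for the sliding windows. This is why the worst case is only about 2x the best case.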

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise too many iterations would be required

• Per-iteration cost is not much affected

• Can we optimize partitioning?
– Could help thanks to Gauss-Seidel (faster convergence inside "groups"); topological sort
– Likely too expensive to do on a single PC

94

Graph Compression Would it help

• Graph compression methods (e.g. Blelloch et al., the WebGraph framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
– But they require graph partitioning, which requires a lot of memory
– Compression of large graphs can take days (personal communication)

• Compression is problematic for evolving graphs and associated data

• GraphChi can be used to compress graphs
– Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• The 1990s and 2000s saw interest in object-oriented and graph databases
– GOOD, GraphDB, HyperGraphDB, ...
– Focus was on modeling; graph storage was built on top of a relational DB or key-value store

• RDF databases
– Most do not use graph storage but store triples as relations + use indexing

• Modern solutions have proposed graph-specific storage
– Neo4j: doubly linked list
– TurboGraph: adjacency list chopped into pages
– DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time: 9 hours; FB's: 12 hours
• GraphChi database: about 250 GB; FB: over 1.4 terabytes
– However, about 100 GB is explained by different variable-data (payload) sizes

• Facebook: MySQL via JDBC; GraphChi-DB: embedded
– But MySQL is native code, GraphChi-DB is Scala (JVM)

• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance / Size of DB

[Chart: LinkBench requests / sec as a function of database size (number of nodes: 1.0E+07, 1.0E+08, 2.0E+08, 1.0E+09); y-axis from 0 to 10,000 requests / sec]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory extension [SSDAlloc, NSDI'11]
– Problem: too many small objects; need millions / sec

2. Compress the graph structure to fit into RAM [WebGraph framework]
– Problem: associated values do not compress well and are mutated

3. Cluster the graph and handle each cluster separately in RAM
– Problem: expensive, and the number of inter-cluster edges is big

4. Caching of hot nodes
– Problem: unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks: parallel I/O

[Chart: seconds / iteration for Pagerank and Connected components with 1, 2, and 3 hard drives; y-axis from 0 to 3000 seconds]

Experiment on a 16-core AMD server (from year 2007).

Bottlenecks

[Chart: Connected Components on Mac Mini (SSD); runtime with 1, 2, and 4 threads, split into Disk IO, Graph construction, and Exec. updates; y-axis from 0 to 2500 seconds]

• The cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
– Graph construction requires a lot of random access in RAM; memory bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime (seconds) vs. number of threads (1, 2, 4), split into Loading and Computation. Matrix Factorization (ALS): y-axis 0-160 seconds. Connected Components: y-axis 0-1000 seconds.]

104

In-memory vs Disk

105

Experiment Query latency

See the thesis for an I/O cost analysis of in- and out-edge queries.

106

Example Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S

• Very fast in GraphChi-DB
– Sufficient to query for out-edges (see the sketch below)
– Parallelizes well: multi-out-edge query

• Can be used for statistical graph analysis
– Sample induced neighborhoods, induced FoF neighborhoods from the graph
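
A minimal sketch of why out-edge queries suffice (illustrative code with hypothetical names, not GraphChi-DB's actual API): query the out-edges of every vertex in S and keep only the edges whose target is also in S.

import java.util.*;

public class InducedSubgraphSketch {
    // 'outEdges' stands in for a (possibly parallel) multi-out-edge query against the database.
    static List<long[]> inducedSubgraph(Set<Long> s, Map<Long, List<Long>> outEdges) {
        List<long[]> result = new ArrayList<>();
        for (long v : s) {
            for (long target : outEdges.getOrDefault(v, Collections.emptyList())) {
                if (s.contains(target)) {
                    result.add(new long[] { v, target });   // both endpoints are in S
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, List<Long>> outEdges = new HashMap<>();
        outEdges.put(1L, Arrays.asList(2L, 5L));
        outEdges.put(2L, Arrays.asList(3L));
        Set<Long> s = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        for (long[] e : inducedSubgraph(s, outEdges)) {
            System.out.println(e[0] + " -> " + e[1]);       // prints 1 -> 2 and 2 -> 3
        }
    }
}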

Vertices / Nodes

• Vertices are partitioned similarly to edges
– Similar "data shards" for columns

• Lookup / update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
– A sparse structure could be supported

[Diagram: vertex ID range 1 ... Max-id divided into intervals of length a: 1 ... a, a+1 ... 2a, ...]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance the shards
– Interval length constant a

[Table: blocks of 256 consecutive original vertex IDs (0-255, 256-511, ...) are distributed across the internal ID ranges starting at 0, 1, 2, ...; a, a+1, a+2, ...; 2a, 2a+1, ...]

109

What if we have a Cluster

[Diagram: the graph and PSW working memory reside on disk; the computational state (Computation 1 state, Computation 2 state, ...) resides in RAM]

Trade latency for throughput

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications

2. ... which is caused by the lack of good data available to academics: big graphs with metadata
– Industry co-operation? But there is a problem with reproducibility
– Also, it is hard to ask good questions about graphs (especially with just the structure)

3. Too much focus on performance; more important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java, 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems (this talk): "paging" from disk is costly.

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm – reverse thinking

foreach interval p:
    walkSnapshot = getWalksForInterval(p)
    foreach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        foreach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: only the current position of each walk needs to be stored.
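
To make the note concrete, here is an illustrative encoding of a walk's state (my own assumption for the sketch; the thesis's exact bit layout may differ): pack the walk's source index and its current vertex into one primitive long, so that billions of walk states fit in a few gigabytes of RAM and can be bucketed by the interval of their current vertex.

public final class WalkStateSketch {
    // Upper 32 bits: index of the walk's source; lower 32 bits: current vertex id.
    static long create(int sourceIndex, int currentVertex) {
        return ((long) sourceIndex << 32) | (currentVertex & 0xffffffffL);
    }
    static int sourceIndex(long walk)   { return (int) (walk >>> 32); }
    static int currentVertex(long walk) { return (int) walk; }

    static long hop(long walk, int nextVertex) {        // advance the walk by one hop
        return create(sourceIndex(walk), nextVertex);
    }

    public static void main(String[] args) {
        long w = create(42, 7);
        w = hop(w, 13);
        System.out.println(sourceIndex(w) + " @ " + currentVertex(w));  // 42 @ 13
        // 10^9 walks * 8 bytes = about 8 GB of RAM for all walk states.
    }
}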

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 74: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

75

EVALUATION HINDSIGHT

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 75: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

76

What is GraphChi Optimized for

bull Original target algorithm Belief Propagation on Probabilistic Graphical Models

1 Changing value of an edge (both in- and out)

2 Computation process whole or most of the graph on each iteration

3 Random access to all vertexrsquos edgesndash Vertex-centric vs edge centric

4 AsyncGauss-Seidel execution

u v

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 76: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

77

GraphChi Not Good For

bull Very large vertex statebull Traversals and two-hop dependencies

ndash Or dynamic scheduling (such as Splash BP)

bull High diameter graphs such as planar graphsndash Unless the computation itself has short-range

interactions

bull Very large number of iterationsndash Neural networksndash LDA with Collapsed Gibbs sampling

bull No support for implicit graph structure

+ Single PC performance is limited

78

GraphChi (Parallel Sliding Windows)

GraphChi-DB(Partitioned Adjacency Lists)

Online graph updates

Batchcomp

Evolving graph

Incremental comp

DrunkardMob Parallel Random walk simulation

GraphChi^2

Triangle CountingItem-Item Similarity

PageRankSALSAHITS

Graph ContractionMinimum Spanning Forest

k-Core

Graph Computation

Shortest Path

Friends-of-Friends

Induced Subgraphs

Neighborhood query

Edge and vertex properties

Link prediction

Weakly Connected ComponentsStrongly Connected Components

Community DetectionLabel Propagation

Multi-BFS

Graph sampling

Graph Queries

Matrix Factorization

Loopy Belief PropagationCo-EM

Graph traversal

Versatility of PSW and PAL

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

• Vertex IDs mapped to internal IDs to balance shards
  – Interval length constant a
[Table: internal IDs 0, 1, 2, …; a, a+1, a+2, …; 2a, 2a+1, …; original IDs 0, 1, 2, …, 255 and 256, 257, 258, …, 511]
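One plausible reading of the mapping figure (an assumption on my part, not something stated explicitly in the transcript): consecutive original IDs are scattered round-robin across the intervals of length a, and a vertex's value then sits at a fixed offset inside a dense per-interval file, which is what makes the lookup O(1). A small illustrative sketch:

    // Sketch only; the mapping and file layout below are assumed, not confirmed.
    object IdMappingSketch {
      val numIntervals = 256    // as drawn in the figure; illustrative
      val a = 1000000L          // interval length; illustrative

      // Original IDs 0,1,2,... are spread across intervals 0,1,2,... so shards stay balanced.
      def toInternal(originalId: Long): Long =
        (originalId % numIntervals) * a + originalId / numIntervals

      // Dense per-interval vertex file: O(1) lookup by direct byte offset.
      def vertexLocation(internalId: Long, bytesPerValue: Int): (Int, Long) =
        ((internalId / a).toInt, (internalId % a) * bytesPerValue)

      def main(args: Array[String]): Unit =
        // Matches the figure: originals 0, 1, 2 map to 0, a, 2a and 256, 257 map to 1, a+1
        println(Seq(0L, 1L, 2L, 256L, 257L).map(toInternal))
    }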

109

What if we have a Cluster

[Diagram: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state, …) in RAM]
Trade latency for throughput.

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications
2. … which is caused by lack of good data available for the academics: big graphs with metadata
   – Industry co-operation? But problem with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just structure)
3. Too much focus on performance. More important to enable "extracting value".

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

  parfor walk in walks:
      for i = 1 to numsteps:
          vertex = walk.atVertex(); walk.takeStep(vertex.randomNeighbor())

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java 2009

• Distributed graph systems: each hop across a partition boundary is costly
• Disk-based "single-machine" graph systems (this talk): "paging" from disk is costly

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm: reverse thinking

  foreach interval p:
      walkSnapshot = getWalksForInterval(p)
      foreach vertex in interval(p):
          mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
          foreach walk in mywalks:
              walkManager.addHop(walk, vertex.randomNeighbor())

Note: only the current position of each walk needs to be stored.
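A compact, self-contained illustration of the same idea (plain in-memory Scala with toy data, not the DrunkardMob implementation itself): each walk is represented only by the vertex it currently sits on, walks are grouped by the interval of that vertex, and each interval's walks are advanced together so the graph is touched interval by interval rather than walk by walk.

    import scala.util.Random

    // Illustrative only: DrunkardMob-style bookkeeping with in-memory stand-ins.
    object DrunkardMobSketch {
      def main(args: Array[String]): Unit = {
        val rnd = new Random(42)
        val numVertices = 1000
        val intervalLen = 100                                 // vertices per interval
        // Toy graph: every vertex gets a few random out-neighbors.
        val outNeighbors = Array.tabulate(numVertices)(_ => Array.fill(4)(rnd.nextInt(numVertices)))
        // The only per-walk state: the vertex each walk currently sits on.
        val walkPosition = Array.fill(100000)(rnd.nextInt(numVertices))

        for (_ <- 1 to 5) {
          // "Reverse thinking": snapshot the walks grouped by the interval of their current vertex...
          val walksByInterval = walkPosition.indices.groupBy(w => walkPosition(w) / intervalLen)
          // ...then advance all walks of one interval together, so each interval's sub-graph
          // only needs to be visited once per step instead of once per walk.
          for ((_, walks) <- walksByInterval; w <- walks) {
            val nbrs = outNeighbors(walkPosition(w))
            walkPosition(w) = nbrs(rnd.nextInt(nbrs.length)) // one hop
          }
        }
        println(s"Walk 0 ended at vertex ${walkPosition(0)} after 5 steps")
      }
    }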


78

Versatility of PSW and PAL
[Diagram linking GraphChi (Parallel Sliding Windows) and GraphChi-DB (Partitioned Adjacency Lists) to: online graph updates, batch comp., evolving graphs, incremental comp., DrunkardMob (parallel random walk simulation), GraphChi^2, Triangle Counting / Item-Item Similarity, PageRank / SALSA / HITS, Graph Contraction / Minimum Spanning Forest, k-Core, Shortest Path, Friends-of-Friends, Induced Subgraphs, Neighborhood query, Edge and vertex properties, Link prediction, Weakly / Strongly Connected Components, Community Detection / Label Propagation, Multi-BFS, Graph sampling, Matrix Factorization, Loopy Belief Propagation / Co-EM, Graph traversal; grouped on the slide under Graph Computation and Graph Queries]

79

Future Research Directions

• Distributed setting
  1. Distributed PSW (one shard / node)
     – PSW is inherently sequential
     – Low bandwidth in the Cloud
  2. Co-operating GraphChi(-DB)'s connected with a Parameter Server
• New graph programming models and tools
  – Vertex-centric programming is sometimes too local; for example, two-hop interactions and many traversals are cumbersome
  – Abstractions for learning graph structure; implicit graphs
  – Hard to debug, especially async; better tools needed
• Graph-aware optimizations to GraphChi-DB
  – Buffer management
  – Smart caching
  – Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

• The I/O performance of PSW is only weakly affected by the amount of RAM
  – Good: works with very little memory
  – Bad: does not benefit from more memory
• Simple trick: cache some data
• Many graph algorithms have O(V) state
  – Update function accesses neighbor vertex state
  – Standard PSW: 'broadcast' vertex value via edges
  – Semi-external: store vertex values in memory (sketched below)
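A minimal sketch of the semi-external trick under assumed types (this is not GraphChi's API): the O(V) vertex-value array lives in RAM, edges are still streamed block by block from disk, and the update reads neighbor values directly from RAM instead of broadcasting them over edges.

    // Semi-external PageRank-style iteration: O(V) state in RAM, edges streamed.
    object SemiExternalSketch {
      final case class Edge(src: Int, dst: Int)

      def iteration(numVertices: Int,
                    edgeBlocks: Iterator[Seq[Edge]],   // stand-in for shard-by-shard streaming
                    rank: Array[Float],                // in-RAM vertex values, O(V)
                    outDegree: Array[Int]): Array[Float] = {
        val next = Array.fill(numVertices)(0.15f / numVertices)
        for (block <- edgeBlocks; e <- block) {
          // Neighbor value comes from RAM; nothing is written on the edges themselves.
          next(e.dst) += 0.85f * rank(e.src) / math.max(outDegree(e.src), 1)
        }
        next
      }

      def main(args: Array[String]): Unit = {
        val edges = Seq(Edge(0, 1), Edge(1, 2), Edge(2, 0))
        println(iteration(3, Iterator(edges), Array.fill(3)(1.0f / 3), Array(1, 1, 1)).toSeq)
      }
    }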

82

Using RAM efficiently

• Assume that there is enough RAM to store many O(V) algorithm states in memory
  – But not enough to store the whole graph
[Diagram: graph working memory (PSW) on disk; computational state (Computation 1 state, Computation 2 state, …) in RAM]

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5)
  – Store billions of random walk states in RAM
• Multiple Breadth-First Searches (see the sketch after this list)
  – Analyze neighborhood sizes by starting hundreds of random BFSes
• Compute in parallel many different recommender algorithms (or with different parameterizations)
  – See Mayank Mohta & Shu-Hao Yu's Master's project
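For instance, up to 64 BFS instances can share a single sweep over the edges when each source owns one bit of a per-vertex Long mask; the masks are exactly the kind of O(V) state that fits in RAM while the edges stay on disk. An illustrative in-memory sketch (my own, with assumed data shapes):

    // Illustrative multi-source BFS: bit i of reached(v) says whether v is
    // within `rounds` hops of sources(i). Up to 64 sources advance together.
    object MultiBfsSketch {
      def multiBfs(numVertices: Int, edges: Seq[(Int, Int)],
                   sources: Seq[Int], rounds: Int): Array[Long] = {
        require(sources.size <= 64, "one bit per BFS instance")
        val reached = Array.fill(numVertices)(0L)
        sources.zipWithIndex.foreach { case (s, i) => reached(s) |= (1L << i) }
        for (_ <- 1 to rounds) {
          val next = reached.clone()
          for ((src, dst) <- edges) next(dst) |= reached(src) // one sweep advances all BFSes
          Array.copy(next, 0, reached, 0, numVertices)
        }
        reached
      }

      def main(args: Array[String]): Unit = {
        val r = multiBfs(4, Seq((0, 1), (1, 2), (2, 3)), sources = Seq(0, 2), rounds = 2)
        // Neighborhood sizes per source, as in the slide's use case:
        println((0 until 2).map(i => r.count(m => (m & (1L << i)) != 0)))  // Vector(3, 2)
      }
    }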

84

CONCLUSION

85

Summary of Published Work
• GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin). UAI 2010.
• Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors). VLDB 2012.
• GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch). OSDI 2012.
• DrunkardMob: Billions of Random Walks on Just a PC. ACM RecSys 2013.
• Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch). SEA 2014.
• GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin). Submitted (arXiv).
• Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin). ICML 2011.
First-author papers in bold (on the slide).

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external memory graph computation: GraphChi
• Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database: GraphChi-DB
• Proposed DrunkardMob for simulating billions of random walks in parallel
• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms: a new approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                    GraphChi (40 Mac Minis)   PowerGraph (64 EC2 cc1.4xlarge)
Investments         67,320 $                  –
Operating costs
  Per node / hour   0.03 $                    1.30 $
  Cluster / hour    1.19 $                    52.00 $
  Daily             28.56 $                   1,248.00 $

It takes about 56 days to recoup Mac Mini investments

Assumptions: Mac Mini 85 W (typical servers 500–1000 W); most expensive US energy 35c / kWh.
Equal-throughput configurations (based on OSDI'12).
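As a rough sanity check of the "about 56 days" figure, using only the numbers above: daily savings ≈ 1,248.00 $ − 28.56 $ ≈ 1,219 $, so payback ≈ 67,320 $ / 1,219 $ per day ≈ 55 days.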

89

PSW for In-memory Computation

• External memory setting
  – Slow memory = hard disk / SSD
  – Fast memory = RAM
• In-memory
  – Slow = RAM
  – Fast = CPU caches
[Diagram: CPU; small (fast) memory of size M; block transfers of size B; large (slow) memory of 'unlimited' size]
Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including preprocessing steps of both systems.
(Ligra: Julian Shun, Guy Blelloch, PPoPP 2013.)

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
  – But needs twice the amount of space
• Async / Gauss-Seidel helps with high-diameter graphs
• Some algorithms converge much better asynchronously
  – Loopy BP: see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)
• Asynchronous is sometimes difficult to reason about and debug
• Asynchronous can be used to implement BSP

IO Complexity

• See the paper for theoretical analysis in Aggarwal–Vitter's I/O model
  – Worst case only 2x best case
• Intuition:
[Diagram: vertex range 1 … |V| split into interval(1), interval(2), …, interval(P), with shard(1), shard(2), …, shard(P); example vertices v1, v2]
  – An edge whose endpoints are in the same interval is loaded from disk only once per iteration
  – An edge spanning two intervals is loaded twice per iteration
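Spelling the intuition out as a back-of-the-envelope sketch (my own reading; the thesis gives the precise Aggarwal–Vitter analysis): if e_1 edges have both endpoints in one interval and e_2 edges span two intervals (e_1 + e_2 = |E|), an iteration reads about (e_1 + 2·e_2)/B blocks, which ranges from |E|/B when no edge spans intervals up to 2·|E|/B when every edge does; counting the corresponding write-backs roughly doubles both ends, so the worst case stays within about 2x of the best case, as stated above.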

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter: otherwise too many iterations would be required
• Per-iteration cost not much affected
• Can we optimize partitioning?
  – Could help thanks to Gauss-Seidel (faster convergence inside "groups"); topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression: Would it help?
• Graph compression methods (e.g. Blelloch et al., WebGraph framework) can be used to compress edges to 3–4 bits / edge (web), ~10 bits / edge (social)
  – But they require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)
• Compression is problematic for evolving graphs and associated data
• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 78: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

79

Future Research Directions

bull Distributed Setting1 Distributed PSW (one shard node)

1 PSW is inherently sequential2 Low bandwidth in the Cloud

2 Co-operating GraphChi(-DB)rsquos connected with a Parameter Server

bull New graph programming models and Toolsndash Vertex-centric programming sometimes too local for example

two-hop interactions and many traversals cumbersomendash Abstractions for learning graph structure Implicit graphsndash Hard to debug especially async Better tools needed

bull Graph-aware optimizations to GraphChi-DBndash Buffer managementndash Smart cachingndash Learning configuration

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 79: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

80

WHAT IF WE HAVE PLENTY OF MEMORY

Semi-External Memory Setting

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 80: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

81

Observations

bull The IO performance of PSW is only weakly affected by the amount of RAMndash Good works with very little memoryndash Bad Does not benefit from more memory

bull Simple trick cache some data

bull Many graph algorithms have O(V) statendash Update function accesses neighbor vertex

statebull Standard PSW lsquobroadcastrsquo vertex value via edgesbull Semi-external Store vertex values in memory

82

Using RAM efficiently

bull Assume that enough RAM to store many O(V) algorithm states in memoryndash But not enough to store the whole

graph

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

83

Parallel Computation Examples

bull DrunkardMob algorithm (Chapter 5)ndash Store billions of random walk states in RAM

bull Multiple Breadth-First-Searchesndash Analyze neighborhood sizes by starting

hundreds of random BFSes

bull Compute in parallel many different recommender algorithms (or with different parameterizations)ndash See Mayank Mohta Shu-Hao Yursquos Masterrsquos

project

84

CONCLUSION

85

Summary of Published WorkGraphLab Parallel Framework for Machine Learning (with J Gonzaled YLow D Bickson CGuestrin)

UAI 2010

Distributed GraphLab Framework for Machine Learning and Data Mining in the Cloud (-- same --)

VLDB 2012

GraphChi Large-scale Graph Computation on Just a PC (with CGuestrin G Blelloch)

OSDI 2012

DrunkardMob Billions of Random Walks on Just a PC ACM RecSys 2013

Beyond Synchronous New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun G Blelloch)

SEA 2014

GraphChi-DB Simple Design for a Scalable Graph Database ndash on Just a PC (with C Guestrin)

(submitted arxiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J Bradley D Bickson CGuestrin)

ICML 2011

First author papers in bold

THESIS

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())
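A runnable version of the pseudo-code above, with the graph as an in-memory adjacency list; simulate_walks and num_steps are illustrative names, and the "parfor" is done sequentially here.

import random

def simulate_walks(adj, starts, num_steps):
    # adj[v]: list of neighbors of v; starts: starting vertex of each walk
    positions = list(starts)
    for w in range(len(positions)):
        for _ in range(num_steps):
            v = positions[w]
            if not adj[v]:
                break                      # dead end: stop this walk
            positions[w] = random.choice(adj[v])
    return positions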

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems: "paging" from disk is costly.

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm – reverse thinking:

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk.
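A sketch of one DrunkardMob-style iteration under the "reverse thinking" above, assuming the graph is processed interval by interval (as PSW loads them) and only the current position of each walk is kept; the names are illustrative, not the GraphChi API.

import random
from collections import defaultdict

def drunkardmob_iteration(adj, walk_positions, interval_of, num_intervals):
    # group walks by the interval that contains their current vertex
    buckets = defaultdict(list)
    for walk_id, v in enumerate(walk_positions):
        buckets[interval_of(v)].append(walk_id)
    # process one interval at a time (the sub-graph PSW has loaded)
    for p in range(num_intervals):
        for walk_id in buckets[p]:
            v = walk_positions[walk_id]
            if adj[v]:
                walk_positions[walk_id] = random.choice(adj[v])   # "addHop"
    return walk_positions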

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • "Gauss-Seidel" Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW & External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref: O'Neil, Cheng et al. (1996)
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact: "Big Data" Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 81: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

82

Using RAM efficiently

• Assume that there is enough RAM to store many O(V) algorithm states in memory – but not enough to store the whole graph

[Diagram: graph working memory (PSW) on disk; Computation 1 state, Computation 2 state, … held as computational state in RAM]

83

Parallel Computation Examples

• DrunkardMob algorithm (Chapter 5) – store billions of random walk states in RAM

• Multiple Breadth-First-Searches – analyze neighborhood sizes by starting hundreds of random BFSes (a sketch follows below)

• Compute in parallel many different recommender algorithms (or with different parameterizations) – see Mayank Mohta & Shu-Hao Yu's Master's project
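A sketch, under my own assumptions, of the "many O(V) states in RAM, graph streamed from disk" pattern for the BFS example: hundreds of distance arrays are advanced one level per sweep over the edges (edges treated as undirected, distances initialized to infinity except 0 at each source).

def multi_bfs_level(edges, dists, level):
    # dists: one distance array per BFS source (the O(V) states kept in RAM);
    # edges: iterable of (u, v) pairs, e.g. streamed from disk once per level
    changed = False
    for (u, v) in edges:
        for d in dists:
            if d[u] == level and d[v] > level + 1:
                d[v] = level + 1
                changed = True
            if d[v] == level and d[u] > level + 1:
                d[u] = level + 1
                changed = True
    return changed   # False once every BFS frontier has stopped growing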

84

CONCLUSION

85

Summary of Published Work

GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin) – UAI 2010

Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors) – VLDB 2012

GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch) – OSDI 2012

DrunkardMob: Billions of Random Walks on Just a PC – ACM RecSys 2013

Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch) – SEA 2014

GraphChi-DB: Simple Design for a Scalable Graph Database – on Just a PC (with C. Guestrin) – (submitted, arXiv)

Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin) – ICML 2011

First-author papers in bold.

THESIS

86

Summary of Main Contributions

• Proposed Parallel Sliding Windows, a new algorithm for external memory graph computation: GraphChi

• Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database: GraphChi-DB

• Proposed DrunkardMob for simulating billions of random walks in parallel

• Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms: new approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

                      GraphChi (40 Mac Minis)    PowerGraph (64 EC2 cc1.4xlarge)
Investments           67,320 $                   –
Operating costs:
  Per node / hour     0.03 $                     1.30 $
  Cluster / hour      1.19 $                     52.00 $
  Daily               28.56 $                    1,248.00 $

It takes about 56 days to recoup the Mac Mini investments.

Assumptions: Mac Mini 85 W (typical servers 500–1000 W); most expensive US energy 35 c / kWh.

Equal-throughput configurations (based on OSDI'12).
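A quick sanity check of the payback period implied by the operating costs in the table above:

investment = 67320.00            # Mac Mini cluster ($)
daily_graphchi = 28.56           # $ / day
daily_powergraph = 1248.00       # $ / day
payback_days = investment / (daily_powergraph - daily_graphchi)
print(round(payback_days, 1))    # ~55.2 days, consistent with "about 56 days"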

89

PSW for In-memory Computation

• External memory setting – slow memory = hard disk / SSD; fast memory = RAM

• In-memory – slow = RAM; fast = CPU caches

[Diagram: external-memory model – CPU with small (fast) memory of size M, large (slow) memory of 'unlimited' size, data moved in block transfers of size B]

Does PSW help in the in-memory setting?

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra, the fastest in-memory graph computation system, on Pagerank with the Twitter graph, including preprocessing steps of both systems.

(Ligra: Julian Shun, Guy Blelloch, PPoPP 2013)

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel – but needs twice the amount of space

• Async / Gauss-Seidel helps with high-diameter graphs

• Some algorithms converge much better asynchronously – Loopy BP, see Gonzalez et al. (2009) – also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)

• Asynchronous is sometimes difficult to reason about and debug

• Asynchronous can be used to implement BSP

IO Complexity

• See the paper for theoretical analysis in Aggarwal–Vitter's I/O model – worst-case only 2x best-case

• Intuition:

[Diagram: vertex range 1 … |V| split into interval(1), interval(2), …, interval(P) with corresponding shard(1), shard(2), …, shard(P); example vertices v1, v2]

An edge with both endpoints in the same interval is loaded from disk only once per iteration.

An edge spanning two intervals is loaded twice per iteration.
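A small sketch that just restates the intuition above as a count, for a given assignment of vertices to intervals (interval_of is an assumed helper, not thesis code):

def edge_loads_per_iteration(edges, interval_of):
    # interval_of(v): index of the interval containing vertex v
    total = 0
    for (src, dst) in edges:
        # same interval -> loaded once; spanning two intervals -> loaded twice
        total += 1 if interval_of(src) == interval_of(dst) else 2
    return total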

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter – otherwise they would require too many iterations

• Per-iteration cost not much affected

• Can we optimize partitioning? – could help thanks to Gauss-Seidel (faster convergence inside "groups"), topological sort – likely too expensive to do on a single PC

94

Graph Compression: Would it help?

• Graph compression methods (e.g. Blelloch et al., WebGraph Framework) can be used to compress edges to 3–4 bits / edge (web), ~10 bits / edge (social) – but they require graph partitioning, which requires a lot of memory – compression of large graphs can take days (personal communication)

• Compression problematic for evolving graphs and associated data

• GraphChi can be used to compress graphs – layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• 1990s, 2000s saw interest in object-oriented and graph databases – GOOD, GraphDB, HyperGraphDB… – focus was on modeling; graph storage on top of a relational DB or key-value store

• RDF databases – most do not use graph storage but store triples as relations + use indexing

• Modern solutions have proposed graph-specific storage – Neo4j: doubly linked list – TurboGraph: adjacency list chopped into pages – DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont.)

• GraphChi load time 9 hours; FB's 12 hours

• GraphChi database about 250 GB; FB > 1.4 terabytes – however, about 100 GB is explained by different variable data (payload) size

• Facebook: MySQL via JDBC, GraphChi embedded – but MySQL is native code, GraphChi-DB is Scala (JVM)

• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

Page 85: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

86

Summary of Main Contributions

bull Proposed Parallel Sliding Windows a new algorithm for external memory graph computation GraphChi

bull Extended PSW to design Partitioned Adjacency Lists to build a scalable graph database GraphChi-DB

bull Proposed DrunkardMob for simulating billions of random walks in parallel

bull Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms New approach for EM graph algorithms research

Thank You

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

• Bulk-Synchronous is embarrassingly parallel
  – But needs twice the amount of space
• Async / Gauss-Seidel helps with high-diameter graphs
• Some algorithms converge much better asynchronously
  – Loopy BP, see Gonzalez et al. (2009)
  – Also Bertsekas & Tsitsiklis, Parallel and Distributed Optimization (1989)
• Asynchronous is sometimes difficult to reason about and debug
• Asynchronous can be used to implement BSP
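To make the space and convergence points concrete, here is a minimal PageRank-style sketch in plain Python (illustrative only, not GraphChi code): the synchronous (Jacobi) version writes into a second array, while the asynchronous (Gauss-Seidel) version updates ranks in place and immediately uses the newest values.

    def pagerank_jacobi(in_nbrs, out_deg, d=0.85, iters=20):
        """Bulk-synchronous: read old ranks, write a second array (2x space)."""
        n = len(in_nbrs)
        rank = [1.0 / n] * n
        for _ in range(iters):
            rank = [(1.0 - d) / n + d * sum(rank[u] / out_deg[u] for u in in_nbrs[v])
                    for v in range(n)]
        return rank

    def pagerank_gauss_seidel(in_nbrs, out_deg, d=0.85, iters=20):
        """Asynchronous / Gauss-Seidel: update in place, newest values used right away."""
        n = len(in_nbrs)
        rank = [1.0 / n] * n
        for _ in range(iters):
            for v in range(n):
                rank[v] = (1.0 - d) / n + d * sum(rank[u] / out_deg[u] for u in in_nbrs[v])
        return rank

    # Tiny example: directed 3-cycle 0 -> 1 -> 2 -> 0 (every vertex has out-degree 1).
    in_nbrs = [[2], [0], [1]]
    out_deg = [1, 1, 1]
    print(pagerank_jacobi(in_nbrs, out_deg))
    print(pagerank_gauss_seidel(in_nbrs, out_deg))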

IO Complexity

• See the paper for theoretical analysis in the Aggarwal-Vitter I/O model
  – Worst case is only 2x the best case
• Intuition:

[Figure: vertex range 1 … |V| split into interval(1), interval(2), …, interval(P), with corresponding shard(1), shard(2), …, shard(P)]

An edge with both endpoints in the same interval is loaded from disk only once per iteration; an edge spanning two intervals is loaded twice per iteration.
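A tiny sketch of this counting argument (plain Python; the interval assignment is a made-up example, not GraphChi's): an edge is read when its source's interval is processed and again when its destination's interval is processed, so it costs one load if the endpoints share an interval and two loads otherwise.

    def loads_per_iteration(edges, interval_of):
        """Total disk loads of edges in one PSW iteration under the once/twice rule."""
        return sum(1 if interval_of(u) == interval_of(v) else 2 for u, v in edges)

    # 4 vertices in 2 intervals of length 2: vertices 0,1 -> interval 0; 2,3 -> interval 1.
    edges = [(0, 1), (0, 2), (2, 3), (1, 3)]
    print(loads_per_iteration(edges, lambda v: v // 2))   # 1 + 2 + 1 + 2 = 6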

93

Impact of Graph Structure

• Algorithms with long-range information propagation need a relatively small diameter; otherwise they would require too many iterations
• Per-iteration cost is not much affected
• Can we optimize partitioning?
  – Could help, thanks to Gauss-Seidel (faster convergence inside "groups"); topological sort
  – Likely too expensive to do on a single PC

94

Graph Compression Would it help

• Graph compression methods (e.g. Blelloch et al., WebGraph framework) can be used to compress edges to 3-4 bits / edge (web), ~10 bits / edge (social)
  – But they require graph partitioning, which requires a lot of memory
  – Compression of large graphs can take days (personal communication)
• Compression is problematic for evolving graphs and associated data
• GraphChi can be used to compress graphs
  – Layered label propagation (Boldi et al. 2011)

95

Previous research on (single computer) Graph Databases

• 1990s and 2000s saw interest in object-oriented and graph databases
  – GOOD, GraphDB, HyperGraphDB…
  – Focus was on modeling, graph storage on top of a relational DB or key-value store
• RDF databases
  – Most do not use graph storage, but store triples as relations + use indexing
• Modern solutions have proposed graph-specific storage
  – Neo4j: doubly linked list
  – TurboGraph: adjacency list chopped into pages
  – DEX: compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

• GraphChi load time: 9 hours; FB's: 12 hours
• GraphChi database: about 250 GB; FB's: > 1.4 terabytes
  – However, about 100 GB is explained by different variable-data (payload) size
• Facebook: MySQL via JDBC; GraphChi: embedded
  – But MySQL is native code, GraphChi-DB is Scala (JVM)
• Important: CPU-bound bottleneck in sorting the results for high-degree vertices

LinkBench: GraphChi-DB performance / Size of DB

[Chart: requests / sec (0 - 10,000) vs. number of nodes (1.00E+07, 1.00E+08, 2.00E+08, 1.00E+09)]

Why not even faster when everything is in RAM? Computationally bound requests.

Possible Solutions

1. Use SSD as a memory extension [SSDAlloc, NSDI'11]
   → Too many small objects; need millions / sec
2. Compress the graph structure to fit into RAM [WebGraph framework]
   → Associated values do not compress well, and are mutated
3. Cluster the graph and handle each cluster separately in RAM
   → Expensive: the number of inter-cluster edges is big
4. Caching of hot nodes
   → Unpredictable performance

Number of Shards

• If P is in the "dozens", there is not much effect on performance

Multiple hard-drives (RAIDish)

• GraphChi supports striping shards to multiple disks: parallel I/O

[Chart: seconds / iteration (0 - 3,000) for Pagerank and Connected components with 1, 2, and 3 hard drives]

Experiment on a 16-core AMD server (from year 2007).

Bottlenecks

[Chart: Connected Components on Mac Mini (SSD); runtime (0 - 2,500 s) split into disk I/O, graph construction, and executing updates, for 1, 2, and 4 threads]

• Cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
  – Graph construction requires a lot of random access in RAM; memory bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores, SSD.

• Computationally intensive applications benefit substantially from parallel execution
• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime in seconds, split into loading and computation, vs. number of threads (1, 2, 4), for Matrix Factorization (ALS, 0 - 160 s) and Connected Components (0 - 1,000 s)]

104

In-memory vs Disk

105

Experiment Query latency

See thesis for I/O cost analysis of in/out queries.

106

Example Induced Subgraph Queries

• The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S
• Very fast in GraphChi-DB
  – Sufficient to query for out-edges (see the sketch below)
  – Parallelizes well: multi-out-edge-query
• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods from the graph
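A minimal sketch of why out-edge queries suffice (plain Python; out_edges is a hypothetical stand-in for the database's out-edge query, not the GraphChi-DB API):

    def induced_subgraph(S, out_edges):
        """All edges with both endpoints in S.

        Every edge of the induced subgraph has its source in S, so querying the
        out-edges of each v in S finds all of them. The per-vertex queries are
        independent, so the loop parallelizes naturally (multi-out-edge-query).
        """
        S = set(S)
        return [(u, w) for u in S for w in out_edges(u) if w in S]

    # Tiny example with an in-memory adjacency dict standing in for the database:
    adj = {1: [2, 3], 2: [3], 3: [4], 4: [1]}
    print(induced_subgraph({1, 2, 3}, lambda v: adj.get(v, [])))
    # e.g. [(1, 2), (1, 3), (2, 3)]  (order depends on set iteration)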

Vertices / Nodes

• Vertices are partitioned similarly as edges
  – Similar "data shards" for columns
• Lookup/update of vertex data is O(1)
• No merge tree here: vertex files are "dense"
  – Sparse structure could be supported

[Figure: vertex ID range 1 … a, a+1 … 2a, …, up to Max-id, split into intervals of length a]

ID-mapping

• Vertex IDs are mapped to internal IDs to balance shards
  – Interval length constant a

[Figure: original IDs 0, 1, 2, …, 255 map to internal IDs 0, a, 2a, …; original IDs 256, 257, 258, …, 511 map to internal IDs 1, a+1, 2a+1, …; and so on, so consecutive original IDs are scattered across the intervals]
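A minimal sketch of such a mapping (the stride of 256 is read off the figure above and is an assumption here; the constant used by GraphChi-DB may differ):

    STRIDE = 256  # number of interleaved groups implied by the figure (assumption)

    def to_internal(original_id, a):
        """Scatter consecutive original IDs round-robin across intervals of length a."""
        return (original_id % STRIDE) * a + (original_id // STRIDE)

    def to_original(internal_id, a):
        """Inverse of to_internal."""
        return (internal_id % a) * STRIDE + (internal_id // a)

    a = 1000  # example interval length; must satisfy a >= ceil(max_id / STRIDE)
    assert to_internal(1, a) == a and to_internal(257, a) == a + 1
    assert all(to_original(to_internal(i, a), a) == i for i in range(STRIDE * a))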

109

What if we have a Cluster

[Figure: the graph and PSW working memory live on disk; RAM holds the computational state of several concurrent computations (Computation 1 state, Computation 2 state, …)]

Trade latency for throughput.

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications
2. … which is caused by lack of good data available to academics: big graphs with metadata
   – Industry co-operation? But problem with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just the structure)
3. Too much focus on performance. More important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

  parfor walk in walks:
      for i = 1 to numsteps:
          vertex = walk.atVertex()
          walk.takeStep(vertex.randomNeighbor())
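A direct Python rendering of that loop (illustrative only; adj and the walk bookkeeping are made-up stand-ins): note that every step reads an arbitrary vertex's neighbor list, which is cheap in RAM but turns into random disk access once the graph no longer fits in memory.

    import random

    def simulate_walks(adj, sources, numsteps):
        """Advance each walk independently, one step at a time."""
        positions = list(sources)
        for w, v in enumerate(positions):
            for _ in range(numsteps):
                neighbors = adj[v]          # random access into the graph
                if not neighbors:
                    break
                v = random.choice(neighbors)
            positions[w] = v
        return positions

    adj = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}
    print(simulate_walks(adj, sources=[0, 1, 2], numsteps=5))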

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java, 2009

Distributed graph systems: each hop across a partition boundary is costly.

Disk-based "single-machine" graph systems: "paging" from disk is costly. (This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
  – Reverse thinking:

  ForEach interval p:
      walkSnapshot = getWalksForInterval(p)
      ForEach vertex in interval(p):
          mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
          ForEach walk in mywalks:
              walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk.
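A runnable sketch of the same reversal in plain Python (illustrative only, not the GraphChi/DrunkardMob implementation; the interval assignment and walk bookkeeping are simplified assumptions): walks are grouped by the interval holding their current vertex, so each interval's sub-graph needs to be loaded only once per sweep.

    import random
    from collections import defaultdict

    def drunkardmob_sweeps(adj, start_vertices, numsteps, num_intervals):
        """Advance every walk one hop per sweep, processing walks interval by interval."""
        interval_of = lambda v: v % num_intervals       # simplified interval assignment
        position = {w: v for w, v in enumerate(start_vertices)}   # walk id -> vertex
        for _ in range(numsteps):
            walks_in = defaultdict(list)                # interval -> walks currently there
            for w, v in position.items():
                walks_in[interval_of(v)].append(w)
            for p in range(num_intervals):              # "load" interval p only once
                for w in walks_in[p]:
                    neighbors = adj[position[w]]
                    if neighbors:
                        position[w] = random.choice(neighbors)    # addHop
        return position

    adj = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}
    print(drunkardmob_sweeps(adj, start_vertices=[0, 0, 1, 3], numsteps=10, num_intervals=2))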

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 86: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

87

ADDITIONAL SLIDES

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 87: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

88

Economics

GraphChi (40 Mac Minis) PowerGraph (64 EC2 cc14xlarge)

Investments 67320 $ -

Operating costs

Per node hour 003 $ 130 $

Cluster hour 119 $ 5200 $

Daily 2856 $ 124800 $

It takes about 56 days to recoup Mac Mini investments

Assumptions- Mac Mini 85W (typical servers 500-1000W)- Most expensive US energy 35c KwH

Equal throughput configurations (based on OSDIrsquo12)

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 88: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

89

PSW for In-memory Computation

bull External memory settingndash Slow memory =

hard disk SSDndash Fast memory =

RAM

bull In-memoryndash Slow = RAMndash Fast = CPU

caches

Small (fast) Memorysize M

Large (slow) Memorysize lsquounlimitedrsquo

B

CPU

block transfers

Does PSW help in the in-memory setting

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 89: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

90

PSW for in-memory

Appendix

PSW outperforms (slightly) Ligra the fastest in-memory graph computation system on Pagerank with the Twitter graph including preprocessing steps of both systems

Julian Shun Guy Blelloch PPoPP 2013

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 90: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

91

Remarks (sync vs async)

bull Bulk-Synchronous is embarrassingly parallelndash But needs twice the amount of space

bull AsyncG-S helps with high diameter graphsbull Some algorithms converge much better

asynchronouslyndash Loopy BP see Gonzalez et al (2009)ndash Also Bertsekas amp Tsitsiklis Parallel and Distributed

Optimization (1989)

bull Asynchronous sometimes difficult to reason about and debug

bull Asynchronous can be used to implement BSP

IO Complexity

bull See the paper for theoretical analysis in the Aggarwal-Vitterrsquos IO modelndashWorst-case only 2x best-case

bull Intuition

shard(1)

interval(1) interval(2) interval(P)

shard(2) shard(P)

1 |V|v1 v2

Inter-interval edge is loaded from disk only once iteration

Edge spanning intervals is loaded twice iteration

93

Impact of Graph Structure

bull Algos with long range information propagation need relatively small diameter would require too many iterations

bull Per iteration cost not much affected

bull Can we optimize partitioningndash Could help thanks to Gauss-Seidel (faster

convergence inside ldquogroupsrdquo) topological sortndash Likely too expensive to do on single PC

94

Graph Compression Would it help

bull Graph Compression methods (eg Blelloch et al WebGraph Framework) can be used to compress edges to 3-4 bits edge (web) ~ 10 bits edge (social)ndash But require graph partitioning requires a lot of

memoryndash Compression of large graphs can take days (personal

communication)

bull Compression problematic for evolving graphs and associated data

bull GraphChi can be used to compress graphsndash Layered label propagation (Boldi et al 2011)

95

Previous research on (single computer) Graph Databases

bull 1990s 2000s saw interest in object-oriented and graph databasesndash GOOD GraphDB HyperGraphDBhellipndash Focus was on modeling graph storage on top of

relational DB or key-value store

bull RDF databasesndash Most do not use graph storage but store triples as

relations + use indexing

bull Modern solutions have proposed graph-specific storagendash Neo4j doubly linked listndash TurboGraph adjacency list chopped into pagesndash DEX compressed bitmaps (details not clear)

96

LinkBench

Comparison to FB (cont)

bull GraphChi load time 9 hours FBrsquos 12 hoursbull GraphChi database about 250 GB FB gt 14

terabytesndash However about 100 GB explained by different

variable data (payload) size

bull FacebookMySQL via JDBC GraphChi embeddedndash But MySQL native code GraphChi-DB Scala (JVM)

bull Important CPU bound bottleneck in sorting the results for high-degree vertices

LinkBench GraphChi-DB performance Size of DB

100E+07 100E+08 200E+08 100E+090

100020003000400050006000700080009000

10000

Requests sec

Number of nodes

Why not even faster when everything in RAM computationally bound requests

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

[Chart: seconds / iteration (0–3000) for Pagerank and Connected Components with 1, 2, and 3 hard drives]
Experiment on a 16-core AMD server (from year 2007).

Bottlenecks

[Chart: runtime (0–2500 s) split into Disk IO, Graph construction, and Exec. updates for 1, 2, and 4 threads – Connected Components on Mac Mini, SSD]
• Cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD
  – Graph construction requires a lot of random access in RAM; memory bandwidth becomes a bottleneck

Bottlenecks: Multicore

Experiment on MacBook Pro with 4 cores / SSD

• Computationally intensive applications benefit substantially from parallel execution

• GraphChi saturates SSD I/O with 2 threads

[Charts: runtime in seconds, split into Loading and Computation, vs. number of threads (1, 2, 4) – Matrix Factorization (ALS), y-axis 0–160 s, and Connected Components, y-axis 0–1000 s]

104

In-memory vs Disk

105

Experiment: Query latency

See thesis for I/O cost analysis of in/out queries.

106

Example: Induced Subgraph Queries

• Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

• Very fast in GraphChi-DB
  – Sufficient to query for out-edges (see the sketch below)
  – Parallelizes well: multi-out-edge query

• Can be used for statistical graph analysis
  – Sample induced neighborhoods, induced FoF neighborhoods from the graph
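
A minimal sketch of why out-edge queries suffice (the GraphDb trait and outEdges call are hypothetical stand-ins, not the actual GraphChi-DB API): every edge (u, v) of the induced subgraph appears in the out-edge list of some u in S, so one out-edge query per vertex of S plus a membership filter is enough.

    trait GraphDb { def outEdges(src: Long): Seq[Long] }

    object InducedSubgraph {
      // One out-edge query per vertex in S; the queries are independent,
      // which is why a multi-out-edge query parallelizes well.
      def query(db: GraphDb, s: Set[Long]): Seq[(Long, Long)] =
        for {
          src <- s.toSeq
          dst <- db.outEdges(src)
          if s.contains(dst) // keep only edges whose other endpoint is also in S
        } yield (src, dst)
    }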

Vertices / Nodes
• Vertices are partitioned similarly as edges
  – Similar "data shards" for columns
• Lookup/update of vertex data is O(1) (see the sketch below)
• No merge tree here: vertex files are "dense"
  – Sparse structure could be supported
[Diagram: vertex ID range split into intervals 1 … a, a+1 … 2a, …, up to Max-id]
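
A rough illustration of why dense, fixed-width vertex files give O(1) lookup and update (a sketch under an assumed record layout, not GraphChi-DB's actual vertex-file code): the record offset is simply the internal ID times the record size.

    import java.io.RandomAccessFile

    class DenseVertexFile(path: String, recordBytes: Int) {
      private val file = new RandomAccessFile(path, "rw")

      def read(internalId: Long): Array[Byte] = {
        val buf = new Array[Byte](recordBytes)
        file.seek(internalId * recordBytes) // O(1): no index or merge tree needed
        file.readFully(buf)
        buf
      }

      def write(internalId: Long, value: Array[Byte]): Unit = {
        file.seek(internalId * recordBytes)
        file.write(value, 0, recordBytes)
      }
    }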

ID-mapping

• Vertex IDs are mapped to internal IDs to balance shards (one possible scheme is sketched below)
  – Interval length constant a
[Diagram: original IDs 0, 1, 2, …, 255, then 256, 257, 258, …, 511, … are spread across internal ID intervals 0, 1, 2, …; a, a+1, a+2, …; 2a, 2a+1, …]
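
The slide does not spell out the exact mapping, so the following is only a hedged sketch of one scheme with the stated property: consecutive original IDs are striped round-robin across intervals of constant length a, so every interval (and hence every shard) receives roughly the same number of vertices. GraphChi-DB's real constants and layout may differ.

    class IdMapper(numIntervals: Int, a: Long) {
      def toInternal(originalId: Long): Long = {
        val interval = originalId % numIntervals // spread consecutive IDs over the intervals
        val slot     = originalId / numIntervals // position inside that interval
        interval * a + slot
      }
      def toOriginal(internalId: Long): Long = {
        val interval = internalId / a
        val slot     = internalId % a
        slot * numIntervals + interval
      }
    }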

109

What if we have a Cluster

[Diagram: the graph and PSW working memory on disk; the computational state of several computations (Computation 1 state, Computation 2 state, …) in RAM]
Trade latency for throughput.

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications
2. … which is caused by a lack of good data available to academics: big graphs with metadata
   – Industry co-operation? But problems with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just structure)
3. Too much focus on performance. More important to enable "extracting value"

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

  parfor walk in walks:
      for i = 1 to numsteps:
          vertex = walk.atVertex()
          walk.takeStep(vertex.randomNeighbor())
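
A runnable version of the loop above on a toy in-memory graph (illustrative sketch only; the parallel parfor is dropped for brevity):

    import scala.util.Random

    object InMemoryWalks {
      def main(args: Array[String]): Unit = {
        // Toy graph as adjacency lists: vertex -> out-neighbors
        val graph = Array(Array(1, 2), Array(2), Array(0, 1))
        val rnd = new Random(42)
        val numSteps = 5
        // Six walks; only the current position of each walk is kept
        val walks = Array.tabulate(6)(w => w % graph.length)

        for (w <- walks.indices; _ <- 1 to numSteps) {
          val vertex = walks(w)                     // walk.atVertex()
          val nbrs = graph(vertex)
          walks(w) = nbrs(rnd.nextInt(nbrs.length)) // walk.takeStep(vertex.randomNeighbor())
        }
        println(walks.mkString("final positions: ", ", ", ""))
      }
    }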

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

Twitter network visualization by Akshay Java 2009

Distributed graph systems:
  – Each hop across a partition boundary is costly
Disk-based "single-machine" graph systems:
  – "Paging" from disk is costly
(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm
  – Reverse thinking:

  ForEach interval p:
      walkSnapshot = getWalksForInterval(p)
      ForEach vertex in interval(p):
          mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
          ForEach walk in mywalks:
              walkManager.addHop(walk, vertex.randomNeighbor())

Note: need to store only the current position of each walk.
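
A self-contained sketch of the same idea with an in-memory stand-in for the graph: walks are bucketed by the interval of their current vertex, and one pass over an interval advances every walk sitting in it by one hop. The names (buckets, addHop, intervalLen) are illustrative, not the actual DrunkardMob implementation.

    import scala.util.Random
    import scala.collection.mutable.ArrayBuffer

    object DrunkardMobSketch {
      def main(args: Array[String]): Unit = {
        val graph = Array(Array(1, 2), Array(2, 3), Array(3), Array(0)) // toy adjacency lists
        val intervalLen = 2                          // intervals: vertices [0,1] and [2,3]
        val numIntervals = graph.length / intervalLen
        val rnd = new Random(7)

        // "WalkManager": walks bucketed by the interval of their current vertex
        val buckets = Array.fill(numIntervals)(ArrayBuffer[Int]())
        def addHop(v: Int): Unit = buckets(v / intervalLen) += v

        (0 until 6).foreach(w => addHop(w % graph.length)) // start 6 walks

        for (_ <- 1 to 3; p <- 0 until numIntervals) {     // sweep the intervals a few times
          val walkSnapshot = buckets(p).toArray            // getWalksForInterval(p)
          buckets(p).clear()
          for (vertex <- walkSnapshot)                     // every walk currently in interval p
            addHop(graph(vertex)(rnd.nextInt(graph(vertex).length))) // one random hop
        }
        println(buckets.map(_.mkString("[", ",", "]")).mkString(" "))
      }
    }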


500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 98: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

Possible Solutions

1 Use SSD as a memory-extension

[SSDAlloc NSDIrsquo11]

2 Compress the graph structure to fit into RAM[ WebGraph framework]

3 Cluster the graph and handle each cluster separately in RAM

4 Caching of hot nodes

Too many small objects need millions sec

Associated values do not compress well and are mutated

Expensive The number of inter-cluster edges is big

Unpredictable performance

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 99: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

Number of Shards

bull If P is in the ldquodozensrdquo there is not much effect on performance

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 100: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

Multiple hard-drives (RAIDish)

bull GraphChi supports striping shards to multiple disks Parallel IO

Pagerank Conn components0

500

1000

1500

2000

2500

3000

1 hard drive 2 hard drives 3 hard drives

Seco

nds

iter

ation

Experiment on a 16-core AMD server (from year 2007)

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 101: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

Bottlenecks

1 thread 2 threads 4 threads0

500

1000

1500

2000

2500

Disk IO Graph constructionExec updates

Connected Components on Mac Mini SSD

bull Cost of constructing the sub-graph in memory is almost as large as the IO cost on an SSDndash Graph construction requires a lot of random access in RAM memory

bandwidth becomes a bottleneck

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 102: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

Bottlenecks Multicore

Experiment on MacBook Pro with 4 cores SSD

bull Computationally intensive applications benefit substantially from parallel execution

bull GraphChi saturates SSD IO with 2 threads

1 2 40

20406080

100120140160

Matrix Factorization (ALS)

Loading Computation

Number of threads

Runti

me

(sec

onds

)1 2 4

0100200300400500600700800900

1000

Connected Components

Loading Computation

Number of threads

Runti

me

(sec

onds

)

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 103: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

104

In-memory vs Disk

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 104: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

105

Experiment Query latency

See thesis for IO cost analysis of inout queries

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster

Graph working memory (PSW) disk

computational state

Computation 1 stateComputation 2 state

RAM

Trade latency for throughput

110

Graph Computation Research Challenges

1 Lack of truly challenging (benchmark) applications

2 hellip which is caused by lack of good data available for the academics big graphs with metadatandash Industry co-operation But problem with

reproducibilityndash Also it is hard to ask good questions about

graphs (especially with just structure)

3 Too much focus on performance More important to enable ldquoextracting valuerdquo

DrunkardMob - RecSys 13

Random walk in an in-memory graph

bull Compute one walk a time (multiple in parallel of course)parfor walk in walks

for i=1 to numsteps vertex = walkatVertex() walktakeStep(vertexrandomNeighbor())

DrunkardMob - RecSys 13

Problem What if Graph does not fit in memory

Twitter network visualization by Akshay Java 2009

Distributed graph systems- Each hop across partition boundary is costly

Disk-based ldquosingle-machinerdquo graph systems- ldquoPagingrdquo from disk

is costly

(This talk)

DrunkardMob - RecSys 13

Random walks in GraphChi

bull DrunkardMob ndashalgorithmndash Reverse thinking

ForEach interval p walkSnapshot = getWalksForInterval(p) ForEach vertex in interval(p) mywalks = walkSnapshotgetWalksAtVertex(vertexid) ForEach walk in mywalks

walkManageraddHop(walk vertexrandomNeighbor())

Note Need to store only current position of each walk

  • Thesis Defense Large-Scale Graph Computation on Just a PC
  • Research Fields
  • Why Graphs
  • BigData with Structure BigGraph
  • Why on a single machine
  • Why use a cluster
  • Why use a cluster (2)
  • Benefits of single machine systems
  • Computing on Big Graphs
  • Big Graphs = Big Data
  • Research Goal
  • Terminology
  • Slide 13
  • Disk-based graph computation
  • Slide 15
  • Computational Model
  • Vertex-centric Programming
  • Computational Setting
  • The Main Challenge of Disk-based Graph Computation
  • Random Access Problem
  • Our Solution
  • Parallel Sliding Windows Phases
  • PSW Shards and Intervals
  • Example Layout
  • PSW Loading Sub-graph
  • PSW Loading Sub-graph (2)
  • Parallel Sliding Windows
  • ldquogauss-Seidelrdquo Asynchronous
  • Synchronous vs Gauss-Seidel
  • PSW runs Gauss-Seidel
  • Synchronous (Jacobi)
  • PSW is Asynchronous (Gauss-Seidel)
  • Label Propagation
  • PSW amp External Memory Algorithms Research
  • GraphChi system evaluation
  • GraphChi
  • Experiment Setting
  • Comparison to Existing Systems
  • PowerGraph Comparison
  • In-memory Comparison
  • Scalability Input Size [SSD]
  • Graphchi-DB
  • Slide 44
  • Research Questions
  • Existing Graph Database Solutions
  • Partitioned adjacency lists (PAL) data structure
  • Review Edges in Shards
  • Shard Structure (Basic)
  • Shard Structure (Basic) (2)
  • PAL In-edge Linkage
  • PAL In-edge Linkage (2)
  • PAL Out-edge Queries
  • Experiment Indices
  • Queries IO costs
  • Edge Data amp Searches
  • Efficient Ingest
  • Merging Buffers to Disk
  • Merging Buffers to Disk (2)
  • Log-Structured Merge-tree (LSM) Ref OrsquoNeil Cheng et al (19
  • Experiment Ingest
  • Advantages of PAL
  • Experimental comparisons
  • GraphChi-DB Implementation
  • Comparison Database Size
  • Comparison Ingest
  • Comparison Friends-of-Friends Query
  • LinkBench Online Graph DB Benchmark by Facebook
  • Summary of Experiments
  • Discussion
  • Greater impact
  • Impact ldquoBig Datardquo Research
  • Impact Users
  • Impact Users (cont)
  • Evaluation hindsight
  • What is GraphChi Optimized for
  • GraphChi Not Good For
  • Versatility of PSW and PAL
  • Future Research Directions
  • What if we have plenty of memory
  • Observations
  • Using RAM efficiently
  • Parallel Computation Examples
  • conclusion
  • Summary of Published Work
  • Summary of Main Contributions
  • Additional slides
  • Economics
  • PSW for In-memory Computation
  • PSW for in-memory
  • Remarks (sync vs async)
  • IO Complexity
  • Impact of Graph Structure
  • Graph Compression Would it help
  • Previous research on (single computer) Graph Databases
  • LinkBench
  • Comparison to FB (cont)
  • LinkBench GraphChi-DB performance Size of DB
  • Possible Solutions
  • Number of Shards
  • Multiple hard-drives (RAIDish)
  • Bottlenecks
  • Bottlenecks Multicore
  • In-memory vs Disk
  • Experiment Query latency
  • Example Induced Subgraph Queries
  • Vertices Nodes
  • ID-mapping
  • What if we have a Cluster
  • Graph Computation Research Challenges
  • Random walk in an in-memory graph
  • Problem What if Graph does not fit in memory
  • Random walks in GraphChi
Page 105: Thesis Defense Large-Scale Graph Computation on Just a PC Aapo Kyrölä akyrola@cs.cmu.edu Carlos Guestrin University of Washington & CMU Guy Blelloch CMU.

106

Example Induced Subgraph Queries

bull Induced subgraph for vertex set S contains all edges in the graph that have both endpoints in S

bull Very fast in GraphChi-DBndash Sufficient to query for out-edgesndash Parallelizes well multi-out-edge-query

bull Can be used for statistical graph analysisndash Sample induced neighborhoods induced FoF

neighborhoods from graph

Vertices Nodesbull Vertices are partitioned similarly as

edgesndash Similar ldquodata shardsrdquo for columns

bull Lookupupdate of vertex data is O(1)bull No merge tree here Vertex files are

ldquodenserdquondash Sparse structure could be supported1

a

a+1

2a Max-id

ID-mapping

bull Vertex IDs mapped to internal IDs to balance shardsndash Interval length constant a

012

aa+1a+2

2a2a+1

hellip

0 1 2 255

Original ID 0 1 2 255

256 257 258 511

109

What if we have a Cluster?

(Figure: graph and working memory (PSW) on disk; computational state – Computation 1 state, Computation 2 state – in RAM)

Trade latency for throughput!

110

Graph Computation Research Challenges

1. Lack of truly challenging (benchmark) applications

2. … which is caused by the lack of good data available to academics: big graphs with metadata
   – Industry co-operation? But problems with reproducibility
   – Also, it is hard to ask good questions about graphs (especially with just structure)

3. Too much focus on performance. More important to enable "extracting value"!

DrunkardMob - RecSys 13

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

    parfor walk in walks:
        for i = 1 to numsteps:
            vertex = walk.atVertex()
            walk.takeStep(vertex.randomNeighbor())
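
For reference, a small runnable version of this baseline, with an adjacency dict standing in for the in-memory graph (names are illustrative):

    import random

    def run_walks(graph, starts, numsteps):
        """Naive baseline: advance one walk at a time for numsteps hops."""
        finals = []
        for start in starts:                 # the outer loop parallelizes across walks
            v = start
            for _ in range(numsteps):
                neighbors = graph.get(v, [])
                if not neighbors:            # dead end: the walk stays put
                    break
                v = random.choice(neighbors)
            finals.append(v)
        return finals

    graph = {0: [1, 2], 1: [3], 2: [3], 3: [0]}
    print(run_walks(graph, starts=[0, 0, 2], numsteps=5))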

DrunkardMob - RecSys 13

Problem: What if the graph does not fit in memory?

• Distributed graph systems: each hop across a partition boundary is costly.
• Disk-based "single-machine" graph systems (this talk): "paging" from disk is costly.

Twitter network visualization by Akshay Java, 2009

DrunkardMob - RecSys 13

Random walks in GraphChi

• DrunkardMob algorithm – reverse thinking:

    ForEach interval p:
        walkSnapshot = getWalksForInterval(p)
        ForEach vertex in interval(p):
            mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
            ForEach walk in mywalks:
                walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk!
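
A self-contained sketch of the same reversal — walks are bucketed by the interval of their current vertex, and each bucket is advanced while that interval's sub-graph is loaded — with the adjacency dict and all names standing in for GraphChi's actual on-disk structures:

    import random
    from collections import defaultdict

    def drunkardmob(graph, intervals, start_vertices, num_iterations):
        """Advance many walks one hop per iteration, one vertex interval at a time.
        Only the current position of each walk is stored."""
        positions = list(start_vertices)                 # walk id -> current vertex
        for _ in range(num_iterations):
            buckets = defaultdict(list)                  # interval -> walks currently there
            for walk_id, v in enumerate(positions):
                for p, (lo, hi) in enumerate(intervals):
                    if lo <= v <= hi:
                        buckets[p].append(walk_id)
                        break
            for p, walk_ids in buckets.items():          # process one interval at a time,
                for walk_id in walk_ids:                 # as PSW loads its sub-graph
                    v = positions[walk_id]
                    neighbors = graph.get(v, [])
                    if neighbors:
                        positions[walk_id] = random.choice(neighbors)   # addHop
        return positions

    graph = {0: [1, 2], 1: [3], 2: [3], 3: [0]}
    print(drunkardmob(graph, intervals=[(0, 1), (2, 3)], start_vertices=[0, 0, 2], num_iterations=4))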
