Efficient Graph Processing with Distributed Immutable View

Rong Chen+, Xin Ding+, Peng Wang+, Haibo Chen+, Binyu Zang+ and Haibing Guan*. Institute of Parallel and Distributed Systems (+), Department of Computer Science (*), Shanghai Jiao Tong University. HPDC 2014.


Transcript of Efficient Graph Processing with Distributed Immutable View

Page 1: Efficient   Graph  Processing  with Distributed Immutable View

Efficient Graph Processing with Distributed Immutable View

Rong Chen+, Xin Ding+, Peng Wang+, Haibo Chen+, Binyu Zang+ and Haibing Guan*

Institute of Parallel and Distributed Systems +

Department of Computer Science *

Shanghai Jiao Tong University

HPDC 2014

Communication / Computation

Page 2: Efficient   Graph  Processing  with Distributed Immutable View

100 hours of video every minute

1.11 billion users

6 billion photos

400 million tweets/day

How do we understand and use Big Data?

Big Data Everywhere

Page 3: Efficient   Graph  Processing  with Distributed Immutable View

100 hours of video every minute

1.11 billion users

6 billion photos

400 million tweets/day

NLP

Big Data Big Learning

Machine Learning and Data Mining

Page 4: Efficient   Graph  Processing  with Distributed Immutable View

It’s about the graphs ...

Page 5: Efficient   Graph  Processing  with Distributed Immutable View

Example: PageRank

A centrality analysis algorithm to measure the relative rank for each element of a linked set

Characteristics
□ Linked set → data dependence
□ Rank of who links it → local accesses
□ Convergence → iterative computation

$R_i = \alpha + (1 - \alpha) \sum_{(j,i) \in E} \omega_{ij} R_j$

[Figure: example graph with five vertices (1 to 5).]

Page 6: Efficient   Graph  Processing  with Distributed Immutable View

Existing Graph-parallel Systems

“Think as a vertex” philosophy
1. aggregate value of neighbors
2. update its own value
3. activate neighbors

compute(v) for PageRank:

    double sum = 0
    double value, last = v.get()
    foreach (n in v.in_nbrs)
        sum += n.value / n.nedges;
    value = 0.15 + 0.85 * sum;
    v.set(value);
    activate(v.out_nbrs);

[Figure: example graph with five vertices; the three steps are marked 1, 2, 3.]
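The vertex program on this slide can be sketched as runnable Python. This is a minimal illustration, not the Cyclops or Hama API: the `Vertex` class, `link`, `compute` and `pagerank` are hypothetical names, `n.nedges` is assumed to mean the out-degree of `n`, and the convergence threshold `eps` (activate neighbors only when the value changed) is an assumption added so that the loop terminates.

```python
class Vertex:
    """Hypothetical vertex record: id, current value, in- and out-neighbors."""
    def __init__(self, vid, value=1.0):
        self.vid = vid
        self.value = value
        self.in_nbrs = []    # vertices that link to this one
        self.out_nbrs = []   # vertices this one links to


def link(src, dst):
    """Add a directed edge src -> dst."""
    src.out_nbrs.append(dst)
    dst.in_nbrs.append(src)


def compute(v, values, active_next, alpha=0.15, eps=1e-4):
    """1. aggregate in-neighbors, 2. compute own value, 3. activate out-neighbors."""
    total = sum(values[n.vid] / len(n.out_nbrs) for n in v.in_nbrs)
    new_value = alpha + (1.0 - alpha) * total
    if abs(new_value - values[v.vid]) > eps:   # assumed convergence check
        active_next.update(n.vid for n in v.out_nbrs)
    return new_value


def pagerank(vertices, max_iters=50):
    """Synchronous (BSP-like) driver: all reads see the previous iteration's values."""
    by_id = {v.vid: v for v in vertices}
    values = {v.vid: v.value for v in vertices}
    active = set(by_id)
    for _ in range(max_iters):
        if not active:
            break
        active_next = set()
        new_values = dict(values)
        for vid in sorted(active):
            new_values[vid] = compute(by_id[vid], values, active_next)
        values, active = new_values, active_next
    return values
```

Only active vertices are recomputed in each iteration, which is the dynamic computation the later slides refer to.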

Page 7: Efficient   Graph  Processing  with Distributed Immutable View

Existing Graph-parallel Systems

“Think as a vertex” philosophy
1. aggregate value of neighbors
2. update its own value
3. activate neighbors

Execution Engine
□ sync: BSP-like model
□ async: dist. sched_queues

Communication
□ message passing: push value
□ dist. shared memory: sync & pull value

[Figure: computation (comp.) and communication (comm.) phases separated by a barrier; panels labeled push, pull and sync.]

Page 8: Efficient   Graph  Processing  with Distributed Immutable View

Issues of Existing Systems

Pregel [SIGMOD’09]
→ Sync engine
→ Edge-cut + Message Passing
Issues: w/o dynamic comp., high contention

GraphLab [VLDB’12]

PowerGraph [OSDI’12]

[Figure: message passing between vertices; labels: keep alive, x1; legend: master, replica, msg.]

Page 9: Efficient   Graph  Processing  with Distributed Immutable View

Issues of Existing Systems

Pregel [SIGMOD’09]
→ Sync engine
→ Edge-cut + Message Passing
Issues: w/o dynamic comp., high contention

GraphLab [VLDB’12]
→ Async engine
→ Edge-cut + DSM (replicas)
Issues: high contention, hard to program, duplicated edges, heavy comm. cost

PowerGraph [OSDI’12]

[Figure: message passing (keep alive, x1) and replica synchronization (replica, dup, x2); legend: master, replica, msg.]

Page 10: Efficient   Graph  Processing  with Distributed Immutable View

Issues of Existing Systems

Pregel [SIGMOD’09]
→ Sync engine
→ Edge-cut + Message Passing
Issues: w/o dynamic comp., high contention

GraphLab [VLDB’12]
→ Async engine
→ Edge-cut + DSM (replicas)
Issues: high contention, hard to program, duplicated edges, heavy comm. cost

PowerGraph [OSDI’12]
→ (A)Sync engine
→ Vertex-cut + GAS (replicas)
Issues: high contention, heavy comm. cost

[Figure: message passing (keep alive, x1), replica synchronization (replica, dup, x2) and vertex-cut replication (x5); legend: master, replica, msg.]

Page 11: Efficient   Graph  Processing  with Distributed Immutable View

Contributions

Distributed Immutable View
□ Easy to program/debug
□ Support dynamic computation
□ Minimized communication cost (x1 /replica)
□ Contention (comp. & comm.) immunity

Multicore-based Cluster Support
□ Hierarchical sync. & deterministic execution
□ Improve parallelism and locality

Page 12: Efficient   Graph  Processing  with Distributed Immutable View

Outline

Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow

Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement

Evaluation

Page 13: Efficient   Graph  Processing  with Distributed Immutable View

General Idea

Observation: for most graph algorithms, a vertex only aggregates its neighbors’ data in one direction and activates neighbors in the other direction
□ e.g. PageRank, SSSP, Community Detection, …

Local aggregation/update & distributed activation
□ Partitioning: avoid duplicate edges
□ Computation: one-way local semantics
□ Communication: merge update & activate messages
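To illustrate the observation with a second algorithm, the following sketch runs single-source shortest path (SSSP) in the same one-way style: each vertex aggregates over its in-edges only and activates along its out-edges only. The function names and the edge-list input format are illustrative assumptions, and non-negative edge weights are assumed.

```python
import math


def build(edges):
    """edges: list of (u, v, w) for a directed edge u -> v with weight w."""
    in_edges, out_nbrs = {}, {}
    for u, v, w in edges:
        for x in (u, v):
            in_edges.setdefault(x, [])
            out_nbrs.setdefault(x, [])
        in_edges[v].append((u, w))   # used for aggregation
        out_nbrs[u].append(v)        # used for activation
    return in_edges, out_nbrs


def sssp(edges, source):
    """Synchronous SSSP: aggregate along in-edges, activate along out-edges."""
    in_edges, out_nbrs = build(edges)
    dist = {v: math.inf for v in in_edges}
    dist[source] = 0.0
    active = set(out_nbrs[source])
    while active:
        new_dist = dict(dist)
        active_next = set()
        for v in active:
            # Aggregate: read-only access to in-neighbors' previous values.
            best = min((dist[u] + w for u, w in in_edges[v]), default=math.inf)
            if best < dist[v]:
                new_dist[v] = best
                active_next.update(out_nbrs[v])   # activate in the other direction
        dist, active = new_dist, active_next
    return dist
```

For the edges 1→2 (weight 1), 2→3 (weight 2) and 1→3 (weight 5), vertex 3 first receives distance 5 and is then lowered to 3 after vertex 2 activates it.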

Page 14: Efficient   Graph  Processing  with Distributed Immutable View

Graph Organization

Partition the graph and build local sub-graphs
□ Normal edge-cut: randomized (e.g., hash-based) or heuristic (e.g., Metis)
□ Only create edges in one direction (e.g., in-edges)
→ Avoid duplicated edges
□ Create read-only replicas for edges spanning machines

[Figure: the example graph partitioned across machines M1, M2 and M3; legend: master, replica.]
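The partitioning steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Cyclops implementation: vertices are assigned by `vid % num_machines` as a stand-in for hash-based edge-cut, only in-edges are stored, and the names `build_subgraphs`, `parts` and `rlist` are hypothetical (`rlist` borrows the label shown on the later communication slide).

```python
def build_subgraphs(vertices, edges, num_machines):
    """Return (parts, rlist).

    parts[m] holds the masters, read-only replicas and in-edges of machine m.
    rlist[v] is the set of machines that hold a replica of master v.
    """
    def owner(v):
        return v % num_machines   # assumed hash-based edge-cut

    parts = [{"masters": set(), "replicas": set(), "in_edges": {}}
             for _ in range(num_machines)]
    rlist = {v: set() for v in vertices}

    for v in vertices:
        part = parts[owner(v)]
        part["masters"].add(v)
        part["in_edges"][v] = []

    for src, dst in edges:
        part = parts[owner(dst)]
        # Each edge is stored once, as an in-edge on the machine owning dst.
        part["in_edges"][dst].append(src)
        if owner(src) != owner(dst):
            # The edge spans machines: keep a read-only replica of src here.
            part["replicas"].add(src)
            rlist[src].add(owner(dst))
    return parts, rlist
```

Because every edge lives only on the machine that owns its destination, the total number of stored edges equals the number of edges in the input graph.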

Page 15: Efficient   Graph  Processing  with Distributed Immutable View

Vertex Computation

Local aggregation/update
□ Support dynamic computation
→ One-way local semantics
□ Immutable view: read-only access to neighbors
→ Eliminate contention on vertex

[Figure: the partitioned graph on M1, M2 and M3; accesses to neighbors are marked read-only.]

Page 16: Efficient   Graph  Processing  with Distributed Immutable View

Communication

Sync. & Distributed Activation
□ Merge update & activate messages
1. Update value of replicas
2. Invite replicas to activate neighbors

[Figure: the partitioned graph on M1, M2 and M3. Vertex labels: "rlist: W1, l-act: 1, value: 8, msg: 4" and "l-act: 3, value: 6, msg: 3". Message format: msg: v|m|s, e.g. 8 4 0. Other labels: active, s.]
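A minimal sketch of the merged update-and-activate message follows. The message layout used here, `(vid, value, activate)`, is a hypothetical simplification and not the slide's `v|m|s` format, whose exact field semantics the slide does not spell out. `Machine`, `SyncMsg` and `sync_master` are illustrative names.

```python
from collections import namedtuple

# Hypothetical message: one per replica, carrying both the update and the activation.
SyncMsg = namedtuple("SyncMsg", ["vid", "value", "activate"])


class Machine:
    def __init__(self):
        self.replica_value = {}   # replica vid -> value (read-only for local compute)
        self.local_out = {}       # replica vid -> local masters it points to
        self.active = set()       # local masters activated for the next iteration

    def receive(self, msg):
        # 1. Update the value of the replica.
        self.replica_value[msg.vid] = msg.value
        # 2. Let the replica activate its local out-neighbors.
        if msg.activate:
            self.active.update(self.local_out.get(msg.vid, ()))


def sync_master(vid, value, changed, rlist, machines):
    """Send exactly one merged message to each machine holding a replica of vid."""
    sent = 0
    for mid in rlist:
        machines[mid].receive(SyncMsg(vid, value, changed))
        sent += 1
    return sent
```

The return value makes the "x1 /replica" cost visible: the number of messages equals the number of replicas, regardless of how many neighbors get activated on the remote side.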

Page 17: Efficient   Graph  Processing  with Distributed Immutable View

Communication

Distributed Activation
□ Unidirectional message passing
→ Replica will never be activated
→ Messages always flow from master to replicas
→ Contention immunity

[Figure: the partitioned graph on M1, M2 and M3.]

Page 18: Efficient   Graph  Processing  with Distributed Immutable View

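Change of Execution Flow

Original Execution Flow (e.g. Pregel)

[Figure: on M1, threads run the computation, sending, receiving and parsing stages; messages leave through out-queues (to M2 and M3) and arrive in in-queues, where they are parsed before reaching vertices. Annotations: high overhead, high contention. Legend: thread, vertex, message.]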

Page 19: Efficient   Graph  Processing  with Distributed Immutable View

Change of Execution Flow

Execution Flow on Distributed Immutable View

[Figure: on M1, threads run the computation, sending and receiving stages; messages leave through out-queues (to M2 and M3) and received messages update replicas directly. Annotations: lock-free (receiving), low overhead, no contention. Legend: thread, master, replica.]

Page 20: Efficient   Graph  Processing  with Distributed Immutable View

Outline

Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow

Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement

Evaluation

Page 21: Efficient   Graph  Processing  with Distributed Immutable View

Multicore Support

Two Challenges
1. Two-level hierarchical organization
→ Preserve synchronous and deterministic computation nature (easy to program/debug)
2. Original BSP-like model is hard to parallelize
→ High contention to buffer and parse messages
→ Poor locality in message parsing

Page 22: Efficient   Graph  Processing  with Distributed Immutable View

Hierarchical Model

Design Principle
□ Three levels: iteration → worker → thread
□ Only the last-level participants perform actual tasks
□ Parents (i.e. higher-level participants) just wait until all children finish their tasks

[Figure: a loop of tasks across Level-0 (iteration), Level-1 (worker) and Level-2 (thread), with a global barrier and a local barrier.]
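The hierarchical barrier structure can be sketched with Python threads. This is an illustrative model only: workers are simulated as groups of threads inside one process, the function name `run_hierarchical` is hypothetical, and letting one thread per worker stand in for its parent at the global barrier is an implementation choice of this sketch rather than something the slide specifies.

```python
import threading


def run_hierarchical(num_workers, threads_per_worker, iterations):
    """Run `iterations` supersteps; return a log of (iteration, worker, thread)."""
    global_barrier = threading.Barrier(num_workers)   # one party per worker
    log, log_lock = [], threading.Lock()

    def thread_body(worker_id, thread_id, local_barrier):
        for it in range(iterations):
            # Only last-level participants (threads) perform actual tasks.
            with log_lock:
                log.append((it, worker_id, thread_id))
            # Local barrier: wait until all sibling threads finished their tasks.
            if local_barrier.wait() == 0:
                # Exactly one thread per worker represents it at the global barrier.
                global_barrier.wait()
            # Siblings resume only after the global barrier has been passed.
            local_barrier.wait()

    threads = []
    for w in range(num_workers):
        local_barrier = threading.Barrier(threads_per_worker)
        for t in range(threads_per_worker):
            threads.append(threading.Thread(
                target=thread_body, args=(w, t, local_barrier)))
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return log
```

No task of iteration i+1 can start before every task of iteration i has finished, which preserves the synchronous structure of the computation across both levels.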

Page 23: Efficient   Graph  Processing  with Distributed Immutable View

Parallelism Improvement

Original BSP-like model is hard to parallelize

[Figure: execution flow on M1 with computation, sending, receiving and parsing stages; out-queues (to M2 and M3) and in-queues. Legend: thread, vertex, message.]

Page 24: Efficient   Graph  Processing  with Distributed Immutable View

Parallelism Improvement

Original BSP-like model is hard to parallelize

[Figure: execution flow on M1 with computation, sending, receiving and parsing stages; private (priv.) out-queues and in-queues. Annotations: high contention, poor locality. Legend: thread, vertex, message.]

Page 25: Efficient   Graph  Processing  with Distributed Immutable View

Parallelism Improvement

Distributed immutable view opens an opportunity

[Figure: execution flow on M1 with computation, sending and receiving stages; out-queues to M2 and M3. Legend: thread, master, replica.]

Page 26: Efficient   Graph  Processing  with Distributed Immutable View

Parallelism Improvement

Distributed immutable view opens an opportunity

[Figure: execution flow on M1 with computation, sending and receiving stages; private (priv.) out-queues to M2 and M3. Annotations: lock-free, poor locality. Legend: thread, master, replica.]

Page 27: Efficient   Graph  Processing  with Distributed Immutable View

Parallelism Improvement

Distributed immutable view opens an opportunity

[Figure: execution flow on M1 with computation, sending and receiving stages; private (priv.) out-queues to M2 and M3. Annotations: lock-free, no interference. Legend: thread, master, replica.]

Page 28: Efficient   Graph  Processing  with Distributed Immutable View

Parallelism Improvement

Distributed immutable view opens an opportunity

[Figure: execution flow on M1 with computation, sending and receiving stages; private (priv.) out-queues to M2 and M3. Annotations: lock-free, sorted, good locality. Legend: thread, master, replica.]
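The three annotations on this slide (private out-queues, sorted messages, lock-free receiving) can be sketched as a sequential simulation. Several points are assumptions of this sketch: "sorted" is taken to mean ordering messages by vertex id, receivers are modeled as contiguous chunks of the sorted batch instead of real threads, and all function names are hypothetical.

```python
def compute_phase(thread_vertices, rlist, values):
    """Each compute thread fills its own private out-queues, keyed by destination."""
    private_queues = []
    for vertices in thread_vertices:            # one entry per compute thread
        queues = {}
        for vid in vertices:
            for dest in rlist.get(vid, ()):
                queues.setdefault(dest, []).append((vid, values[vid]))
        private_queues.append(queues)           # never shared with other threads
    return private_queues


def send_phase(private_queues, dest):
    """Merge the private queues for one destination and sort by vertex id."""
    batch = [msg for queues in private_queues for msg in queues.get(dest, [])]
    batch.sort(key=lambda msg: msg[0])
    return batch


def receive_phase(batch, replica_values, num_receivers):
    """Apply a sorted batch; each chunk would be handled by one receiver thread."""
    chunk = max(1, -(-len(batch) // num_receivers))   # ceiling division
    for start in range(0, len(batch), chunk):
        # Each replica has one master, so each vid appears at most once per batch:
        # chunks write disjoint replicas and need no locks.
        for vid, value in batch[start:start + chunk]:
            replica_values[vid] = value
    return replica_values
```

Sorting means each receiver touches a contiguous range of vertex ids, which is one plausible reading of how the slide obtains good locality.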

Page 29: Efficient   Graph  Processing  with Distributed Immutable View

Outline

Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow

Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement

Implementation & Experiment

Page 30: Efficient   Graph  Processing  with Distributed Immutable View

Implementation

Cyclops(MT)
□ Based on Hama (Java & Hadoop)
□ ~2,800 SLOC
□ Provide mostly compatible user interface
□ Graph ingress and partitioning
→ Compatible I/O-interface
→ Add an additional phase to build replicas
□ Fault tolerance
→ Incremental checkpoint
→ Replication-based FT [DSN’14]

Page 31: Efficient   Graph  Processing  with Distributed Immutable View

Experiment Settings

Platform
□ 6 × 12-core AMD Opteron (64 GB RAM, 1 GigE NIC)

Graph Algorithms
□ PageRank (PR), Community Detection (CD), Alternating Least Squares (ALS), Single Source Shortest Path (SSSP)

Workload
□ 7 real-world datasets from SNAP [1]
□ 1 synthetic dataset from GraphLab [2]

Dataset     |V|     |E|
Amazon      0.4M    3.4M
GWeb        0.9M    5.1M
LJournal    4.8M    69M
Wiki        5.7M    130M
SYN-GL      0.1M    2.7M
DBLP        0.3M    1.0M
RoadCA      1.9M    5.5M

[1] http://snap.stanford.edu/data/
[2] http://graphlab.org

Page 32: Efficient   Graph  Processing  with Distributed Immutable View

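Overall Performance Improvement

[Figure: normalized speedup (axis 0 to 10) of Hama, Cyclops and CyclopsMT on Amazon, GWeb, LJournal, Wiki, SYN-GL, DBLP and RoadCA, grouped by algorithm: PageRank, ALS, CD, SSSP (push-mode). Labeled speedups: 8.69X and 2.06X. Configurations: 48 workers; 6 workers (8).]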

Page 33: Efficient   Graph  Processing  with Distributed Immutable View

Performance Scalability

[Figure: normalized speedup (axis 0 to 35, one bar labeled 50.2) of Hama, Cyclops and CyclopsMT with 6, 12, 24 and 48 workers/threads; one panel each for Amazon, GWeb, LJournal, Wiki, SYN-GL, DBLP and RoadCA.]

Page 34: Efficient   Graph  Processing  with Distributed Immutable View

Performance Breakdown

[Figure: ratio of execution time (0.0 to 1.0) spent in PARSE, SEND, COMP and SYNC on Amazon, GWeb, LJournal, Wiki, SYN-GL, DBLP and RoadCA, grouped by algorithm: PageRank, ALS, CD, SSSP.]

[Figure: #Messages (K) per iteration (iterations 0 to 30) and #Vertices (K) per iteration (iterations 0 to 30); series: Hama, Cyclops, CyclopsMT.]

Page 35: Efficient   Graph  Processing  with Distributed Immutable View

Comparison with PowerGraph (C++ & Boost RPC lib.)

[Figure: execution time (sec) and #Messages (M) of CyclopsMT and PowerGraph on Amazon, GWeb, LJournal and Wiki.]

Dataset     COMP%
Amazon      11%
GWeb        15%
LJournal    25%
Wiki        39%

Cyclops-like engine on GraphLab [1] platform

Preliminary Results
[Figure: execution time (sec, axis 0 to 12) on Regular [2] and Natural [2] graphs.]

[1] http://graphlab.org
[2] synthetic 10-million vertex regular (even edge) and power-law (α=2.0) graphs

Page 36: Efficient   Graph  Processing  with Distributed Immutable View

Conclusion

Cyclops: a new synchronous vertex-oriented graph processing system
□ Preserve synchronous and deterministic computation nature (easy to program/debug)
□ Provide efficient vertex computation with significantly fewer messages and contention immunity by distributed immutable view
□ Further support multicore-based cluster with hierarchical processing model and high parallelism

Source Code: http://ipads.se.sjtu.edu.cn/projects/cyclops

Page 37: Efficient   Graph  Processing  with Distributed Immutable View

Questions

Thanks

Cyclops: http://ipads.se.sjtu.edu.cn/projects/cyclops.html

IPADS: Institute of Parallel and Distributed Systems

Page 38: Efficient   Graph  Processing  with Distributed Immutable View

What’s Next?

PowerLyra: differentiated graph computation and partitioning on skewed natural graphs
□ Hybrid engine and partitioning algorithms
□ Outperform PowerGraph by up to 3.26X for natural graphs

http://ipads.se.sjtu.edu.cn/projects/powerlyra.html

Power-law: “most vertices have relatively few neighbors while a few have many neighbors”

Preliminary Results
[Figure: execution time (sec, axis 0 to 16) of PL, PG and Cyclops on R and N graphs; a second diagram is labeled Low and High.]

Page 39: Efficient   Graph  Processing  with Distributed Immutable View

Generality

Algorithms: aggregate/activate all neighbors
□ e.g. Community Detection (CD)
□ Transfer to undirected graph and duplicate edges

[Figure: the example graph and its partitioning across M1, M2 and M3, before and after duplicating edges.]

Page 40: Efficient   Graph  Processing  with Distributed Immutable View

Generality

Algorithms: aggregate/activate all neighbors
□ e.g. Community Detection (CD)
□ Transfer to undirected graph and duplicate edges
□ Still aggregate in one direction (e.g. in-edges) and activate in another direction (e.g. out-edges)
□ Preserve all benefits of Cyclops
→ x1 /replica & contention immunity & good locality

[Figure: the example graph and its partitioning across M1, M2 and M3 with duplicated edges.]
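The conversion described on this slide can be sketched in a few lines. This is an assumed helper, not code from Cyclops: `to_bidirectional` adds the reverse of every edge so that an algorithm which needs all neighbors can still aggregate over in-edges only and activate over out-edges only.

```python
def to_bidirectional(edges):
    """Return a directed edge list holding both (u, v) and (v, u) for each edge."""
    seen = set()
    result = []
    for u, v in edges:
        for edge in ((u, v), (v, u)):
            if edge not in seen:      # skip edges that already exist
                seen.add(edge)
                result.append(edge)
    return result


def in_neighbors(edges, v):
    """In-neighbors of v, i.e. the set a one-way aggregation would read."""
    return {src for src, dst in edges if dst == v}
```

After the conversion, the in-neighbors of a vertex are exactly its neighbors in the undirected graph, so the one-way aggregation sees all of them.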

Page 41: Efficient   Graph  Processing  with Distributed Immutable View

Generality

Difference between Cyclops and GraphLab
1. How to construct local sub-graph
2. How to aggregate/activate neighbors

[Figure: the example graph partitioned across M1, M2 and M3, shown for both systems.]

Page 42: Efficient   Graph  Processing  with Distributed Immutable View

Improvement of CyclopsMT

[Figure: execution time (sec, axis 0 to 30), broken into SEND, COMP and SYNC, for Cyclops and CyclopsMT. Configurations are written MxWxT/R (#[M]achines x #[W]orkers x #[T]hreads / #[R]eceivers):
6x1x1/1, 6x2x1/1, 6x4x1/1, 6x8x1/1;
6x1x1/1, 6x1x2/2, 6x1x4/4, 6x1x8/8;
6x1x8/1, 6x1x8/2, 6x1x8/4, 6x1x8/8.]

Page 43: Efficient   Graph  Processing  with Distributed Immutable View

Communication Efficiency

[Figure: SEND and PARSE execution time (sec, log scale from 0.1 to 1,000) for 5M, 25M and 50M messages of the form (id, data), sent among workers W0 to W5. Labeled values: 25.6X, 16.2X, 12.6X, 55.6%, 25.0%, 31.5%.]

Hama: Hadoop RPC lib (Java)
PowerGraph: Boost RPC lib (C++)
Cyclops: Hadoop RPC lib (Java)

Cost annotations: send + buffer + parse (contention); send + update (contention)

Page 44: Efficient   Graph  Processing  with Distributed Immutable View

Using Heuristic Edge-cut (i.e. Metis)

[Figure: normalized speedup (axis 0 to 25) of Hama, Cyclops and CyclopsMT on Amazon, GWeb, LJournal, Wiki, SYN-GL, DBLP and RoadCA, grouped by algorithm: PageRank, ALS, CD, SSSP. Labeled speedups: 23.04X and 5.95X. Configurations: 48 workers; 6 workers (8).]

Page 45: Efficient   Graph  Processing  with Distributed Immutable View

Memory Consumption

Memory Behavior [1] per Worker (PageRank with Wiki dataset)

Configuration    Max Cap (GB)    Max Usage (GB)    Young GC [2] (#)    Full GC [2] (#)
Hama/48          1.7             1.5               132                 69
Cyclops/48       4.0             3.0               45                  15
CyclopsMT/6x8    12.6/8          11.0/8            268/8               32/8

[1] jStat
[2] GC: Concurrent Mark-Sweep

Page 46: Efficient   Graph  Processing  with Distributed Immutable View

Ingress Time

H = Hama, C = Cyclops

Dataset     LD (H)   LD (C)   REP (H)   REP (C)   INIT (H)   INIT (C)   TOT (H)   TOT (C)
Amazon      6.2      5.9      0.0       2.5       1.7        1.5        7.9       9.9
GWeb        7.1      6.8      0.0       2.8       2.6        1.9        9.7       11.4
LJournal    27.1     31.0     0.0       44.7      17.9       9.2        45.0      84.9
Wiki        46.7     46.7     0.0       62.2      33.4       20.4       80.0      129.3
SYN-GL      4.2      4.0      0.0       2.6       2.4        1.8        6.6       8.4
DBLP        4.1      4.1      0.0       1.5       1.3        0.9        5.4       6.5
RoadCA      6.4      6.2      0.0       3.9       0.9        0.6        7.3       10.7

Page 47: Efficient   Graph  Processing  with Distributed Immutable View

Selective Activation

Sync. & Distributed Activation
□ Merge update & activate messages
1. Update value of replicas
2. Invite replicas to activate neighbors

*Selective Activation (e.g. ALS)
Option: Activation_List
msg: v|m|s|l

[Figure: the partitioned graph on M1, M2 and M3. Vertex labels: "rlist: W1, l-act: 1, value: 8, msg: 4" and "l-act: 3, value: 6, msg: 3". Message format: msg: v|m|s, e.g. 8 4 0. Other labels: active, s.]

Page 48: Efficient   Graph  Processing  with Distributed Immutable View

Parallelism Improvement

Distributed immutable view opens an opportunity

[Figure: execution flow on M1 with computation, sending and receiving stages; out-queues to M2 and M3. Annotations: lock-free, sorted, good locality; comp. threads vs. comm. threads: separate configuration. Legend: thread, master, replica.]

Page 49: Efficient   Graph  Processing  with Distributed Immutable View

Existing graph-parallel systems (e.g., Pregel, GraphLab, PowerGraph)
□ w/o dynamic comp.
□ high contention
□ hard to program
□ duplicated edges
□ heavy comm. cost

Cyclops(MT) → Distributed Immutable View
□ w/ dynamic comp.
□ no contention
□ easy to program
□ no duplicated edges
□ low comm. cost

[Figure: a master sends one message (x1) to each replica.]

Page 50: Efficient   Graph  Processing  with Distributed Immutable View

What’s Next?

BiGraph: bipartite-oriented distributed graph partitioning for big learning
□ A set of online distributed graph partition algorithms designed for bipartite graphs and applications
□ Partition graphs in a differentiated way and load data according to the data affinity
□ Outperform PowerGraph with default partition by up to 17.75X, and save up to 96% network traffic

http://ipads.se.sjtu.edu.cn/projects/powerlyra.html

Page 51: Efficient   Graph  Processing  with Distributed Immutable View
Page 52: Efficient   Graph  Processing  with Distributed Immutable View

Multicore Support

Two Challenges
1. Two-level hierarchical organization
→ Preserve synchronous and deterministic computation nature (easy to program/debug)
2. Original BSP-like model is hard to parallelize
→ High contention to buffer and parse messages
→ Poor locality in message parsing
→ Asymmetric degree of parallelism for CPU and NIC