If you were plowing a field, which would you rather use?
1
If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)
A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph
Processing
Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto
and Matei Ripeanu
NetSysLab, The University of British Columbia
http://netsyslab.ece.ubc.ca
3
Graphs are Everywhere
4
Graph Processing Challenges
Poor locality, low compute-to-memory access ratio:
  CPUs: caches + summary data structures
  GPUs: massive hardware multithreading
Large memory footprint:
  CPUs: >128GB
  GPUs: 6GB
[Hong, S. 2011]
5
Motivating Question
Can we efficiently use hybrid systems for large-scale graph processing?
YES WE CAN! 2x speedup (4 billion edges)
6
Methodology
Performance Model: predicts speedup; intuitive
Totem: a graph processing engine for hybrid systems; applies algorithm-agnostic optimizations
Evaluation: predicted vs. achieved; hybrid vs. symmetric
7
The Performance Model (I)
[Figure: hybrid platform; a multi-core host (system memory) connected to a GPU (device memory); the graph G is partitioned into a host partition G_cpu and a GPU partition G_gpu]
α = |E_cpu| / |E|  (fraction of edges assigned to the host partition)
β = |E_boundary| / |E|  (fraction of edges that cross the partition boundary)
r_cpu = |E| / T(G)  (host-only processing rate, in edges per second)
c = b / sizeof(m)  (communication rate: bus bandwidth b over per-edge state size m)
Speedup(α, β) = 1 / (α + β · r_cpu / c)
Predicts the speedup obtained from offloading part of the graph to the GPU (when compared to processing only on the host)
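The model on this slide can be sketched as a small function; the function name and argument names below are illustrative, not part of Totem:

```python
def predicted_speedup(alpha, beta, r_cpu, c):
    """Predicted speedup of a hybrid (CPU + GPU) run over a host-only run.

    alpha: fraction of edges assigned to the CPU partition
    beta:  fraction of edges crossing the CPU/GPU boundary
    r_cpu: host-only processing rate (edges per second)
    c:     communication rate over the bus (edges per second),
           i.e. bandwidth b divided by per-edge state size sizeof(m)
    """
    # Host-only time is |E| / r_cpu; hybrid time is the CPU's share
    # alpha*|E|/r_cpu plus the boundary transfer beta*|E|/c (assuming
    # the GPU partition is never the bottleneck). The |E| terms cancel.
    return 1.0 / (alpha + beta * r_cpu / c)

# Slide 8's numbers: b ~= 4 GB/s, m = 4 bytes => c = 1e9 edges/s;
# r_cpu = 0.5e9 edges/s; worst-case beta = 20%; offload half the edges.
print(predicted_speedup(alpha=0.5, beta=0.2, r_cpu=0.5e9, c=1e9))  # ~1.67
```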
8
The Performance Model (II)
[Figure: hybrid host + GPU platform, as before; plot of predicted speedup vs. fraction of edges offloaded, for β = 20% and r_cpu = 0.5 BEPS]
Assume a PCI-E bus, b ≈ 4 GB/sec, and per-edge state m = 4 bytes => c = b / sizeof(m) = 1 billion EPS
r_cpu = 0.5 BEPS: best reported single-node BFS performance [Agarwal, V. 2010] (|V| = 32M, |E| = 1B)
β = 20%: worst case (e.g., bipartite graph)
It is beneficial to process the graph on a hybrid system if communication overhead is kept low
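Plugging these assumptions into the model gives a quick sanity check; for instance, offloading half the edges (α = 0.5):

\[
c = \frac{b}{\mathrm{sizeof}(m)} = \frac{4\,\mathrm{GB/s}}{4\,\mathrm{bytes}} = 10^9\ \mathrm{EPS},
\qquad
\mathrm{Speedup}(0.5) = \frac{1}{0.5 + 0.2 \cdot \frac{0.5 \times 10^9}{10^9}} = \frac{1}{0.6} \approx 1.67\times
\]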
9
Totem: Programming Model (Bulk Synchronous Parallel)
Rounds of computation and communication phases
Updates to remote vertices are delivered in the next round
Partitions vote to terminate execution
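The BSP execution described above can be sketched as a toy engine loop; the `Partition` class and `bsp_run` function are illustrative stand-ins, not Totem's actual API:

```python
class Partition:
    """Toy partition: runs a fixed number of rounds, pinging its peer each round."""
    def __init__(self, pid, rounds_left):
        self.pid, self.rounds_left = pid, rounds_left
        self.inbox, self.outbox = [], []

    def compute(self):
        self.inbox.clear()                      # consume last round's messages
        self.rounds_left -= 1
        self.outbox.append((1 - self.pid, ["ping"]))
        return self.rounds_left <= 0            # vote to terminate when done

    def drain_outboxes(self):
        out, self.outbox = self.outbox, []
        return out

def bsp_run(partitions):
    """Rounds of computation then communication, until all partitions vote to stop."""
    rounds = 0
    while True:
        votes = [p.compute() for p in partitions]       # computation phase
        for p in partitions:                            # communication phase
            for dst, msgs in p.drain_outboxes():
                partitions[dst].inbox.extend(msgs)      # delivered next round
        rounds += 1
        if all(votes):
            return rounds

print(bsp_run([Partition(0, 3), Partition(1, 2)]))  # 3: runs until ALL vote to stop
```

Note the termination rule mirrors the slide: execution ends only when every partition votes to terminate in the same round.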
10
Totem: A BSP-based Engine
[Figure: a small six-vertex graph split between a CPU partition and a GPU partition; each partition stores its subgraph as V and E arrays, with outbox/inbox buffers for edges that cross the partition boundary]
Compressed sparse row representation
Computation: kernel manipulates local state
Updates to remote vertices aggregated locally
Comm1: transfer outbox buffer to remote input buffer
Comm2: merge with local state
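The compressed sparse row (CSR) layout on this slide stores a vertex array V of offsets and an edge array E of neighbour ids; a minimal sketch of building it (helper name is mine):

```python
def to_csr(num_vertices, edges):
    """Build CSR adjacency arrays: V[i]..V[i+1] delimits vertex i's slice of E."""
    deg = [0] * num_vertices
    for src, _ in edges:
        deg[src] += 1
    V = [0] * (num_vertices + 1)
    for i in range(num_vertices):          # prefix sums of out-degrees
        V[i + 1] = V[i] + deg[i]
    E = [0] * len(edges)
    cursor = V[:-1].copy()                 # next free slot per vertex
    for src, dst in edges:
        E[cursor[src]] = dst
        cursor[src] += 1
    return V, E

V, E = to_csr(4, [(0, 1), (0, 2), (1, 2), (3, 0)])
print(V)  # [0, 2, 3, 3, 4]
print(E)  # [1, 2, 2, 0]
```

Vertex 0's neighbours are E[V[0]:V[1]] = [1, 2]; vertex 2 has none since V[2] == V[3]. The two flat arrays give the sequential, cache- and GPU-friendly access pattern the engine relies on.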
11
The Aggregation Opportunity
[Plot: communication reduction from aggregating boundary messages; random graphs, |E| = 512 million]
Real-world graphs are mostly scale-free: skewed degree distribution
Sparse graph: ~5x reduction
Denser graphs have better opportunity for aggregation: ~50x reduction
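Aggregation works because many boundary edges target the same remote vertex, so their per-edge messages can be combined locally (e.g., with min for BFS levels) before crossing the bus. A minimal sketch, with an illustrative helper name:

```python
def aggregate_outbox(messages, combine):
    """Combine per-edge messages that target the same remote vertex.

    messages: (remote_vertex, value) pairs, one per boundary edge
    combine:  associative reducer, e.g. min for BFS-style level updates
    Returns one message per distinct remote vertex.
    """
    box = {}
    for v, val in messages:
        box[v] = combine(box[v], val) if v in box else val
    return box

# Three boundary edges all point at remote vertex 7: only one message crosses the bus.
msgs = [(7, 3), (7, 2), (7, 5), (9, 1)]
print(aggregate_outbox(msgs, min))  # {7: 2, 9: 1}
```

The denser the graph, the more boundary edges share a target vertex, which is why the reduction grows from ~5x to ~50x above.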
12
Evaluation Setup
Workload: R-MAT graphs, |V| = 32M, |E| = 1B, unless otherwise noted
Algorithms: Breadth-first Search, PageRank
Metrics: speedup compared to processing on the host only
Testbed: host: dual-socket Intel Xeon with 16GB; GPU: Nvidia Tesla C2050 with 3GB
13
Predicted vs Achieved Speedup
Linear speedup with respect to offloaded part
GPU partition fills GPU memory
After aggregation, β = 2%; a low β is critical for BFS
14
Breakdown of Execution Time
PageRank is dominated by the compute phase
Aggregation significantly reduced communication overhead
GPU is > 5x faster than the host
15
Effect of Graph Density
Sparser graphs have higher β
Deviation is due to not incorporating pre- and post-transfer overheads in the model
16
Contributions
Performance modeling: simple; useful for initial system provisioning
Totem: generic graph processing framework; algorithm-agnostic optimizations
Evaluation (Graph500 scale-28): 2x speedup over a symmetric system; 1.13 billion TEPS on a dual-socket, dual-GPU system
17
Questions?
code available at: netsyslab.ece.ubc.ca