If you were plowing a field, which would you rather use?
1
If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)
A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph
Processing
Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto
and Matei Ripeanu
NetSysLab, The University of British Columbia
http://netsyslab.ece.ubc.ca
3
Graphs are Everywhere
4
Graph Processing Challenges
Poor locality, low compute-to-memory access ratio:
  CPUs: caches + summary data structures
  GPUs: massive hardware multithreading
Large memory footprint:
  CPUs: >128GB
  GPUs: 6GB
[Hong, S. 2011]
5
Motivating Question
Can we efficiently use hybrid systems for large-scale graph processing?
YES WE CAN! 2x speedup (4 billion edges)
6
Methodology
Performance Model: predicts speedup; intuitive
Totem: a graph processing engine for hybrid systems; applies algorithm-agnostic optimizations
Evaluation: predicted vs. achieved; hybrid vs. symmetric
7
The Performance Model (I)
[Figure: hybrid platform; a multi-core host (system memory) connected to a GPU (device memory); the graph G is partitioned into a host partition G_cpu and a GPU partition G_gpu]
α = |E_cpu| / |E|  (fraction of edges assigned to the host partition)
β = |E_boundary| / |E|  (fraction of edges that cross the partition boundary)
r_cpu = |E| / T(G)  (host-only processing rate, in edges per second)
c = b / sizeof(m)  (communication rate: bus bandwidth b over per-edge state size m)
Speedup(α, β) = 1 / (α + β · r_cpu / c)
Predicts the speedup obtained from offloading part of the graph to the GPU (when compared to processing only on the host)
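The model on this slide can be sketched as a small function; the function name and argument names below are illustrative, not part of Totem:

```python
def predicted_speedup(alpha, beta, r_cpu, c):
    """Predicted speedup of a hybrid (CPU + GPU) run over a host-only run.

    alpha: fraction of edges assigned to the CPU partition
    beta:  fraction of edges crossing the CPU/GPU boundary
    r_cpu: host-only processing rate (edges per second)
    c:     communication rate over the bus (edges per second),
           i.e. bandwidth b divided by per-edge state size sizeof(m)
    """
    # Host-only time is |E| / r_cpu; hybrid time is the CPU's share
    # alpha*|E|/r_cpu plus the boundary transfer beta*|E|/c (assuming
    # the GPU partition is never the bottleneck). The |E| terms cancel.
    return 1.0 / (alpha + beta * r_cpu / c)

# Slide 8's numbers: b ~= 4 GB/s, m = 4 bytes => c = 1e9 edges/s;
# r_cpu = 0.5e9 edges/s; worst-case beta = 20%; offload half the edges.
print(predicted_speedup(alpha=0.5, beta=0.2, r_cpu=0.5e9, c=1e9))  # ~1.67
```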
8
The Performance Model (II)
[Figure: hybrid host + GPU platform, as before; plot of predicted speedup vs. fraction of edges offloaded, for β = 20% and r_cpu = 0.5 BEPS]
Assume a PCI-E bus, b ≈ 4 GB/sec, and per-edge state m = 4 bytes => c = b / sizeof(m) = 1 billion EPS
r_cpu = 0.5 BEPS: best reported single-node BFS performance [Agarwal, V. 2010] (|V| = 32M, |E| = 1B)
β = 20%: worst case (e.g., bipartite graph)
It is beneficial to process the graph on a hybrid system if communication overhead is kept low
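Plugging these assumptions into the model gives a quick sanity check; for instance, offloading half the edges (α = 0.5):

\[
c = \frac{b}{\mathrm{sizeof}(m)} = \frac{4\,\mathrm{GB/s}}{4\,\mathrm{bytes}} = 10^9\ \mathrm{EPS},
\qquad
\mathrm{Speedup}(0.5) = \frac{1}{0.5 + 0.2 \cdot \frac{0.5 \times 10^9}{10^9}} = \frac{1}{0.6} \approx 1.67\times
\]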
9
Totem: Programming Model (Bulk Synchronous Parallel)
Rounds of computation and communication phases
Updates to remote vertices are delivered in the next round
Partitions vote to terminate execution
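The BSP execution described above can be sketched as a toy engine loop; the `Partition` class and `bsp_run` function are illustrative stand-ins, not Totem's actual API:

```python
class Partition:
    """Toy partition: runs a fixed number of rounds, pinging its peer each round."""
    def __init__(self, pid, rounds_left):
        self.pid, self.rounds_left = pid, rounds_left
        self.inbox, self.outbox = [], []

    def compute(self):
        self.inbox.clear()                      # consume last round's messages
        self.rounds_left -= 1
        self.outbox.append((1 - self.pid, ["ping"]))
        return self.rounds_left <= 0            # vote to terminate when done

    def drain_outboxes(self):
        out, self.outbox = self.outbox, []
        return out

def bsp_run(partitions):
    """Rounds of computation then communication, until all partitions vote to stop."""
    rounds = 0
    while True:
        votes = [p.compute() for p in partitions]       # computation phase
        for p in partitions:                            # communication phase
            for dst, msgs in p.drain_outboxes():
                partitions[dst].inbox.extend(msgs)      # delivered next round
        rounds += 1
        if all(votes):
            return rounds

print(bsp_run([Partition(0, 3), Partition(1, 2)]))  # 3: runs until ALL vote to stop
```

Note the termination rule mirrors the slide: execution ends only when every partition votes to terminate in the same round.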
10
Totem: A BSP-based Engine
[Figure: a small six-vertex graph split between a CPU partition and a GPU partition; each partition stores its subgraph as V and E arrays, with outbox/inbox buffers for edges that cross the partition boundary]
Compressed sparse row representation
Computation: kernel manipulates local state
Updates to remote vertices aggregated locally
Comm1: transfer outbox buffer to remote input buffer
Comm2: merge with local state
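The compressed sparse row (CSR) layout on this slide stores a vertex array V of offsets and an edge array E of neighbour ids; a minimal sketch of building it (helper name is mine):

```python
def to_csr(num_vertices, edges):
    """Build CSR adjacency arrays: V[i]..V[i+1] delimits vertex i's slice of E."""
    deg = [0] * num_vertices
    for src, _ in edges:
        deg[src] += 1
    V = [0] * (num_vertices + 1)
    for i in range(num_vertices):          # prefix sums of out-degrees
        V[i + 1] = V[i] + deg[i]
    E = [0] * len(edges)
    cursor = V[:-1].copy()                 # next free slot per vertex
    for src, dst in edges:
        E[cursor[src]] = dst
        cursor[src] += 1
    return V, E

V, E = to_csr(4, [(0, 1), (0, 2), (1, 2), (3, 0)])
print(V)  # [0, 2, 3, 3, 4]
print(E)  # [1, 2, 2, 0]
```

Vertex 0's neighbours are E[V[0]:V[1]] = [1, 2]; vertex 2 has none since V[2] == V[3]. The two flat arrays give the sequential, cache- and GPU-friendly access pattern the engine relies on.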
11
The Aggregation Opportunity
[Plot: communication reduction from aggregating boundary messages; random graphs, |E| = 512 million]
Real-world graphs are mostly scale-free: skewed degree distribution
Sparse graph: ~5x reduction
Denser graphs have better opportunity for aggregation: ~50x reduction
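Aggregation works because many boundary edges target the same remote vertex, so their per-edge messages can be combined locally (e.g., with min for BFS levels) before crossing the bus. A minimal sketch, with an illustrative helper name:

```python
def aggregate_outbox(messages, combine):
    """Combine per-edge messages that target the same remote vertex.

    messages: (remote_vertex, value) pairs, one per boundary edge
    combine:  associative reducer, e.g. min for BFS-style level updates
    Returns one message per distinct remote vertex.
    """
    box = {}
    for v, val in messages:
        box[v] = combine(box[v], val) if v in box else val
    return box

# Three boundary edges all point at remote vertex 7: only one message crosses the bus.
msgs = [(7, 3), (7, 2), (7, 5), (9, 1)]
print(aggregate_outbox(msgs, min))  # {7: 2, 9: 1}
```

The denser the graph, the more boundary edges share a target vertex, which is why the reduction grows from ~5x to ~50x above.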
12
Evaluation Setup
Workload: R-MAT graphs, |V| = 32M, |E| = 1B, unless otherwise noted
Algorithms: Breadth-first Search, PageRank
Metrics: speedup compared to processing on the host only
Testbed: host: dual-socket Intel Xeon with 16GB; GPU: Nvidia Tesla C2050 with 3GB
13
Predicted vs Achieved Speedup
Linear speedup with respect to offloaded part
GPU partition fills GPU memory
After aggregation, β = 2%; a low β is critical for BFS
14
Breakdown of Execution Time
PageRank is dominated by the compute phase
Aggregation significantly reduced communication overhead
GPU is > 5x faster than the host
15
Effect of Graph Density
Sparser graphs have higher β
Deviation is due to not incorporating pre- and post-transfer overheads in the model
16
Contributions
Performance modeling: simple; useful for initial system provisioning
Totem: generic graph processing framework; algorithm-agnostic optimizations
Evaluation (Graph500 scale-28): 2x speedup over a symmetric system; 1.13 billion TEPS on a dual-socket, dual-GPU system
17
Questions?
code available at: netsyslab.ece.ubc.ca