Design Considerations for a GraphBLAS-Compliant Graph Library on Clusters of GPUs
July 11, 2016
Carl Yang, University of California, Davis
Aydın Buluç, Lawrence Berkeley National Laboratory
John D. Owens, University of California, Davis
GraphBLAS
Instead of writing a graph algorithm from scratch, users can express the algorithm using a small number of GraphBLAS kernels.
Hardware-agnostic: Users do not have to relearn many different APIs in order to run graph algorithms on different hardware.
Efficiency: Kernels are implemented with performance guarantees.
Why use GPUs for graphs?
Cost-efficient: a K20X costs ~$2000
Improving faster than traditional CPUs:
• P100 (2016): ~8 TFLOPS, ~1 TB/s, ~3800 ALUs, 32 GB mem.; CPU-to-GPU bandwidth: ~80-200 GB/s (NVLink)
• Xeon Phi 7250 (2016): ~6 TFLOPS, ~400 GB/s, ~288 threads, 16 GB mem.; CPU-to-coprocessor bandwidth: ~100 GB/s (Omni-Path)
Challenges with using GPUs for graph analysis
• Hard to program → solved by the GraphBLAS API
• Small memory (CPU RAM available is ~100x larger) → solved by using multiple GPUs
GraphBLAS kernel (mXv)
Goal: Implement mXv on multiple GPUs following the GraphBLAS standard
• Same as SpMV
• Same as vXm, which computes yᵀ = xᵀA
Input: Given adjacency matrix A, vector x
Want: Compute y = Aᵀx
Current GraphBLAS API:
GrB_info GrB_mXv( GrB_Vector *y, const GrB_Semiring s, const GrB_Matrix A, const GrB_Vector x )
Def: Think of a semiring as an object with a "x" operator, a "+" operator, and an identity for each operator.
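To make the definition concrete, here is a minimal C++ sketch of the semiring concept; the Semiring struct and the PlusTimes/MinPlus objects are illustrative names only and are not the GraphBLAS C API.

// Minimal sketch of the semiring concept from the slide above.
// These names are illustrative only -- not the GraphBLAS C API.
#include <limits>

template <typename T>
struct Semiring {
  T (*add)(T, T);   // "+" operator
  T (*mul)(T, T);   // "x" operator
  T add_identity;   // identity of "+"
  T mul_identity;   // identity of "x"
};

// Conventional (+, x) semiring for weights and counting.
inline float plusOp(float a, float b)  { return a + b; }
inline float timesOp(float a, float b) { return a * b; }
const Semiring<float> PlusTimes = { plusOp, timesOp, 0.0f, 1.0f };

// (min, +) semiring, useful for SSSP-style relaxations.
inline float minOp(float a, float b) { return a < b ? a : b; }
const Semiring<float> MinPlus = { minOp, plusOp,
                                  std::numeric_limits<float>::infinity(), 0.0f };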
Applications of SpMV
• SSSP
• PageRank
• MIS
• And many more!
BFS Using SpMV (Iteration 1)

Aᵀ = [0 0 0 0 0; 1 0 0 0 0; 0 1 0 0 0; 0 1 1 0 0; 0 0 1 1 0]
x = ? (start from source vertex 0: x = [1 0 0 0 0]ᵀ)
BFS = [0 -1 -1 -1 -1]

Each iteration:
1) compute y = Aᵀx
2) update BFS result
3) use BFS result to filter y
4) stop if BFS result is unchanged

(Figure: example graph on vertices 0-4.)
BFS Using SpMV (Iteration 1)

x = [1 0 0 0 0]ᵀ, y = Aᵀx = [0 1 0 0 0]ᵀ, BFS = [0 -1 -1 -1 -1]
(Aᵀ and the iteration steps as above.)
BFS Using SpMV (Iteration 1)

x = [1 0 0 0 0]ᵀ, y = [0 1 0 0 0]ᵀ, BFS = [0 1 -1 -1 -1]
(Aᵀ and the iteration steps as above; slide annotation: new: -2 on step 1.)
BFS Using SpMV (Iteration 2)

x = [0 1 0 0 0]ᵀ, y = Aᵀx = [0 0 1 1 0]ᵀ, BFS = [0 1 -1 -1 -1]
(Aᵀ and the iteration steps as above; slide annotation: old: -2 on step 2.)
BFS Using SpMV (Iteration 2)

x = [0 1 0 0 0]ᵀ, y = [0 0 1 1 0]ᵀ, BFS = [0 1 2 2 -1]
(Aᵀ and the iteration steps as above; slide annotations: new: 4 on step 1, old: -2 on step 2.)
BFS Using SpMV (Iteration 3)

x = [0 0 1 1 0]ᵀ, y = Aᵀx = [0 0 0 1 2]ᵀ, BFS = [0 1 2 2 -1]
(Aᵀ and the iteration steps as above; slide annotation: old: 4 on step 2.)
BFS Using SpMV (Iteration 3)

x = [0 0 1 1 0]ᵀ, y = [0 0 0 1 2]ᵀ, BFS = [0 1 2 2 3]
(Aᵀ and the iteration steps as above; slide annotations: new: 8 on step 1, old: 4 on step 2.)
BFS Using SpMV (Iteration 4)

x = [0 0 0 0 1]ᵀ, y = Aᵀx = [0 0 0 0 0]ᵀ, BFS = [0 1 2 2 3]
(Aᵀ and the iteration steps as above; slide annotations: new: 8 on step 1, old: 8 on step 2. The BFS result is unchanged, so the traversal stops.)
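Putting the four steps together, the traversal above can be sketched as a small host-side program. This is a dense, sequential illustration of the walkthrough; the spmvT helper and the dense storage are assumptions made for clarity, not the GPU implementation described in this talk.

// Host-side sketch of BFS via repeated SpMV, following the four steps on the
// slides. Dense storage is used only to keep the example short.
#include <cstdio>
#include <vector>

// y = A^T * x over the (+, x) semiring.
std::vector<float> spmvT(const std::vector<std::vector<float>>& A,
                         const std::vector<float>& x) {
  size_t n = A.size();
  std::vector<float> y(n, 0.0f);
  for (size_t i = 0; i < n; ++i)
    for (size_t j = 0; j < n; ++j)
      y[j] += A[i][j] * x[i];           // (A^T)[j][i] == A[i][j]
  return y;
}

int main() {
  // Example graph from the slides: edges 0->1, 1->2, 1->3, 2->3, 2->4, 3->4.
  std::vector<std::vector<float>> A = {
      {0,1,0,0,0}, {0,0,1,1,0}, {0,0,0,1,1}, {0,0,0,0,1}, {0,0,0,0,0}};
  int n = 5;
  std::vector<float> x(n, 0.0f); x[0] = 1.0f;   // frontier = {source vertex 0}
  std::vector<int> bfs(n, -1);   bfs[0] = 0;    // depth of the source is 0

  for (int depth = 1; depth <= n; ++depth) {
    std::vector<float> y = spmvT(A, x);         // 1) compute y = A^T x
    bool changed = false;
    for (int v = 0; v < n; ++v) {
      if (y[v] != 0.0f && bfs[v] == -1) {       // 2) update BFS result
        bfs[v] = depth;
        changed = true;
      } else {
        y[v] = 0.0f;                            // 3) filter y with BFS result
      }
    }
    if (!changed) break;                        // 4) stop if unchanged
    x = y;
  }

  for (int v = 0; v < n; ++v) printf("bfs[%d] = %d\n", v, bfs[v]);
  return 0;   // prints depths 0 1 2 2 3, matching the slides
}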
Design Considerations
1. Column-based SpMV
2. How to solve multiway-merge
3. Partitioning scheme
1. Row-based SpMV

Aᵀ = [0 0 1 0 0; 4 0 0 0 0; 0 0 1 2 1; 1 4 3 0 0; 0 5 2 0 0]
x = [0 0 2 1 4]ᵀ, y = Aᵀx = [2 0 8 6 4]ᵀ

• 1) Elementwise multiply
• 2) Reduce
• Work: O(nnz)
• Independent of vector sparsity
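A minimal one-thread-per-row CUDA kernel for this approach, assuming Aᵀ is stored in CSR; the kernel name and launch configuration are illustrative, and production libraries such as cuSPARSE use far more sophisticated load balancing.

// Sketch of row-based SpMV over the (+, x) semiring: one thread per row of
// A^T in CSR. Work is O(nnz), independent of the sparsity of x.
__global__ void spmv_row_csr(int num_rows,
                             const int*   __restrict__ row_ptr,
                             const int*   __restrict__ col_idx,
                             const float* __restrict__ vals,
                             const float* __restrict__ x,
                             float*       __restrict__ y) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= num_rows) return;
  float sum = 0.0f;                      // identity of the "+" operator
  for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
    sum += vals[k] * x[col_idx[k]];      // "x" then "+" of the semiring
  y[row] = sum;
}

// Launch sketch: spmv_row_csr<<<(n + 255) / 256, 256>>>(n, rp, ci, v, x, y);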
Why not Row-based?
cuSPARSE, ModernGPU, and CUB all provide row-based SpMV GPU implementations.
Why implement our own?
A Motivating Example

A = [0 0 1 0 0 1 0 0 2 2;
     4 0 0 0 0 1 3 0 0 2;
     0 0 1 2 1 0 0 0 7 9;
     1 4 3 0 0 2 4 3 3 9;
     0 5 2 0 0 2 0 0 0 0;
     1 2 0 0 0 0 2 2 0 0;
     3 3 2 0 0 2 0 0 0 0;
     0 0 2 0 0 0 0 2 3 0;
     0 0 0 0 0 0 3 0 0 0;
     3 0 2 2 9 0 0 0 0 0]
x = [0 0 0 0 0 0 2 0 0 0]ᵀ, y = Ax = [0 6 0 8 0 4 0 0 6 0]ᵀ

• Take the scalar from vector x, and multiply it by the corresponding column of A: since x has a single nonzero, only one column of A contributes to y.
Column-based SpMV

Aᵀ = [0 0 1 0 0; 4 0 0 0 0; 0 0 1 2 1; 1 4 3 0 0; 0 5 2 0 0]
x = [0 0 2 1 4]ᵀ

Aᵀx = x₃·Aᵀ(:,3) + x₄·Aᵀ(:,4) + x₅·Aᵀ(:,5)
    = [2 0 2 6 4]ᵀ + [0 0 2 0 0]ᵀ + [0 0 4 0 0]ᵀ
    = [2 0 8 6 4]ᵀ = y

• 1) Gather and elementwise multiply
• 2) Multiway-merge (many elementwise adds)
• Work: O( Σ_{i=1..nnz(x)} d_{x[i]} + kn )
• Suitable for sparse vectors, due to dependence on vector sparsity
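The two phases can be written as a small sequential sketch, assuming Aᵀ is stored in CSC so that each needed column can be gathered directly; the CscMatrix and spmspv_column names are hypothetical, and the final sort-and-combine stands in for the GPU multiway-merge discussed next.

// Sketch of column-based SpMSpV: gather the columns of A^T selected by the
// nonzeros of x, scale them, then multiway-merge duplicate row indices.
#include <algorithm>
#include <utility>
#include <vector>

struct CscMatrix {              // A^T in compressed sparse column form
  std::vector<int>   col_ptr;   // size n+1
  std::vector<int>   row_idx;   // size nnz
  std::vector<float> vals;      // size nnz
};

// x is sparse: (index, value) pairs. Returns y as sorted (index, value) pairs.
std::vector<std::pair<int,float>>
spmspv_column(const CscMatrix& At, const std::vector<std::pair<int,float>>& x) {
  std::vector<std::pair<int,float>> products;
  // Phase 1: gather + elementwise multiply. Work is the sum of the column
  // lengths d_j over the nonzero positions j of x.
  for (auto [j, xj] : x)
    for (int k = At.col_ptr[j]; k < At.col_ptr[j + 1]; ++k)
      products.emplace_back(At.row_idx[k], At.vals[k] * xj);
  // Phase 2: multiway-merge -- combine entries that share a row index.
  std::sort(products.begin(), products.end());
  std::vector<std::pair<int,float>> y;
  for (auto& p : products) {
    if (!y.empty() && y.back().first == p.first) y.back().second += p.second;
    else y.push_back(p);
  }
  return y;
}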
For GraphBLAS folks
Should we implement both a row-based SpMV and a column-based SpMV, and automatically decide which one to use?
This is the idea behind direction-optimizing BFS.
Design Considerations
1. Column-based SpMV
2. How to solve multiway-merge
3. Partitioning scheme
2. Intuition of multiway-merge
Gathered columns can contribute entries with the same row index, leading to duplicates.
Input: sorted segments, with possible duplicates. E.g. [4 5 6 4 5 6 3 4 5]
Want: unique entries. E.g. [4 5 6 3]
(Figure: example graph on vertices 0-6.)
Block Diagram (Single GPU)

(x): Multiplication operator of semiring
(+): Addition operator of semiring

Input: Sparse Vector x, Sparse Matrix Aᵀ
→ Gather columns from Aᵀ
→ (x) ewiseMult
→ (+) Multiway-merge
→ Output: Sparse Vector y
Block Diagram (Single GPU), multiway-merge expanded

(x): Multiplication operator of semiring
(+): Addition operator of semiring

Input: Sparse Vector x, Sparse Matrix Aᵀ
→ Gather columns from Aᵀ
→ (x) ewiseMult
→ Multiway-merge, implemented as Radix Sort followed by (+) Segmented Reduce
→ Output: Sparse Vector y

Radix sort works best.
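On the GPU, this formulation maps naturally onto key-value sort and segmented-reduce primitives. A minimal Thrust sketch on the example segments above: sort_by_key (which typically dispatches to a radix sort for primitive integer keys) groups duplicate row indices, and reduce_by_key plays the role of the segmented reduce. The all-ones values are an assumption made only so the combine step is visible.

// Sketch: resolve duplicates in the concatenated segments with a key-value
// sort followed by reduce_by_key (the "(+)" of the semiring).
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <cstdio>

int main() {
  // Row indices from the slide example: [4 5 6 4 5 6 3 4 5], with one partial
  // product value per index (all 1.0f here just for illustration).
  int h_keys[] = {4, 5, 6, 4, 5, 6, 3, 4, 5};
  thrust::device_vector<int>   keys(h_keys, h_keys + 9);
  thrust::device_vector<float> vals(9, 1.0f);

  // Sort brings duplicate row indices next to each other.
  thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

  // Segmented reduce: one output entry per unique row index.
  thrust::device_vector<int>   out_keys(keys.size());
  thrust::device_vector<float> out_vals(keys.size());
  auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                    out_keys.begin(), out_vals.begin());
  int n_unique = ends.first - out_keys.begin();

  for (int i = 0; i < n_unique; ++i)
    printf("row %d -> %g\n", (int)out_keys[i], (float)out_vals[i]);
  return 0;
}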
Importance of multiway-merge
• Multiway-merge takes ~85.6% of application runtime
• Gather and ewiseMult take 11 ms
For GraphBLAS folks
Should the sparse vector be allowed to contain duplicate entries?
This is the idea behind Gunrock's idempotent discovery, which lets it get away without solving multiway-merge completely: some duplicates are allowed.
Design Considerations
1. Column-based SpMV
2. How to solve multiway-merge
3. Partitioning scheme
3. Partitioning scheme

A = [A1 A2 A3 … Ap] (1D column blocks), x = [x1; x2; x3; …; xp]
y = A1·x1 + A2·x2 + A3·x3 + … + Ap·xp = y(1) + y(2) + y(3) + … + y(p)

Each iteration:
1) GPU i computes y(i) = Ai·xi
2) computes a histogram of send counts
3) sends the send counts to the other GPUs using MPI_Alltoall
4) sends the data to the other GPUs using MPI_Alltoallv
5) each GPU multiway-merges the piece it is responsible for

(Successive builds of this slide highlight the pieces y1 and y2 that GPU1 and GPU2 own after the exchange.)
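On each rank (assuming one MPI rank per GPU), steps 2-4 might look like the sketch below. The exchange_pieces helper and the bucketed send layout are assumptions for illustration, and only values (not row indices) are exchanged to keep the sketch short; this is not the library's actual code.

// Sketch of steps 2-4: exchange send counts with MPI_Alltoall, then exchange
// the partial results with MPI_Alltoallv. Assumes one rank per GPU and that
// the local mXv has already bucketed its output by destination rank.
#include <mpi.h>
#include <numeric>
#include <vector>

void exchange_pieces(const std::vector<std::vector<float>>& send_buckets,
                     std::vector<float>& recv_buf, MPI_Comm comm) {
  int p;  MPI_Comm_size(comm, &p);

  // 2) Histogram of send counts: how much goes to every other GPU.
  std::vector<int> send_counts(p), recv_counts(p);
  for (int r = 0; r < p; ++r) send_counts[r] = (int)send_buckets[r].size();

  // 3) Everyone tells everyone else how much data to expect.
  MPI_Alltoall(send_counts.data(), 1, MPI_INT,
               recv_counts.data(), 1, MPI_INT, comm);

  // 4) Variable-sized exchange of the actual partial results.
  std::vector<int> send_displs(p, 0), recv_displs(p, 0);
  std::partial_sum(send_counts.begin(), send_counts.end() - 1, send_displs.begin() + 1);
  std::partial_sum(recv_counts.begin(), recv_counts.end() - 1, recv_displs.begin() + 1);

  std::vector<float> send_buf;
  for (int r = 0; r < p; ++r)
    send_buf.insert(send_buf.end(), send_buckets[r].begin(), send_buckets[r].end());
  recv_buf.resize(recv_displs[p - 1] + recv_counts[p - 1]);

  MPI_Alltoallv(send_buf.data(), send_counts.data(), send_displs.data(), MPI_FLOAT,
                recv_buf.data(), recv_counts.data(), recv_displs.data(), MPI_FLOAT, comm);
  // 5) The received pieces are then multiway-merged on this GPU.
}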
Block Matrix (Multi-GPU)

On GPU 1, 2, …, p:
Input: Sparse Vector xi, Sparse Matrix Ai
→ mXv (Single GPU): gather columns from Ai, ewiseMult, Multiway-merge
→ Histogram
→ MPI_Alltoall
→ MPI_Alltoallv
→ Multiway-merge
→ Output: Sparse Vector yi
Why 2 multiway-merges? The first merges the partial products of the local mXv on each GPU; the second merges the pieces received from the other GPUs after MPI_Alltoallv.
Experimental Setup
• CPU: 64x 16-core AMD Opteron 6274, 32 GB of main memory
• GPU: 64x NVIDIA K20X, 14 SMs, 192 ALUs per SM, 732 MHz, 6 GB on-board memory
• Interconnect: Gemini interconnect between nodes, 10.4 GB/s (HyperTransport 3.0)
Strong Scaling

(Figure: Billions of Edges Traversed (GTEPS) vs. Number of Processors, 1-64; ideal scaling vs. actual scaling.)
Weak Scaling (Vertex)

(Figure: Billions of Edges Traversed per Processor (GTEPS) vs. Number of Processors, 1-32; ideal scaling vs. actual scaling.)
For 4 GPUs, everything looks fine
• Majority of runtime is compute
• Two biggest iterations brought down from 121 ms to 40 ms
• Compute time > communication time
• MPI calls comprise a sliver of runtime
At 16 GPUs, the picture becomes bleak
• MPI calls now take up ~50% of runtime
• Compute is not overlapped with communication
• GPUs sit idle waiting for other GPUs to finish their compute
Communication vs. Compute

(Figure: relative cost per processor of communication vs. compute for 1, 2, 4, 8, and 16 processors.)
How to bridge this gap?
Work-in-progress
1. Overlap compute with communication: use multiple CUDA streams to make communication asynchronous (see the sketch below)
2. Try other partitioning schemes: variants of 1D column-based partitioning; 2D partitioning
3. Partition the graph better: Reverse Cuthill-McKee
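A minimal sketch of how item 1 might look with two CUDA streams, so that a kernel launch and a device-to-host copy can proceed concurrently; the kernel and function names are illustrative only, not the library's code.

// Sketch of overlapping compute with communication using two CUDA streams:
// while one chunk's results are copied back for MPI, the kernel for the next
// chunk of work can already be running.
#include <cuda_runtime.h>

__global__ void dummy_compute(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

void overlapped_step(float* d_work, float* d_result, float* h_result, int n) {
  cudaStream_t compute_stream, copy_stream;
  cudaStreamCreate(&compute_stream);
  cudaStreamCreate(&copy_stream);

  // Kick off the next chunk of compute...
  dummy_compute<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_work, n);

  // ...while the previous chunk's results are copied to host memory so MPI
  // can send them (h_result should be pinned for a truly asynchronous copy).
  cudaMemcpyAsync(h_result, d_result, n * sizeof(float),
                  cudaMemcpyDeviceToHost, copy_stream);

  cudaStreamSynchronize(copy_stream);     // results ready for MPI_Alltoallv
  cudaStreamSynchronize(compute_stream);
  cudaStreamDestroy(compute_stream);
  cudaStreamDestroy(copy_stream);
}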
Conclusion
• Row-based vs. column-based implementation for mXv: both have their uses.
• Should the sparse vector tolerate duplicate entries if the user does not care about intermediate sparse vectors?
Acknowledgements
This work is funded by:
• DARPA XDATA program, US Army award W911QX-12-C-0059
• DARPA STTR award D15PC00010
• Department of Energy, Office of Science, ASCR Contract No. DE-AC02-05CH11231
Questions?
Aydın Buluç, Lawrence Berkeley National Laboratory, [email protected]
John D. Owens, University of California, Davis, [email protected]
Backup Slides
Applications of Interest
• Bioinformatics
• Social network analysis
• Computer vision
• Fraud detection
• Urban planning
Challenges with Graph Analysis
• Too big to visualize
• Software complexity
• Structural properties vary
  - Small-world: difficult to partition; impacts scalability
  - Mesh: limits scalability of synchronous graph algorithms
1. Row-based mXv (SpMV)

Aᵀ = [0 0 1 0 0; 4 0 0 0 0; 0 0 1 2 1; 1 4 3 0 0; 0 5 2 0 0]
x = [0 0 2 1 4]ᵀ, y = Aᵀx = [2 0 8 6 4]ᵀ

• GPU threads compute each row simultaneously
• Work: O(nnz)
• Suitable for multiplication by a dense vector, because work is independent of vector sparsity
• Direction-optimizing or bottom-up BFS is a special case of this
Dominance of Big Iterations in BFS of Scale-free Graphs
• Two biggest iterations take 91% of runtime
• Other six iterations take 6 ms combined