Design Considerations for a GraphBLAS-Compliant Graph Library on Clusters of GPUs
July 11, 2016
Carl Yang, University of California, Davis
Aydın Buluç, Lawrence Berkeley National Laboratory
John D. Owens, University of California, Davis
GraphBLAS
Instead of writing a graph algorithm from scratch, users can express the algorithm using a small number of GraphBLAS kernels.
Hardware-agnostic: Users do not have to relearn many different APIs in order to run graph algorithms on different hardware.
Efficiency: Kernels are implemented with performance guarantees.
Why use GPUs for graphs?
Cost-efficient: a K20X costs ~$2000
Improving faster than traditional CPUs:
• P100 (2016): ~8 TFLOPS, ~1 TB/s, ~3800 ALUs, 32 GB mem.; CPU-to-GPU bandwidth: ~80-200 GB/s (NVLink)
• Xeon Phi 7250 (2016): ~6 TFLOPS, ~400 GB/s, ~288 threads, 16 GB mem.; CPU-to-coprocessor bandwidth: ~100 GB/s (Omni-Path)
Challenges with using GPUs for graph analysis
• Hard to program → solved by the GraphBLAS API
• Small memory (CPU RAM available is ~100x larger) → solved by using multiple GPUs
GraphBLAS kernel (mXv)
Goal: Implement mXv on multiple GPUs following the GraphBLAS standard
• Same as SpMV
• Same as vXm, which computes yᵀ = xᵀA
Input: Given adjacency matrix A, vector x
Want: Compute y = Aᵀx
Current GraphBLAS API:
GrB_info GrB_mXv( GrB_Vector *y, const GrB_Semiring s, const GrB_Matrix A, const GrB_Vector x )
Def: Think of a semiring as an object with a "x" operator, a "+" operator, and an identity for each operator.
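To make the definition concrete, here is a minimal C++ sketch of the semiring concept; the Semiring struct and the PlusTimes/MinPlus objects are illustrative names only and are not the GraphBLAS C API.

// Minimal sketch of the semiring concept from the slide above.
// These names are illustrative only -- not the GraphBLAS C API.
#include <limits>

template <typename T>
struct Semiring {
  T (*add)(T, T);   // "+" operator
  T (*mul)(T, T);   // "x" operator
  T add_identity;   // identity of "+"
  T mul_identity;   // identity of "x"
};

// Conventional (+, x) semiring for weights and counting.
inline float plusOp(float a, float b)  { return a + b; }
inline float timesOp(float a, float b) { return a * b; }
const Semiring<float> PlusTimes = { plusOp, timesOp, 0.0f, 1.0f };

// (min, +) semiring, useful for SSSP-style relaxations.
inline float minOp(float a, float b) { return a < b ? a : b; }
const Semiring<float> MinPlus = { minOp, plusOp,
                                  std::numeric_limits<float>::infinity(), 0.0f };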
Applications of SpMV
• SSSP
• PageRank
• MIS
• And many more!
BFS Using SpMV (Iteration 1)

Aᵀ = [0 0 0 0 0; 1 0 0 0 0; 0 1 0 0 0; 0 1 1 0 0; 0 0 1 1 0]
x = ? (start from source vertex 0: x = [1 0 0 0 0]ᵀ)
BFS = [0 -1 -1 -1 -1]

Each iteration:
1) compute y = Aᵀx
2) update BFS result
3) use BFS result to filter y
4) stop if BFS result is unchanged

(Figure: example graph on vertices 0-4.)
BFS Using SpMV (Iteration 1)

x = [1 0 0 0 0]ᵀ, y = Aᵀx = [0 1 0 0 0]ᵀ, BFS = [0 -1 -1 -1 -1]
(Aᵀ and the iteration steps as above.)
BFS Using SpMV (Iteration 1)

x = [1 0 0 0 0]ᵀ, y = [0 1 0 0 0]ᵀ, BFS = [0 1 -1 -1 -1]
(Aᵀ and the iteration steps as above; slide annotation: new: -2 on step 1.)
BFS Using SpMV (Iteration 2)

x = [0 1 0 0 0]ᵀ, y = Aᵀx = [0 0 1 1 0]ᵀ, BFS = [0 1 -1 -1 -1]
(Aᵀ and the iteration steps as above; slide annotation: old: -2 on step 2.)
BFS Using SpMV (Iteration 2)

x = [0 1 0 0 0]ᵀ, y = [0 0 1 1 0]ᵀ, BFS = [0 1 2 2 -1]
(Aᵀ and the iteration steps as above; slide annotations: new: 4 on step 1, old: -2 on step 2.)
BFS Using SpMV (Iteration 3)

x = [0 0 1 1 0]ᵀ, y = Aᵀx = [0 0 0 1 2]ᵀ, BFS = [0 1 2 2 -1]
(Aᵀ and the iteration steps as above; slide annotation: old: 4 on step 2.)
BFS Using SpMV (Iteration 3)

x = [0 0 1 1 0]ᵀ, y = [0 0 0 1 2]ᵀ, BFS = [0 1 2 2 3]
(Aᵀ and the iteration steps as above; slide annotations: new: 8 on step 1, old: 4 on step 2.)
BFS Using SpMV (Iteration 4)

x = [0 0 0 0 1]ᵀ, y = Aᵀx = [0 0 0 0 0]ᵀ, BFS = [0 1 2 2 3]
(Aᵀ and the iteration steps as above; slide annotations: new: 8 on step 1, old: 8 on step 2. The BFS result is unchanged, so the traversal stops.)
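Putting the four steps together, the traversal above can be sketched as a small host-side program. This is a dense, sequential illustration of the walkthrough; the spmvT helper and the dense storage are assumptions made for clarity, not the GPU implementation described in this talk.

// Host-side sketch of BFS via repeated SpMV, following the four steps on the
// slides. Dense storage is used only to keep the example short.
#include <cstdio>
#include <vector>

// y = A^T * x over the (+, x) semiring.
std::vector<float> spmvT(const std::vector<std::vector<float>>& A,
                         const std::vector<float>& x) {
  size_t n = A.size();
  std::vector<float> y(n, 0.0f);
  for (size_t i = 0; i < n; ++i)
    for (size_t j = 0; j < n; ++j)
      y[j] += A[i][j] * x[i];           // (A^T)[j][i] == A[i][j]
  return y;
}

int main() {
  // Example graph from the slides: edges 0->1, 1->2, 1->3, 2->3, 2->4, 3->4.
  std::vector<std::vector<float>> A = {
      {0,1,0,0,0}, {0,0,1,1,0}, {0,0,0,1,1}, {0,0,0,0,1}, {0,0,0,0,0}};
  int n = 5;
  std::vector<float> x(n, 0.0f); x[0] = 1.0f;   // frontier = {source vertex 0}
  std::vector<int> bfs(n, -1);   bfs[0] = 0;    // depth of the source is 0

  for (int depth = 1; depth <= n; ++depth) {
    std::vector<float> y = spmvT(A, x);         // 1) compute y = A^T x
    bool changed = false;
    for (int v = 0; v < n; ++v) {
      if (y[v] != 0.0f && bfs[v] == -1) {       // 2) update BFS result
        bfs[v] = depth;
        changed = true;
      } else {
        y[v] = 0.0f;                            // 3) filter y with BFS result
      }
    }
    if (!changed) break;                        // 4) stop if unchanged
    x = y;
  }

  for (int v = 0; v < n; ++v) printf("bfs[%d] = %d\n", v, bfs[v]);
  return 0;   // prints depths 0 1 2 2 3, matching the slides
}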
Design Considerations
1. Column-based SpMV
2. How to solve multiway-merge
3. Partitioning scheme
1. Row-based SpMV

Aᵀ = [0 0 1 0 0; 4 0 0 0 0; 0 0 1 2 1; 1 4 3 0 0; 0 5 2 0 0]
x = [0 0 2 1 4]ᵀ, y = Aᵀx = [2 0 8 6 4]ᵀ

• 1) Elementwise multiply
• 2) Reduce
• Work: O(nnz)
• Independent of vector sparsity
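A minimal one-thread-per-row CUDA kernel for this approach, assuming Aᵀ is stored in CSR; the kernel name and launch configuration are illustrative, and production libraries such as cuSPARSE use far more sophisticated load balancing.

// Sketch of row-based SpMV over the (+, x) semiring: one thread per row of
// A^T in CSR. Work is O(nnz), independent of the sparsity of x.
__global__ void spmv_row_csr(int num_rows,
                             const int*   __restrict__ row_ptr,
                             const int*   __restrict__ col_idx,
                             const float* __restrict__ vals,
                             const float* __restrict__ x,
                             float*       __restrict__ y) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= num_rows) return;
  float sum = 0.0f;                      // identity of the "+" operator
  for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
    sum += vals[k] * x[col_idx[k]];      // "x" then "+" of the semiring
  y[row] = sum;
}

// Launch sketch: spmv_row_csr<<<(n + 255) / 256, 256>>>(n, rp, ci, v, x, y);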
Why not Row-based?
cuSPARSE, ModernGPU, and CUB all provide row-based SpMV GPU implementations.
Why implement our own?
A Motivating Example

A = [0 0 1 0 0 1 0 0 2 2;
     4 0 0 0 0 1 3 0 0 2;
     0 0 1 2 1 0 0 0 7 9;
     1 4 3 0 0 2 4 3 3 9;
     0 5 2 0 0 2 0 0 0 0;
     1 2 0 0 0 0 2 2 0 0;
     3 3 2 0 0 2 0 0 0 0;
     0 0 2 0 0 0 0 2 3 0;
     0 0 0 0 0 0 3 0 0 0;
     3 0 2 2 9 0 0 0 0 0]
x = [0 0 0 0 0 0 2 0 0 0]ᵀ, y = Ax = [0 6 0 8 0 4 0 0 6 0]ᵀ

• Take the scalar from vector x, and multiply it by the corresponding column of A: since x has a single nonzero, only one column of A contributes to y.
Column-based SpMV

Aᵀ = [0 0 1 0 0; 4 0 0 0 0; 0 0 1 2 1; 1 4 3 0 0; 0 5 2 0 0]
x = [0 0 2 1 4]ᵀ

Aᵀx = x₃·Aᵀ(:,3) + x₄·Aᵀ(:,4) + x₅·Aᵀ(:,5)
    = [2 0 2 6 4]ᵀ + [0 0 2 0 0]ᵀ + [0 0 4 0 0]ᵀ
    = [2 0 8 6 4]ᵀ = y

• 1) Gather and elementwise multiply
• 2) Multiway-merge (many elementwise adds)
• Work: O( Σ_{i=1..nnz(x)} d_{x[i]} + kn )
• Suitable for sparse vectors, due to dependence on vector sparsity
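The two phases can be written as a small sequential sketch, assuming Aᵀ is stored in CSC so that each needed column can be gathered directly; the CscMatrix and spmspv_column names are hypothetical, and the final sort-and-combine stands in for the GPU multiway-merge discussed next.

// Sketch of column-based SpMSpV: gather the columns of A^T selected by the
// nonzeros of x, scale them, then multiway-merge duplicate row indices.
#include <algorithm>
#include <utility>
#include <vector>

struct CscMatrix {              // A^T in compressed sparse column form
  std::vector<int>   col_ptr;   // size n+1
  std::vector<int>   row_idx;   // size nnz
  std::vector<float> vals;      // size nnz
};

// x is sparse: (index, value) pairs. Returns y as sorted (index, value) pairs.
std::vector<std::pair<int,float>>
spmspv_column(const CscMatrix& At, const std::vector<std::pair<int,float>>& x) {
  std::vector<std::pair<int,float>> products;
  // Phase 1: gather + elementwise multiply. Work is the sum of the column
  // lengths d_j over the nonzero positions j of x.
  for (auto [j, xj] : x)
    for (int k = At.col_ptr[j]; k < At.col_ptr[j + 1]; ++k)
      products.emplace_back(At.row_idx[k], At.vals[k] * xj);
  // Phase 2: multiway-merge -- combine entries that share a row index.
  std::sort(products.begin(), products.end());
  std::vector<std::pair<int,float>> y;
  for (auto& p : products) {
    if (!y.empty() && y.back().first == p.first) y.back().second += p.second;
    else y.push_back(p);
  }
  return y;
}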
For GraphBLAS folks
Should we implement both a row-based SpMV and a column-based SpMV, and automatically decide which one to use?
This is the idea behind direction-optimizing BFS.
Design Considerations
1. Column-based SpMV
2. How to solve multiway-merge
3. Partitioning scheme
2. Intuition of multiway-merge
Gathered columns can contribute entries with the same row index, leading to duplicates.
Input: sorted segments, with possible duplicates. E.g. [4 5 6 4 5 6 3 4 5]
Want: unique entries. E.g. [4 5 6 3]
(Figure: example graph on vertices 0-6.)
Block Diagram (Single GPU)

(x): Multiplication operator of semiring
(+): Addition operator of semiring

Input: Sparse Vector x, Sparse Matrix Aᵀ
→ Gather columns from Aᵀ
→ (x) ewiseMult
→ (+) Multiway-merge
→ Output: Sparse Vector y
Block Diagram (Single GPU), multiway-merge expanded

(x): Multiplication operator of semiring
(+): Addition operator of semiring

Input: Sparse Vector x, Sparse Matrix Aᵀ
→ Gather columns from Aᵀ
→ (x) ewiseMult
→ Multiway-merge, implemented as Radix Sort followed by (+) Segmented Reduce
→ Output: Sparse Vector y

Radix sort works best.
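On the GPU, this formulation maps naturally onto key-value sort and segmented-reduce primitives. A minimal Thrust sketch on the example segments above: sort_by_key (which typically dispatches to a radix sort for primitive integer keys) groups duplicate row indices, and reduce_by_key plays the role of the segmented reduce. The all-ones values are an assumption made only so the combine step is visible.

// Sketch: resolve duplicates in the concatenated segments with a key-value
// sort followed by reduce_by_key (the "(+)" of the semiring).
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <cstdio>

int main() {
  // Row indices from the slide example: [4 5 6 4 5 6 3 4 5], with one partial
  // product value per index (all 1.0f here just for illustration).
  int h_keys[] = {4, 5, 6, 4, 5, 6, 3, 4, 5};
  thrust::device_vector<int>   keys(h_keys, h_keys + 9);
  thrust::device_vector<float> vals(9, 1.0f);

  // Sort brings duplicate row indices next to each other.
  thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

  // Segmented reduce: one output entry per unique row index.
  thrust::device_vector<int>   out_keys(keys.size());
  thrust::device_vector<float> out_vals(keys.size());
  auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                    out_keys.begin(), out_vals.begin());
  int n_unique = ends.first - out_keys.begin();

  for (int i = 0; i < n_unique; ++i)
    printf("row %d -> %g\n", (int)out_keys[i], (float)out_vals[i]);
  return 0;
}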
Importance of multiway-merge
• Multiway-merge takes ~85.6% of application runtime
• Gather and ewiseMult take 11 ms
For GraphBLAS folks
Should the sparse vector be allowed to contain duplicate entries?
This is the idea behind Gunrock's idempotent discovery, which lets it get away without solving multiway-merge completely: some duplicates are allowed.
Design Considerations
1. Column-based SpMV
2. How to solve multiway-merge
3. Partitioning scheme
3. Partitioning scheme

A = [A1 A2 A3 … Ap] (1D column blocks), x = [x1; x2; x3; …; xp]
y = A1·x1 + A2·x2 + A3·x3 + … + Ap·xp = y(1) + y(2) + y(3) + … + y(p)

Each iteration:
1) GPU i computes y(i) = Ai·xi
2) computes a histogram of send counts
3) sends the send counts to the other GPUs using MPI_Alltoall
4) sends the data to the other GPUs using MPI_Alltoallv
5) each GPU multiway-merges the piece it is responsible for

(Successive builds of this slide highlight the pieces y1 and y2 that GPU1 and GPU2 own after the exchange.)
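On each rank (assuming one MPI rank per GPU), steps 2-4 might look like the sketch below. The exchange_pieces helper and the bucketed send layout are assumptions for illustration, and only values (not row indices) are exchanged to keep the sketch short; this is not the library's actual code.

// Sketch of steps 2-4: exchange send counts with MPI_Alltoall, then exchange
// the partial results with MPI_Alltoallv. Assumes one rank per GPU and that
// the local mXv has already bucketed its output by destination rank.
#include <mpi.h>
#include <numeric>
#include <vector>

void exchange_pieces(const std::vector<std::vector<float>>& send_buckets,
                     std::vector<float>& recv_buf, MPI_Comm comm) {
  int p;  MPI_Comm_size(comm, &p);

  // 2) Histogram of send counts: how much goes to every other GPU.
  std::vector<int> send_counts(p), recv_counts(p);
  for (int r = 0; r < p; ++r) send_counts[r] = (int)send_buckets[r].size();

  // 3) Everyone tells everyone else how much data to expect.
  MPI_Alltoall(send_counts.data(), 1, MPI_INT,
               recv_counts.data(), 1, MPI_INT, comm);

  // 4) Variable-sized exchange of the actual partial results.
  std::vector<int> send_displs(p, 0), recv_displs(p, 0);
  std::partial_sum(send_counts.begin(), send_counts.end() - 1, send_displs.begin() + 1);
  std::partial_sum(recv_counts.begin(), recv_counts.end() - 1, recv_displs.begin() + 1);

  std::vector<float> send_buf;
  for (int r = 0; r < p; ++r)
    send_buf.insert(send_buf.end(), send_buckets[r].begin(), send_buckets[r].end());
  recv_buf.resize(recv_displs[p - 1] + recv_counts[p - 1]);

  MPI_Alltoallv(send_buf.data(), send_counts.data(), send_displs.data(), MPI_FLOAT,
                recv_buf.data(), recv_counts.data(), recv_displs.data(), MPI_FLOAT, comm);
  // 5) The received pieces are then multiway-merged on this GPU.
}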
Block Matrix (Multi-GPU)

On GPU 1, 2, …, p:
Input: Sparse Vector xi, Sparse Matrix Ai
→ mXv (Single GPU): gather columns from Ai, ewiseMult, Multiway-merge
→ Histogram
→ MPI_Alltoall
→ MPI_Alltoallv
→ Multiway-merge
→ Output: Sparse Vector yi
Why 2 multiway-merges? The first merges the partial products of the local mXv on each GPU; the second merges the pieces received from the other GPUs after MPI_Alltoallv.
Experimental Setup
• CPU: 64x 16-core AMD Opteron 6274, 32 GB of main memory
• GPU: 64x NVIDIA K20X, 14 SMs, 192 ALUs per SM, 732 MHz, 6 GB on-board memory
• Interconnect: Gemini interconnect between nodes, 10.4 GB/s (HyperTransport 3.0)
Strong Scaling

(Figure: Billions of Edges Traversed (GTEPS) vs. Number of Processors, 1-64; ideal scaling vs. actual scaling.)
Weak Scaling (Vertex)

(Figure: Billions of Edges Traversed per Processor (GTEPS) vs. Number of Processors, 1-32; ideal scaling vs. actual scaling.)
For 4 GPUs, everything looks fine
• Majority of runtime is compute
• Two biggest iterations brought down from 121 ms to 40 ms
• Compute time > communication time
• MPI calls comprise a sliver of runtime
At 16 GPUs, the picture becomes bleak
• MPI calls now take up ~50% of runtime
• Compute is not overlapped with communication
• GPUs sit idle waiting for other GPUs to finish their compute
Communication vs. Compute

(Figure: relative cost per processor of communication vs. compute for 1, 2, 4, 8, and 16 processors.)
How to bridge this gap?
Work-in-progress
1. Overlap compute with communication: use multiple CUDA streams to make communication asynchronous (see the sketch below)
2. Try other partitioning schemes: variants of 1D column-based partitioning; 2D partitioning
3. Partition the graph better: Reverse Cuthill-McKee
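A minimal sketch of how item 1 might look with two CUDA streams, so that a kernel launch and a device-to-host copy can proceed concurrently; the kernel and function names are illustrative only, not the library's code.

// Sketch of overlapping compute with communication using two CUDA streams:
// while one chunk's results are copied back for MPI, the kernel for the next
// chunk of work can already be running.
#include <cuda_runtime.h>

__global__ void dummy_compute(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

void overlapped_step(float* d_work, float* d_result, float* h_result, int n) {
  cudaStream_t compute_stream, copy_stream;
  cudaStreamCreate(&compute_stream);
  cudaStreamCreate(&copy_stream);

  // Kick off the next chunk of compute...
  dummy_compute<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_work, n);

  // ...while the previous chunk's results are copied to host memory so MPI
  // can send them (h_result should be pinned for a truly asynchronous copy).
  cudaMemcpyAsync(h_result, d_result, n * sizeof(float),
                  cudaMemcpyDeviceToHost, copy_stream);

  cudaStreamSynchronize(copy_stream);     // results ready for MPI_Alltoallv
  cudaStreamSynchronize(compute_stream);
  cudaStreamDestroy(compute_stream);
  cudaStreamDestroy(copy_stream);
}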
Conclusion
• Row-based vs. column-based implementation for mXv: both have their uses.
• Should the sparse vector tolerate duplicate entries if the user does not care about intermediate sparse vectors?
Acknowledgements
This work is funded by:
• DARPA XDATA program, US Army award W911QX-12-C-0059
• DARPA STTR award D15PC00010
• Department of Energy, Office of Science, ASCR Contract No. DE-AC02-05CH11231
Questions?
Aydın Buluç, Lawrence Berkeley National Laboratory, [email protected]
John D. Owens, University of California, Davis, [email protected]
Backup Slides
Applications of Interest
• Bioinformatics
• Social network analysis
• Computer vision
• Fraud detection
• Urban planning
Challenges with Graph Analysis
• Too big to visualize
• Software complexity
• Structural properties vary
  - Small-world: difficult to partition; impacts scalability
  - Mesh: limits scalability of synchronous graph algorithms
1. Row-based mXv (SpMV)

Aᵀ = [0 0 1 0 0; 4 0 0 0 0; 0 0 1 2 1; 1 4 3 0 0; 0 5 2 0 0]
x = [0 0 2 1 4]ᵀ, y = Aᵀx = [2 0 8 6 4]ᵀ

• GPU threads compute each row simultaneously
• Work: O(nnz)
• Suitable for multiplication by a dense vector, because work is independent of vector sparsity
• Direction-optimizing or bottom-up BFS is a special case of this
Dominance of Big Iterations in BFS of Scale-free Graphs
• Two biggest iterations take 91% of runtime
• Other six iterations take 6 ms combined