Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A....
Transcript of Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A....
![Page 1: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/1.jpg)
1
Linear Algebraic Primitives for Parallel Computing on Large Graphs
BuluçPh.D. Defense March 16, 2010
Committee: John R. Gilbert (Chair)ÖmerFred ChongShivkumar Chandrasekaran
![Page 2: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/2.jpg)
Sources of Massive Graphs
(WWW snapshot, courtesy Y. Hyun) (Yeast protein interaction network, courtesy H. Jeong)
Graphs naturally arise from the internet and social interactions
Many scientific (biological, chemical,cosmological, ecological, etc) datasets are modeled as graphs.
![Page 3: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/3.jpg)
Examples:- Centrality- Shortest paths- Network flows- Strongly Connected
Components
Examples:
- Loop and multi edge removal- Triangle/Rectangle enumeration
3
Types of Graph Computations
Fuzzy intersectionExamples: Clustering,
Algebraic Multigrid
Tool: GraphTraversal Tool: Map/Reduce
1 3
4
2 5
7
6
Tightly coupled
Filtering based
map map map
reduce reduce
![Page 4: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/4.jpg)
Tightly Coupled Computations
A. Computationally intensive graph/data mining algorithms.
(e.g. graph clustering, centrality, dimensionality reduction)
B. Inherently latency-bound computations.
(e.g. finding s-t connectivity)
4
Huge Graphs Expensive Kernels+ High Performance and Massive Parallelism
Tightly Coupled Computations
Let the special purpose architectures handle (B), but what can we do for (A) with the commodity architectures?
We need memory ! Disk is not an option !
![Page 5: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/5.jpg)
555
Software for Graph Computation
spending ten years of my life on the TeX project is that software is hard. It's harder than anything
![Page 6: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/6.jpg)
666
Software for Graph Computation
spending ten years of my life on the TeX project is that software is hard. It's harder than anything
Dealing with software is hard !
![Page 7: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/7.jpg)
777
Software for Graph Computation
spending ten years of my life on the TeX project is that software is hard. It's harder than anything
Dealing with software is hard !
High performance computing (HPC)
software is harder !
![Page 8: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/8.jpg)
888
Software for Graph Computation
spending ten years of my life on the TeX project is that software is hard. It's harder than anything
Dealing with software is hard !
High performance computing (HPC)
software is harder !
Deal with parallel HPC software?
![Page 9: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/9.jpg)
Tightly Coupled Computations
9
Parallel Graph Software
Shared memory scales, not it pays off to target productivity
Distributed memory should be targeting p > 100 at least
Shared and distributed memory complement each other.
Coarse-grained parallelism for distributed memory.
![Page 10: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/10.jpg)
Outline
The Case for PrimitivesThe Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
10
![Page 11: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/11.jpg)
11
Input:Find least-cost paths between all reachable vertex pairsClassical algorithm: Floyd-Warshall
Case study of implementation on multicore architecture:graphics processing unit (GPU)
All-Pairs Shortest Paths
for k=1:n // the induction sequencefor i = 1:n
for j = 1:nif(w(i k)+w(k ) < w(i ))
w(i ):= w(i k) + w(k )
1 52 3 41
5
2
3
4
k = 1 case
![Page 12: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/12.jpg)
12
GPU characteristics
Powerful: two Nvidia 8800s > 1 TFLOPS
Inexpensive: $500 each
Difficult programming model: One instruction stream drives 8 arithmetic units
Performance is counterintuitive and fragile:Memory access pattern has subtle effects on cost
Extremely easy to underutilize the device:Doing it wrong easily costs 100x in time
t1
t3t2
t4
t6t5
t7
t9t8
t10
t12t11
t13t14
t16t15
But:
![Page 13: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/13.jpg)
13
Recursive All-Pairs Shortest Paths
A B
C DA
BD
C
A = A*; % recursive callB = AB; C = CA; D = D + CB;D = D*; % recursive callB = BD; C = DC;A = A + BC;
+ ×
Based on R-Kleene algorithm
Well suited for GPU architecture:Fast matrix-multiply kernelIn-place computation => low memory bandwidthFew, large MatMul calls => low GPU dispatch overheadRecursion stack on host CPU, not on multicore GPUCareful tuning of GPU code
![Page 14: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/14.jpg)
The Case for Primitives
14
480x
Lifting Floyd-Warshallto GPU
The right primitive !
Conclusions:
High performance is achievable but not simple
Carefully chosen and optimized primitives will be key
Unorthodox R-Kleene algorithm
APSP: Experiments and observations
![Page 15: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/15.jpg)
Outline
The Case for Primitives
The Case for Sparse MatricesParallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
15
![Page 16: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/16.jpg)
16
Sparse Adjacency Matrix and Graph
Every graph is a sparse matrix and vice-versa
Adjacency matrix: sparse array w/ nonzeros for graph edges
Storage-efficient implementation from sparse data structures
x
1 2
3
4 7
6
5
AT
![Page 17: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/17.jpg)
The Case for Sparse Matrices
Many irregular applications contain su cient coarse-grained parallelism that can ONLY be exploited using abstractions at proper level.
17
Traditional graph computations
Graphs in the language of linear algebra
Data driven. Unpredictable communication.
Fixed communication patterns.
Irregular and unstructured. Poor locality of reference
Operations on matrix blocks. Exploits memory hierarchy
Fine grained data accesses. Dominated by latency
Coarse grained parallelism. Bandwidth limited
The Case for Sparse Matrices
![Page 18: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/18.jpg)
Identification of Primitives
Sparse matrix-matrix Multiplication (SpGEMM)
Element-wise operations
18
Linear Algebraic Primitives
x
Matrices on semirings, e.g. ( , +), (and, or), (+, min)
Sparse matrix-vector multiplication
Sparse Matrix Indexing
x
.*
![Page 19: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/19.jpg)
Why focus on SpGEMM?
Graph clustering (MCL, peer pressure)
Subgraph / submatrix indexingShortest path calculations
Betweenness centralityGraph contraction
Cycle detection
Multigrid interpolation & restriction
Colored intersection searching
Applying constraints in finite element computations
Context-free parsing ...
19
Applications of Sparse GEMM
11
11
1x x
Loop until convergenceA = A * A; % expandA = A .^ 2; % inflate sc = 1./sum(A, 1);A = A * diag(sc); % renormalizeA(A < 0.0001) = 0; % prune entries
![Page 20: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/20.jpg)
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix MultiplicationSoftware Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
20
![Page 21: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/21.jpg)
21
Two Versions of Sparse GEMM
A1 A2 A3 A4 A7A6A5 A8 B1 B2 B3 B4 B7B6B5 B8 C1 C2 C3 C4 C7C6C5 C8
j
x =i
kk
Cij
Cij += Aik Bkj
Ci = Ci + A Bi
x =1D block-column distribution
Checkerboard (2D block) distribution
![Page 22: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/22.jpg)
Comparative Speedup of Sparse 1D & 2D
22
In practice, 2D algorithms have the potential to scale, but not linearly
Projected performances of Sparse 1D & 2D
N P
1D 2D
NP
pppncncpcnnc
TTpWE
commcomp22
2
lg)()(
![Page 23: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/23.jpg)
Stores entries in column-major order
Dense collection of
Uses storage.23
Compressed Sparse Columns (CSC):A Standard Layout
Column pointers
data
810
2 3rowindn×n matrix with nnz nonzeroes
n
0 2 3 4 0 1 5 7 3 4 5 4 5 6 7
04
1112131617
nnznO )(
![Page 24: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/24.jpg)
24
Submatrices are hypersparse (i.e. nnz << n)
blocks
blocks
Total Storage:
Average of c nonzeros per column
A data structure or algorithm that depends on the matrix dimension n (e.g. CSR or CSC) is asymptotically too wasteful for submatrices
Node Level Considerations
![Page 25: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/25.jpg)
Sequential Kernel
Strictly O(nnz) data structure Outer-product formulation Work-efficient
25
X
flops
nnz
n
Standard
Sequential Hypersparse Kernel
))()(lg( BnzrAnzcnif lops
))(( mnBnnzf lops
New hypersparse kernel:
![Page 26: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/26.jpg)
26
Scaling Results for SpGEMM
RMat X RMat product (graphs with high variance on degrees)
Random permutations useful for the overall computation (<10% imbalance).
Bulk synchronous algorithms may still suffer due to imbalance within the stages. But this goes away as matrices get larger
0
20
40
60
80
100
120
140
160
180
200
0 200 400 600 800 1000 1200 1400 1600
Speedup
Processors
SCALE 21 -‐ 16way
SCALE 22 -‐ 16way
SCALE 22 -‐ 4way
SCALE 23 -‐ 16way
![Page 27: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/27.jpg)
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLASAn Application in Social Network Analysis
Other Work
Future Directions
27
![Page 28: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/28.jpg)
28
Software design of theCombinatorial BLAS
Generality, of the numeric type of matrix elements, algebraic operation performed, and the library interface.Without the language abstraction penalty: C++ Templates
Achieve mixed precision arithmetic: Type traitsEnforcing interface and strong type checking: CRTPGeneral semiring operation: Function Objects
template <class IT, class NT, class DER>class SpMat;
Abstraction penalty is not just a programming language issue.
In particular, view matrices as indexed data structures and stay away from single element access (Interface should discourage)
![Page 29: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/29.jpg)
29
Extendability, of the library while maintaining compatibility and seamless upgrades.➡ Decouple parallel logic from the sequential part.
TuplesCSC DCSC
SpSeq
Commonalities:- Support the sequential API- Composed of a number of arrays
SpSeq
SpPar<Comm, SpSeq>
Any parallel logic: asynchronous, bulk synchronous, etc
Software design of theCombinatorial BLAS
Each *may* support an iterator concept
![Page 30: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/30.jpg)
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network AnalysisOther Work
Future Directions
30
![Page 31: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/31.jpg)
Applications and Algorithms
31
Betweenness Centrality (BC)CB(v): Among all the shortest paths, what fraction of them pass through the node of interest?
Brandes
A typical software stack for an application enabled with the Combinatorial BLAS
Social Network Analysis
![Page 32: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/32.jpg)
32
6
XAT (ATX).*¬X
1 2
3
4 7 5
Betweenness Centrality using Sparse GEMM
Parallel breadth-first search is implemented with sparse matrix-matrix multiplication
Work efficient algorithm for BC
![Page 33: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/33.jpg)
33
BC Performance onDistributed-memory
TEPS: Traversed Edges Per SecondBatch of 512 vertices at each iterationCode only a few lines longer than Matlab version
0
50
100
150
200
250
25 36 49 64 81 100 121 144 169 196 225 256 289 324 361 400 441 484
TEPS score
Millions
Number of Cores
BC performance
Scale 17Scale 18Scale 19Scale 20
Input: RMAT scale N2N verticesAverage degree 8
Pure MPI-1 version. No reliance on any particular hardware.
![Page 34: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/34.jpg)
34
More scaling and sensitivity (Betweenness Centrality)
Input is an RMat graph of scale 22
0
50
100
150
200
250
300
350
400
196 256 324 400 484 576 676 784 900
Millions of TEPS
Number of Processing Cores
Sensitivity to batch processing
Batch=128 Batch=256 Batch=512 Batch=1024Experiment with 512 batch nodes
0
50
100
150
200
250
300
350
400
196 366 536 706 876 1046 1216
Scaling beyond hundreds
Scale 21 Scale 22 Scale 23 Scale 24
![Page 35: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/35.jpg)
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other WorkFuture Directions
35
![Page 36: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/36.jpg)
36
SpMV on Multicore
Our parallel algorithms for y ATx thenew compressed sparse blocks (CSB) layout have
span, and work, yielding parallelism.)lg/( nnnnz
)(nnz)lg( nn
0
100
200
300
400
500
600
1 2 3 4 5 6 7 8
MFlops/sec
Processors
Our CSB algorithms
Star-P (CSR+blockrow distribution)
Serial (Naïve CSR)
![Page 37: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/37.jpg)
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
37
![Page 38: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/38.jpg)
3838
Future Directions
‣Novel scalable algorithms
‣Static graphs are just the beginning.Dynamic graphs, Hypergraphs, Tensors
‣ Architectures (mainly nodes) are evolvingHeterogeneous multicoresHomogenous multicores with more cores per node
TACC Lonestar (2006)
4 cores / node
TACC Ranger (2008)
16 cores / node
SDSC Triton (2009)
32 cores / nodeXYZ Resource (2020)
Hierarchical parallelism
![Page 39: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/39.jpg)
3939
‣What did I learn about parallel graph computations?
‣No silver bullets.
‣Graph computations are not homogenous.
‣No free lunch.
‣ If communication >>computation for your algorithm, overlapping them will NOT make it scale.
‣Scale your algorithm, not your implementation.
‣ Any sustainable speedup is a success !
‣
![Page 40: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/40.jpg)
40
Related Publications
Hypersparsity in 2D decomposition, sequential kernel.B., Gilbert, "On the Representation and Multiplication of Hypersparse
Parallel analysis of sparse GEMM, synchronous implementationB., Gilbert, "Challenges and Advances in Parallel Sparse Matrix-
The case for primitives, APSP on the GPUB., Gilbert, Budak, , Parallel Computing, 2009
SpMV on MulticoresB., Fineman, Frigo, Gilbert, Leiserson, "Parallel Sparse Matrix-Vector and Matrix-Transpose-
Betweenness centrality resultsB., Gilbert, -
Software design of the libraryB., Gilbert,
![Page 41: Linear Algebraic Primitives for Parallel Computing on ...aydin/Buluc_PhDDefense.pdf · A. Computationally intensive graph/data mining algorithms. (e.g. graph clustering, centrality,](https://reader033.fdocuments.us/reader033/viewer/2022060222/5f0791757e708231d41da1a9/html5/thumbnails/41.jpg)
41
David Bader, Erik Boman, Ceren Budak, Alan Edelman, Jeremy Fineman, MatteoFrigo, Bruce Hendrickson, Jeremy
Kepner, Charles Leiserson, KameshMadduri, Steve Reinhardt, Eric Robinson, Viral
Shah, Sivan Toledo