High-Performance Computation for Path Problems in Graphs
Aydin Buluç
John R. Gilbert
University of California, Santa Barbara
SIAM Conf. on Applications of Dynamical Systems
May 20, 2009
Support: DOE Office of Science, MIT Lincoln Labs, NSF, DARPA, SGI
Horizontal-vertical decomposition [Mezic et al.]
Slide courtesy of Igor Mezic group, UCSB
Combinatorial Scientific Computing
“I observed that most of the coefficients in our matrices were zero; i.e., the nonzeros were ‘sparse’ in the matrix, and that typically the triangular matrices associated with the forward and back solution provided by Gaussian elimination would remain sparse if pivot elements were chosen with care”
- Harry Markowitz, describing the 1950s work on portfolio theory that won the 1990 Nobel Prize for Economics
A few directions in CSC
• Hybrid discrete & continuous computations
• Multiscale combinatorial computation
• Analysis, management, and propagation of uncertainty
• Economic & game-theoretic considerations
• Computational biology & bioinformatics
• Computational ecology
• Knowledge discovery & machine learning
• Relationship analysis
• Web search and information retrieval
• Sparse matrix methods
• Geometric modeling
• . . .
The Parallel Computing Challenge
• LANL / IBM Roadrunner: > 1 PFLOPS
• Two Nvidia 8800 GPUs: > 1 TFLOPS
• Intel 80-core chip: > 1 TFLOPS
Parallelism is no longer optional…
… in every part of a computation.
The Parallel Computing Challenge
• Efficient sequential algorithms for graph-theoretic problems often follow long chains of dependencies
• Several parallelization strategies, but no silver bullet:
– Partitioning (e.g. for preconditioning PDE solvers)
– Pointer-jumping (e.g. for connected components)
– Sometimes it just depends on what the input looks like
• A few simple examples . . .
Sample kernel: Sort logically triangular matrix
• Used in sparse linear solvers (e.g. Matlab’s)
• Simple kernel, abstracts many other graph operations (see next)
• Sequential: linear time, simple greedy topological sort
• Parallel: no known method is efficient in both work and span. The natural approach takes one parallel step per level, and dependent chains can be arbitrarily long
[Figure: original matrix; permuted to unit upper triangular form]
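The greedy sequential sort can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in any solver: the function name and the "list of column sets" input format are made up here. The idea is that the last row of an upper triangular pattern has exactly one nonzero, so the sketch repeatedly peels off such a row together with its column.

```python
from collections import deque

def permute_to_upper_triangular(rows):
    """rows[i] is the set of column indices holding nonzeros in row i.

    Returns (row_order, col_order) such that listing rows and columns in
    that order gives an upper triangular pattern with a nonzero diagonal,
    or None if the pattern is not a permuted triangular matrix.
    """
    n = len(rows)
    col_rows = [[] for _ in range(n)]          # rows touching each column
    for i, cols in enumerate(rows):
        for j in cols:
            col_rows[j].append(i)
    count = [len(cols) for cols in rows]       # remaining nonzeros per row
    col_active = [True] * n
    ready = deque(i for i in range(n) if count[i] == 1)
    row_order, col_order = [], []
    while ready:
        i = ready.popleft()
        if count[i] != 1:                      # its only column was already taken
            return None
        j = next(c for c in rows[i] if col_active[c])
        row_order.append(i)
        col_order.append(j)
        col_active[j] = False
        for r in col_rows[j]:                  # column j leaves the matrix
            count[r] -= 1
            if count[r] == 1:
                ready.append(r)
    if len(row_order) != n:
        return None
    # Rows were peeled from the bottom of the triangle upward.
    return row_order[::-1], col_order[::-1]
```

Each nonzero is touched a constant number of times, so the sketch runs in time linear in the number of nonzeros. The `ready` queue is also where the dependency chain shows up: a row only becomes eligible after the rows below it have been removed.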
Bipartite matching
• Perfect matching: set of edges that hits each vertex exactly once
• Matrix permutation to place nonzeros (or heavy elements) on diagonal
• Efficient sequential algorithms based on augmenting paths
• No known work/span efficient parallel algorithms
[Figure: bipartite graph and matrix A; row-permuted matrix PA with the matched entries on the diagonal]
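A minimal augmenting-path sketch in Python, for illustration only. It is the simple one-search-per-row variant, not one of the faster algorithms the slide alludes to, and the function name and adjacency-list input are assumptions.

```python
def maximum_matching(adj, n_cols):
    """adj[r] lists the columns where row r has a nonzero.

    Returns match_col, where match_col[c] is the row matched to column c
    (or -1 if column c is unmatched).
    """
    match_col = [-1] * n_cols

    def augment(r, seen):
        # Depth-first search for an augmenting path starting at row r.
        for c in adj[r]:
            if c in seen:
                continue
            seen.add(c)
            # Take c if it is free, or if its current row can move elsewhere.
            if match_col[c] == -1 or augment(match_col[c], seen):
                match_col[c] = r
                return True
        return False

    for r in range(len(adj)):
        augment(r, set())
    return match_col
```

If the result contains no -1, the matching is perfect, and placing row `match_col[c]` in position `c` puts a nonzero in every diagonal position of the permuted matrix.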
Strongly connected components
• Symmetric permutation to block triangular form
• Diagonal blocks are strong Hall (irreducible / strongly connected)
• Sequential: linear time by depth-first search [Tarjan]
• Parallel: divide & conquer, work and span depend on input [Fleischer, Hendrickson, Pinar]
[Figure: matrix PAP^T in block triangular form, and the directed graph G(A)]
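For reference, the sequential depth-first approach can be sketched as below. This is a compact recursive rendering of Tarjan's algorithm written for illustration; the function name and adjacency-list input are assumptions, and a production version would avoid recursion on large graphs.

```python
def strongly_connected_components(adj):
    """adj[v] lists the successors of vertex v. Returns a list of components."""
    n = len(adj)
    index = [None] * n       # DFS discovery number of each vertex
    low = [0] * n            # smallest discovery number reachable from v
    on_stack = [False] * n
    stack, components = [], []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack[v] = True
        for w in adj[v]:
            if index[w] is None:
                visit(w)
                low[v] = min(low[v], low[w])
            elif on_stack[w]:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:           # v is the root of a component
            comp = []
            while True:
                w = stack.pop()
                on_stack[w] = False
                comp.append(w)
                if w == v:
                    break
            components.append(sorted(comp))

    for v in range(n):
        if index[v] is None:
            visit(v)
    return components
```

Components come out in reverse topological order of the condensed graph, which is the ordering needed for a block triangular permutation.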
Horizontal-vertical decomposition
• Defined and studied by Mezic et al. in a dynamical systems context
• Strongly connected components, ordered by levels of DAG
• Efficient linear-time sequential algorithms
• No work/span efficient parallel algorithms known
[Figure: 9-vertex example graph arranged in levels 1 to 4, with the corresponding matrix]
Strong components of 1M-vertex RMAT graph
Dulmage-Mendelsohn decomposition
[Figure: example matrix permuted to Dulmage-Mendelsohn block form; row blocks HR, SR, VR and column blocks HC, SC, VC]
Applications of D-M decomposition
• Strongly connected components of directed graphs
• Connected components of undirected graphs
• Permutation to block triangular form for Ax=b
• Minimum-size vertex cover of bipartite graphs
• Extracting vertex separators from edge cuts for arbitrary graphs
• Nonzero structure prediction for sparse matrix factorizations
Strong Hall components are independent of choice of matching
[Figure: example matrix and graphs under two different choices of matching]
The Primitives Challenge
• By analogy to numerical linear algebra. . .
• What should the “combinatorial BLAS” look like?
[Plot: Basic Linear Algebra Subroutines (BLAS), speed (MFlops) vs. matrix size (n), for C = A*B, y = A*x, and μ = x^T y]
Primitives for HPC graph programming
• Visitor-based multithreaded [Berry, Gregor, Hendrickson, Lumsdaine]
+ search templates natural for many algorithms
+ relatively simple load balancing
– complex thread interactions, race conditions
– unclear how applicable to standard architectures
• Array-based data parallel [G, Kepner, Reinhardt, Robinson, Shah]
+ relatively simple control structure
+ user-friendly interface
– some algorithms hard to express naturally
– load balancing not so easy
• Scan-based vectorized [Blelloch]
• We don’t know the right set of primitives yet!
Array-based graph algorithms study [Kepner, Fineman, Kahn, Robinson]
Multiple-source breadth-first search
• Sparse array representation => space efficient
• Sparse matrix-matrix multiplication => work efficient
• Span & load balance depend on matrix-mult implementation
[Figure: example graph with A^T, X, and the product A^T X]
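One step of multiple-source BFS is the product A^T X over the Boolean semiring, with one sparse column of X per source. The Python sketch below illustrates that view. Storing each sparse column as a set, and the function name, are assumptions for illustration; it is a serial sketch, not a parallel implementation.

```python
def multi_source_bfs(adj, sources):
    """adj[u] lists v for each edge u -> v, i.e. A[u][v] = 1.

    Column s of the frontier matrix X is stored as a set of row indices.
    Returns levels, where levels[s][v] is the BFS depth of v from
    sources[s]; unreachable vertices are absent.
    """
    frontier = [{s} for s in sources]          # X: one sparse column per source
    levels = [{s: 0} for s in sources]
    depth = 0
    while any(frontier):
        depth += 1
        next_frontier = []
        for col, seen in zip(frontier, levels):
            # (A^T X)[v, s] = OR over u of (A[u][v] AND X[u, s]),
            # masked to vertices this source has not reached yet.
            new = {v for u in col for v in adj[u] if v not in seen}
            for v in new:
                seen[v] = depth
            next_frontier.append(new)
        frontier = next_frontier
    return levels
```

Only nonzeros of X and A are touched, which is the sense in which the sparse formulation is space and work efficient. How the columns are distributed and multiplied in parallel is exactly the part this sketch leaves out.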
Matrices over semirings
• Matrix multiplication C = AB (or matrix/vector):
$C_{i,j} = A_{i,1} B_{1,j} + A_{i,2} B_{2,j} + \cdots + A_{i,n} B_{n,j}$
• Replace scalar operations $\times$ and $+$ by
$\otimes$ : associative, distributes over $\oplus$, identity 1
$\oplus$ : associative, commutative, identity 0 annihilates under $\otimes$
• Then $C_{i,j} = A_{i,1} \otimes B_{1,j} \oplus A_{i,2} \otimes B_{2,j} \oplus \cdots \oplus A_{i,n} \otimes B_{n,j}$
• Examples: $(\times, +)$ ; (and, or) ; $(+, \min)$ ; . . .
• Same data reference pattern and control flow
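To make the "same data reference pattern and control flow" point concrete, here is a small dense Python sketch in which the semiring is passed in as parameters. The function name and argument order are assumptions made for illustration.

```python
import operator

def semiring_matmul(A, B, add, mul, zero):
    """C[i][j] = add-reduction over k of mul(A[i][k], B[k][j]).

    `zero` is the identity of `add`; it must annihilate under `mul`.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[zero] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            a = A[i][k]
            if a == zero:        # zero annihilates, so this term contributes nothing
                continue
            for j in range(p):
                C[i][j] = add(C[i][j], mul(a, B[k][j]))
    return C

INF = float("inf")

def plus_times(A, B):            # ordinary matrix product
    return semiring_matmul(A, B, operator.add, operator.mul, 0)

def min_plus(A, B):              # one relaxation step for shortest paths
    return semiring_matmul(A, B, min, operator.add, INF)

def or_and(A, B):                # one step of Boolean reachability
    return semiring_matmul(A, B, operator.or_, operator.and_, False)
```

The three wrappers differ only in the operators and the zero element; the loop nest is untouched.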
SpGEMM: Sparse Matrix x Sparse Matrix [Buluç, G]
• Shortest path calculations (APSP)
• Betweenness centrality
• BFS from multiple source vertices
• Subgraph / submatrix indexing
• Graph contraction
• Cycle detection
• Multigrid interpolation & restriction
• Colored intersection searching
• Applying constraints in finite element modeling
• Context-free parsing
Distributed-memory parallel sparse matrix multiplication
[Figure: block update C_ij += A_ik * B_kj]
• 2D block layout
• Outer product formulation
• Sequential “hypersparse” kernel
• Asynchronous MPI-2 implementation
• Experiments: TACC Lonestar cluster
• Good scaling to 256 processors
[Plot: time vs. number of cores, 1M-vertex RMAT]
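The outer product formulation can be illustrated serially. The sketch below is not the distributed or hypersparse kernel from the talk; the dictionary-based storage and the function name are assumptions chosen to keep the example short.

```python
from collections import defaultdict

def spgemm_outer(A_cols, B_rows):
    """Sparse C = A * B as a sum of outer products.

    A_cols[k] = {i: A[i][k]} holds the nonzeros of column k of A.
    B_rows[k] = {j: B[k][j]} holds the nonzeros of row k of B.
    Returns C as {(i, j): value}.
    """
    C = defaultdict(float)
    # Only indices k with a nonempty column of A and row of B contribute.
    for k in A_cols.keys() & B_rows.keys():
        for i, a in A_cols[k].items():
            for j, b in B_rows[k].items():
                C[(i, j)] += a * b         # C_ij += A_ik * B_kj
    return dict(C)
```

Work is proportional to the number of scalar multiplications that are actually nonzero, independent of the matrix dimension.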
All-Pairs Shortest Paths
• Directed graph with “costs” on edges
• Find least-cost paths between all reachable vertex pairs
• Several classical algorithms with
– Work ~ matrix multiplication
– Span ~ $\log^2 n$
• Case study of implementation on multicore architecture:
– graphics processing unit (GPU)
GPU characteristics
• Powerful: two Nvidia 8800s > 1 TFLOPS
• Inexpensive: $500 each
But:
• Difficult programming model: one instruction stream drives 8 arithmetic units
• Performance is counterintuitive and fragile: memory access pattern has subtle effects on cost
• Extremely easy to underutilize the device: doing it wrong easily costs 100x in time
[Figure: threads t1 through t16]
Recursive All-Pairs Shortest Paths
[Figure: matrix partitioned into 2×2 blocks A, B, C, D, with the corresponding partition of the graph]
A = A*; % recursive call
B = AB; C = CA;
D = D + CB;
D = D*; % recursive call
B = BD; C = DC;
A = A + BC;
+ is “min”, × is “add”
Based on R-Kleene algorithm
Well suited for GPU architecture:
• Fast matrix-multiply kernel
• In-place computation => low memory bandwidth
• Few, large MatMul calls => low GPU dispatch overhead
• Recursion stack on host CPU, not on multicore GPU
• Careful tuning of GPU code
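The recursion can be written out on the CPU as a plain Python sketch over the (min, +) semiring. It only illustrates the block structure and is not the tuned GPU code. The function names are assumptions, and it assumes there are no negative cycles, so a vertex reaches itself at cost 0.

```python
INF = float("inf")

def min_plus(X, Y):
    """Z[i][j] = min over k of X[i][k] + Y[k][j]."""
    cols = range(len(Y[0]))
    return [[min(x + Y[k][j] for k, x in enumerate(row)) for j in cols]
            for row in X]

def elementwise_min(X, Y):
    return [[min(a, b) for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def apsp_recursive(M):
    """All-pairs shortest path lengths; M[i][j] is the edge cost or INF."""
    n = len(M)
    if n == 1:
        return [[min(0, M[0][0])]]           # a vertex reaches itself at cost 0
    h = n // 2
    A = [row[:h] for row in M[:h]]
    B = [row[h:] for row in M[:h]]
    C = [row[:h] for row in M[h:]]
    D = [row[h:] for row in M[h:]]
    A = apsp_recursive(A)                    # A = A*
    B = min_plus(A, B)                       # B = AB
    C = min_plus(C, A)                       # C = CA
    D = elementwise_min(D, min_plus(C, B))   # D = D + CB
    D = apsp_recursive(D)                    # D = D*
    B = min_plus(B, D)                       # B = BD
    C = min_plus(D, C)                       # C = DC
    A = elementwise_min(A, min_plus(B, C))   # A = A + BC
    return [ra + rb for ra, rb in zip(A, B)] + [rc + rd for rc, rd in zip(C, D)]
```

Almost all the time goes into the six `min_plus` calls per level, which is why a fast matrix-multiply kernel dominates performance. Unlike the in-place version on the slide, this sketch allocates new blocks at every step.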
Execution of Recursive APSP
APSP: Experiments and observations
128-core Nvidia 8800, speedup relative to. . .
• 1-core CPU: 120x – 480x
• 16-core CPU: 17x – 45x
• Iterative, 128-core GPU: 40x – 680x
• MSSSP, 128-core GPU: ~3x
Conclusions:
• High performance is achievable but not simple
• Carefully chosen and optimized primitives will be key
[Plot: time vs. matrix dimension]
H-V decomposition
A span-efficient, but not work-efficient, method for H-V decomposition uses APSP to determine reachability…
[Figure: 9-vertex example graph arranged in levels 1 to 4, with the corresponding matrix]
Reachability: Transitive closure
[Figure: 9-vertex example graph arranged in levels 1 to 4, with the transitive closure of its matrix]
• APSP => transitive closure of adjacency matrix
• Strong components identified by symmetric nonzeros
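A small Python sketch of both bullets follows. For brevity it computes the closure with Warshall's algorithm rather than the recursive APSP kernel, and the function names are assumptions; vertices u and v share a component exactly when the closure has nonzeros at both (u, v) and (v, u).

```python
def transitive_closure(A):
    """Boolean reachability matrix; every vertex reaches itself."""
    n = len(A)
    R = [[bool(A[i][j]) or i == j for j in range(n)] for i in range(n)]
    for k in range(n):                       # Warshall's algorithm
        for i in range(n):
            if R[i][k]:
                for j in range(n):
                    R[i][j] = R[i][j] or R[k][j]
    return R

def strong_components_from_closure(R):
    """label[v] = smallest vertex in v's strong component."""
    n = len(R)
    label = [-1] * n
    for u in range(n):
        if label[u] == -1:
            for v in range(u, n):
                if R[u][v] and R[v][u]:      # symmetric nonzero pair
                    label[v] = u
    return label
```

This does cubic work on a dense matrix, which is the "not work-efficient" part: the sequential depth-first method needs only linear time.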
H-V structure: Acyclic condensation
[Figure: acyclic condensation of the example graph arranged in levels 1 to 4, with the condensed matrix]
• Acyclic condensation is a sparse matrix-matrix product
• Levels identified by “APSP” for longest paths
• Practically speaking, a parallel method would compromise between work and span efficiency
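A serial Python sketch of the two steps follows. If S is the vertex-by-component membership matrix, the condensation has the nonzero pattern of S^T A S with the diagonal dropped, and levels follow from longest paths under (max, +). The function names are made up here, and numbering source components as level 1 is an assumption.

```python
def condensation(edges, label):
    """Off-diagonal nonzero pattern of S^T A S.

    edges: iterable of (u, v) pairs; label[v]: component id of vertex v.
    """
    return {(label[u], label[v]) for u, v in edges if label[u] != label[v]}

def dag_levels(nodes, dag_edges):
    """level[c] = 1 + number of edges on the longest path ending at c."""
    level = {c: 1 for c in nodes}
    for _ in range(len(level)):              # longest path has < len(level) edges
        changed = False
        for p, c in dag_edges:
            if level[p] + 1 > level[c]:      # (max, +) relaxation
                level[c] = level[p] + 1
                changed = True
        if not changed:
            break
    return level
```

For example, with `label = [0, 0, 0, 3, 3, 5]` the call `dag_levels(set(label), condensation(edges, label))` assigns one level per component. The relaxation loop stands in for the "APSP" of the slide and is sequential, so it illustrates the definition rather than a span-efficient method.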
Remarks
• Combinatorial algorithms are pervasive in scientific computing and will become more so.
• Path computations on graphs are powerful tools, but efficiency is a challenge on parallel architectures.
• Carefully chosen and implemented primitive operations are key.
• Lots of exciting opportunities for research!