Reducing Communication in Sparse Matrix Operations
2018 Blue Waters Symposium
Luke Olson, Department of Computer Science, University of Illinois at Urbana-Champaign
Collaborators on this allocation:
Amanda Bienz, University of Illinois at Urbana-Champaign
Bill Gropp, University of Illinois at Urbana-Champaign
Andrew Reisner, University of Illinois at Urbana-Champaign
Lukas Spies, University of Illinois at Urbana-Champaign
Sparse Matrix Operations
Application areas: time stepping (Figure: XPACC @ Illinois), PCA / clustering (Figure: MD Anderson), linear systems (Figure: Fischer @ Illinois), eigenanalysis (Figure: QMCpack)
Core kernels:
• w ← A v (sparse matrix-vector multiplication, SpMV)
• C ← A B
• C ← R A Rᵀ
• w ← A⁻¹ v
What is this talk about? (Why it matters)
• 10s, 100s, 1000s, … of SpMVs in a computation
• SpMV is a major kernel, but its efficiency and scalability are limited
• Use machine layout (nodes) on Blue Waters to reduce communication
• Use consistent timings on Blue Waters to develop accurate performance models
Iterative method for solving Ax = b (preconditioned conjugate gradient):
while not converged:
    α ← ⟨r, z⟩ / ⟨Ap, p⟩
    x ← x + α p
    r⁺ ← r − α A p
    z⁺ ← precond(r⁺)
    β ← ⟨r⁺, z⁺⟩ / ⟨r, z⟩
    p ← z⁺ + β p
• One SpMV per iteration; a solve takes 2 … 10 … 100 SpMVs (see the sketch below)
• For communication-avoiding (CA) algorithms, see Eller/Gropp
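A minimal sketch of the loop above in Python (NumPy/SciPy); the 1D Poisson matrix and Jacobi preconditioner are illustrative stand-ins, not the problems from this talk. It highlights the single SpMV performed each iteration.

import numpy as np
import scipy.sparse as sp

def pcg(A, b, precond, tol=1e-8, maxiter=100):
    """Preconditioned conjugate gradient; one SpMV (A @ p) per iteration."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        Ap = A @ p                      # the SpMV kernel
        alpha = rz / (Ap @ p)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = precond(r)                  # apply the preconditioner
        rz_new = r @ z
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x

# Illustrative usage: 1D Poisson with a Jacobi (diagonal) preconditioner.
n = 1000
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
Dinv = 1.0 / A.diagonal()
x = pcg(A, b, lambda r: Dinv * r)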
Anatomy of a Sparse Matrix-Vector (SpMV) product: w ← A v
Figure: matrix partitioned by rows across processes p = 0, …, 5
• Solid blocks: on-process portion
• Patterned blocks: off-process portion (requires communication of the input vector)
Figure: data layout across processes P0–P3 and where data is sent
• Basic SpMV: rows-per-process layout (see the sketch below)
• Modeling is difficult (more later)
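A minimal mpi4py sketch of the rows-per-process layout; the global matrix, the allgather of the full vector, and the variable names are assumptions for clarity (a production SpMV sends only the needed off-process entries point-to-point).

from mpi4py import MPI
import numpy as np
import scipy.sparse as sp

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

# Global 1D Poisson matrix, partitioned by contiguous blocks of rows.
N = 16 * nprocs
my_rows = np.array_split(np.arange(N), nprocs)[rank]
A_global = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(N, N), format="csr")
A_local = A_global[my_rows, :]                # this rank's rows of A
v_local = np.ones(len(my_rows))               # this rank's slice of v

# Columns outside [first, last] touch off-process vector entries and
# therefore require communication of the input vector.
first, last = my_rows[0], my_rows[-1]
cols = np.unique(A_local.indices)
off_proc_cols = cols[(cols < first) | (cols > last)]

# Simplest possible exchange (for illustration only): gather the whole vector.
v_global = np.concatenate(comm.allgather(v_local))

# Local computation: on-process plus off-process contributions.
w_local = A_local @ v_global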
Cost of a Sparse Matrix-Vector (SpMV) product
Figure: percentage of time in communication for the SpMV and the all-reduce at 500K, 100K, and 50K non-zeros per core (matrix nlpkkt240), alongside the process-to-process communication pattern (process ID vs. process ID).
Case Study: Preconditioning (Algebraic Multigrid)
• AMG: Algebraic Multigrid iteratively whittles away at the error
• Series or hierarchy of successively smaller (and more dense) sparse matrices
• SpMV dominated
Figure: AMG V-cycle with matrices A0, A1, A2, A3 and relaxation sweeps of the form x ← x + ω Aℓ r at each level on the way down and back up the hierarchy.
Average non-zeros per row (nnz / n) by level:
• Level 0: 30
• Level 1: 64
• Level 2: 66
• Level 3: 26
Case Study: Preconditioning (Algebraic Multigrid)
• MFEM discretization
• Linear elasticity
• 8192 cores, 512 nodes, 10k dof / core
Figure: time (seconds) per level in the AMG hierarchy (levels 0–25).
Smaller matrices == more communication
Observation 1: message volume between processes
Figure (two panels): maximum number of messages and maximum message size (bytes) per AMG level (levels 0–20).
1. High message volume and high message count
2. Diminishing returns with more communicating cores
3. Cost: off-node > on-node > on-socket
Observation 2: limits of communication
T = α + (ppn · s) / min(R_N, ppn · R_B)
• α: latency
• s: message size
• R_B: bandwidth between two processes
• R_N: node injection bandwidth
Modeling MPI Communication Performance on SMP Nodes: Is it Time to Retire the Ping Pong Test?, Gropp, Olson, Samfass, EuroMPI 2016.
1. High message volume and high message count
2. Diminishing returns with more communicating cores
3. Cost: off-node > on-node > on-socket
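A small helper evaluating the max-rate model above; the latency and bandwidth values are placeholders to be measured, not Blue Waters constants.

def max_rate_time(s, ppn, alpha=1e-6, R_B=1e9, R_N=8e9):
    """Modeled time T = alpha + (ppn * s) / min(R_N, ppn * R_B) for ppn
    processes on a node each sending s bytes; the parameter values here
    are illustrative placeholders, not measured Blue Waters numbers."""
    return alpha + (ppn * s) / min(R_N, ppn * R_B)

# With few processes the pair bandwidth R_B limits the rate; with many,
# the node injection bandwidth R_N becomes the bottleneck.
print(max_rate_time(s=65536, ppn=1))
print(max_rate_time(s=65536, ppn=16))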
Observation 3: node locality
Figure: message time (seconds) vs. number of bytes communicated (1 to 1e6), split into Network (PPN ≥ 4), Network (PPN < 4), On-Node, and On-Socket.
• Split messages by protocol: short, eager, rendezvous
• Partition messages by locality: on-socket, on-node, and off-node
1. High message volume and high message count
2. Diminishing returns with more communicating cores
3. Cost: off-node > on-node > on-socket
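One way to use these observations in a performance model is to select latency/bandwidth parameters by both protocol and locality, as sketched below; the protocol cutoffs and all constants are placeholders, not the measured Blue Waters values.

# Postal-style estimate T = alpha + beta * s, with (alpha, beta) chosen by
# message protocol (short/eager/rendezvous) and locality (socket/node/network).
SHORT_LIMIT = 512            # assumed protocol cutoffs (bytes)
EAGER_LIMIT = 8 * 1024

PARAMS = {                   # (alpha [s], beta [s/byte]) -- placeholder values
    ("short", "on-socket"): (4e-7, 2e-10),
    ("short", "on-node"):   (6e-7, 3e-10),
    ("short", "network"):   (2e-6, 8e-10),
    ("eager", "on-socket"): (6e-7, 2e-10),
    ("eager", "on-node"):   (9e-7, 3e-10),
    ("eager", "network"):   (3e-6, 8e-10),
    ("rend",  "on-socket"): (1e-6, 2e-10),
    ("rend",  "on-node"):   (2e-6, 3e-10),
    ("rend",  "network"):   (5e-6, 8e-10),
}

def message_time(nbytes, locality):
    proto = ("short" if nbytes <= SHORT_LIMIT
             else "eager" if nbytes <= EAGER_LIMIT else "rend")
    alpha, beta = PARAMS[(proto, locality)]
    return alpha + beta * nbytes

print(message_time(65536, "network"), message_time(65536, "on-socket"))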
Anatomy of a node-level SpMV product
Figure: six processes P0–P5 distributed across three nodes N0, N1, N2; the linear system is distributed across the processes.
Figure: w ← A v with the rows of A and the corresponding vector entries split across processes P0–P5.
Standard Communication
Figure: each process on node n (e.g., p and q) sends its messages directly across the network to processes on node m.
New Algorithm: On-Node Communication
Figure: values destined for other nodes are first gathered within node n (here, onto process p).
New Algorithm: Off-Node Communication
Figure sequence: the gathered values travel in fewer, larger messages from process p on node n to process q on node m, which then redistributes them within node m.
Node-Aware Parallel (NAP) Matrix Operation
1. Redistribute initial values
2. Inter-node communication
3. Redistribute received values
4. On-node communication
5. Local computation with the on-process, on-node, and off-node portions of the matrix
Note: step 4 and portions of step 5 can overlap with steps 1, 2, and 3
Figure: the steps illustrated with processes p and q on nodes n and m
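A minimal mpi4py sketch of this idea, assuming a shared-memory split yields the node-local communicator and rank 0 of each node acts as the aggregator; the variable names and the single-leader choice are illustrative, not RAPtor's implementation.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
# Processes sharing a node form their own communicator; its rank 0 is the "leader".
node_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
leader = node_comm.Get_rank() == 0

# Illustrative payload each process must send to some other node.
my_off_node_data = np.full(4, comm.Get_rank(), dtype="d")

# 1) Redistribute initial values: collect this node's outgoing data on its leader.
node_outgoing = node_comm.gather(my_off_node_data, root=0)

# 2) Inter-node communication: only one process per node touches the network.
#    (An allgather among leaders stands in for the targeted sends a real code uses.)
leader_comm = comm.Split(0 if leader else MPI.UNDEFINED, comm.Get_rank())
from_other_nodes = leader_comm.allgather(node_outgoing) if leader else None

# 3)-4) Redistribute received values to the processes on this node.
from_other_nodes = node_comm.bcast(from_other_nodes, root=0)

# 5) Local computation would now combine the on-process, on-node, and
#    off-node portions of the matrix with the corresponding vector entries.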
Case Study: Preconditioning (Algebraic Multigrid)
Maximum number of messages sent from any process on 16,384 processes
Figure (two panels): maximum number of on-node messages and of off-node messages per AMG level (levels 0–20), reference SpMV vs. TAPSpMV.
Case Study: Preconditioning (Algebraic Multigrid)
Maximum size of messages sent from any process on 16,384 processes
Figure (two panels): maximum on-node message size and off-node message size (bytes) per AMG level (levels 0–20), reference SpMV vs. TAPSpMV.
Case Study: Preconditioning (Algebraic Multigrid)
Figure (two panels): total SpMV time per AMG level (levels 0–20) and strong scaling of total time vs. number of processes (up to ~18,000), reference SpMV vs. TAPSpMV.
Node aware sparse matrix-vector multiplication, Bienz, Gropp, Olson, in review at JPDC, 2018 (arXiv).
Cost analysis on Blue Waters
• Blue Waters provided a unique setting for two aspects:
  1. Model MPI queueing times
  2. Model network contention
Figure (two panels): time (seconds) vs. number of messages communicated (1 to 10,000), for message sizes from 16 bytes to 262,144 bytes.
• The MPI Irecv message queue is costly
• Identified a cost that grows quadratically with the number of messages
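One illustrative way to expose such a cost is to fit measured times against the number of posted receives and inspect the quadratic term; the "measurements" below are synthetic placeholders, not Blue Waters data.

import numpy as np

# Fit T(q) = a0 + a1*q + a2*q^2, where q is the number of queued messages;
# a clearly nonzero a2 signals a quadratic queue-search cost.
q = np.array([1, 10, 100, 1000, 10000], dtype=float)
t = 1e-6 + 2e-9 * q + 5e-13 * q**2        # synthetic stand-in for measured times

a2, a1, a0 = np.polyfit(q, t, deg=2)
print(f"constant={a0:.2e}  linear={a1:.2e}  quadratic={a2:.2e}")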
Cost analysis on Blue Waters
• Blue Waters provided a unique setting for two aspects:
  1. Model MPI queueing times
  2. Model network contention
• Network contention is costly
• Identified a hop-based model
Figure (two panels): time (seconds) vs. number of messages communicated (1 to 10,000), for message sizes from 16 bytes to 262,144 bytes; the accompanying network diagram is labeled G0–G3.
Cost analysis on Blue Waters
• Blue Waters provided a unique setting for two aspects:
  1. Model MPI queueing times
  2. Model network contention
Figure: time per level in the AMG hierarchy (levels 0–6), comparing measured times against the Max-Rate model plus Queue Search and Contention terms.
Improving Performance Models for Irregular Point-to-Point Communication, Bienz, Gropp, Olson, in review at EuroMPI, 2018.
Summary and Ongoing Work
• Drop-in replacement for a range of sparse matrix operations (SpMV, SpMM, MIS(k), assembly operations, etc.)
• Blue Waters instrumental in testing at scale, reproducible outcomes, and accurate performance analysis
• Code base (this work): https://github.com/lukeolson/raptor
• Structured code base: https://github.com/cedar-framework/cedar
This research is part of the Blue Waters sustained petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.