Reducing Communication in Sparse Matrix Operations
2018 Blue Waters Symposium
Luke Olson, Department of Computer Science, University of Illinois at Urbana-Champaign
Collaborators on this allocation:
Amanda Bienz, University of Illinois at Urbana-Champaign
Bill Gropp, University of Illinois at Urbana-Champaign
Andrew Reisner, University of Illinois at Urbana-Champaign
Lukas Spies, University of Illinois at Urbana-Champaign
Sparse Matrix Operations
Application areas: time stepping (Figure: XPACC @ Illinois), PCA / clustering (Figure: MD Anderson), linear systems (Figure: Fischer @ Illinois), eigenanalysis (Figure: QMCpack)
Core kernels:
• w ← A v (sparse matrix-vector multiplication, SpMV)
• C ← A B
• C ← R A Rᵀ
• w ← A⁻¹ v
What is this talk about? (Why it matters)
• 10s, 100s, 1000s, … of SpMVs in a computation
• SpMV is a major kernel, but its efficiency and scalability are limited
• Use machine layout (nodes) on Blue Waters to reduce communication
• Use consistent timings on Blue Waters to develop accurate performance models
Iterative method for solving Ax = b (preconditioned conjugate gradient):
while not converged:
    α ← ⟨r, z⟩ / ⟨Ap, p⟩
    x ← x + α p
    r⁺ ← r − α A p
    z⁺ ← precond(r⁺)
    β ← ⟨r⁺, z⁺⟩ / ⟨r, z⟩
    p ← z⁺ + β p
• One SpMV per iteration; a solve takes 2 … 10 … 100 SpMVs (see the sketch below)
• For communication-avoiding (CA) algorithms, see Eller/Gropp
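A minimal sketch of the loop above in Python (NumPy/SciPy); the 1D Poisson matrix and Jacobi preconditioner are illustrative stand-ins, not the problems from this talk. It highlights the single SpMV performed each iteration.

import numpy as np
import scipy.sparse as sp

def pcg(A, b, precond, tol=1e-8, maxiter=100):
    """Preconditioned conjugate gradient; one SpMV (A @ p) per iteration."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        Ap = A @ p                      # the SpMV kernel
        alpha = rz / (Ap @ p)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = precond(r)                  # apply the preconditioner
        rz_new = r @ z
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x

# Illustrative usage: 1D Poisson with a Jacobi (diagonal) preconditioner.
n = 1000
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
Dinv = 1.0 / A.diagonal()
x = pcg(A, b, lambda r: Dinv * r)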
Anatomy of a Sparse Matrix-Vector (SpMV) product: w ← A v
Figure: matrix partitioned by rows across processes p = 0, …, 5
• Solid blocks: on-process portion
• Patterned blocks: off-process portion (requires communication of the input vector)
Figure: data layout across processes P0–P3 and where data is sent
• Basic SpMV: rows-per-process layout (see the sketch below)
• Modeling is difficult (more later)
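A minimal mpi4py sketch of the rows-per-process layout; the global matrix, the allgather of the full vector, and the variable names are assumptions for clarity (a production SpMV sends only the needed off-process entries point-to-point).

from mpi4py import MPI
import numpy as np
import scipy.sparse as sp

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

# Global 1D Poisson matrix, partitioned by contiguous blocks of rows.
N = 16 * nprocs
my_rows = np.array_split(np.arange(N), nprocs)[rank]
A_global = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(N, N), format="csr")
A_local = A_global[my_rows, :]                # this rank's rows of A
v_local = np.ones(len(my_rows))               # this rank's slice of v

# Columns outside [first, last] touch off-process vector entries and
# therefore require communication of the input vector.
first, last = my_rows[0], my_rows[-1]
cols = np.unique(A_local.indices)
off_proc_cols = cols[(cols < first) | (cols > last)]

# Simplest possible exchange (for illustration only): gather the whole vector.
v_global = np.concatenate(comm.allgather(v_local))

# Local computation: on-process plus off-process contributions.
w_local = A_local @ v_global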
Cost of a Sparse Matrix-Vector (SpMV) product
Figure: percentage of time in communication for the SpMV and the all-reduce at 500K, 100K, and 50K non-zeros per core (matrix nlpkkt240), alongside the process-to-process communication pattern (process ID vs. process ID).
Case Study: Preconditioning (Algebraic Multigrid)
• AMG: Algebraic Multigrid iteratively whittles away at the error
• Series or hierarchy of successively smaller (and more dense) sparse matrices
• SpMV dominated
Figure: AMG V-cycle with matrices A0, A1, A2, A3 and relaxation sweeps of the form x ← x + ω Aℓ r at each level on the way down and back up the hierarchy.
Average non-zeros per row (nnz / n) by level:
• Level 0: 30
• Level 1: 64
• Level 2: 66
• Level 3: 26
Case Study: Preconditioning (Algebraic Multigrid)
• MFEM discretization
• Linear elasticity
• 8192 cores, 512 nodes, 10k dof / core
Figure: time (seconds) per level in the AMG hierarchy (levels 0–25).
Smaller matrices == more communication
Observation 1: message volume between processes
Figure (two panels): maximum number of messages and maximum message size (bytes) per AMG level (levels 0–20).
1. High message volume and high message count
2. Diminishing returns with more communicating cores
3. Cost: off-node > on-node > on-socket
Observation 2: limits of communication
T = α + (ppn · s) / min(R_N, ppn · R_B)
• α: latency
• s: message size
• R_B: bandwidth between two processes
• R_N: node injection bandwidth
Modeling MPI Communication Performance on SMP Nodes: Is it Time to Retire the Ping Pong Test?, Gropp, Olson, Samfass, EuroMPI 2016.
1. High message volume and high message count
2. Diminishing returns with more communicating cores
3. Cost: off-node > on-node > on-socket
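A small helper evaluating the max-rate model above; the latency and bandwidth values are placeholders to be measured, not Blue Waters constants.

def max_rate_time(s, ppn, alpha=1e-6, R_B=1e9, R_N=8e9):
    """Modeled time T = alpha + (ppn * s) / min(R_N, ppn * R_B) for ppn
    processes on a node each sending s bytes; the parameter values here
    are illustrative placeholders, not measured Blue Waters numbers."""
    return alpha + (ppn * s) / min(R_N, ppn * R_B)

# With few processes the pair bandwidth R_B limits the rate; with many,
# the node injection bandwidth R_N becomes the bottleneck.
print(max_rate_time(s=65536, ppn=1))
print(max_rate_time(s=65536, ppn=16))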
Observation 3: node locality
Figure: message time (seconds) vs. number of bytes communicated (1 to 1e6), split into Network (PPN ≥ 4), Network (PPN < 4), On-Node, and On-Socket.
• Split messages by protocol: short, eager, rendezvous
• Partition messages by locality: on-socket, on-node, and off-node
1. High message volume and high message count
2. Diminishing returns with more communicating cores
3. Cost: off-node > on-node > on-socket
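One way to use these observations in a performance model is to select latency/bandwidth parameters by both protocol and locality, as sketched below; the protocol cutoffs and all constants are placeholders, not the measured Blue Waters values.

# Postal-style estimate T = alpha + beta * s, with (alpha, beta) chosen by
# message protocol (short/eager/rendezvous) and locality (socket/node/network).
SHORT_LIMIT = 512            # assumed protocol cutoffs (bytes)
EAGER_LIMIT = 8 * 1024

PARAMS = {                   # (alpha [s], beta [s/byte]) -- placeholder values
    ("short", "on-socket"): (4e-7, 2e-10),
    ("short", "on-node"):   (6e-7, 3e-10),
    ("short", "network"):   (2e-6, 8e-10),
    ("eager", "on-socket"): (6e-7, 2e-10),
    ("eager", "on-node"):   (9e-7, 3e-10),
    ("eager", "network"):   (3e-6, 8e-10),
    ("rend",  "on-socket"): (1e-6, 2e-10),
    ("rend",  "on-node"):   (2e-6, 3e-10),
    ("rend",  "network"):   (5e-6, 8e-10),
}

def message_time(nbytes, locality):
    proto = ("short" if nbytes <= SHORT_LIMIT
             else "eager" if nbytes <= EAGER_LIMIT else "rend")
    alpha, beta = PARAMS[(proto, locality)]
    return alpha + beta * nbytes

print(message_time(65536, "network"), message_time(65536, "on-socket"))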
Anatomy of a node-level SpMV product
Figure: six processes P0–P5 distributed across three nodes N0, N1, N2; the linear system is distributed across the processes.
Figure: w ← A v with the rows of A and the corresponding vector entries split across processes P0–P5.
Standard Communication
Figure: each process on node n (e.g., p and q) sends its messages directly across the network to processes on node m.
New Algorithm: On-Node Communication
Figure: values destined for other nodes are first gathered within node n (here, onto process p).
New Algorithm: Off-Node Communication
Figure sequence: the gathered values travel in fewer, larger messages from process p on node n to process q on node m, which then redistributes them within node m.
Node-Aware Parallel (NAP) Matrix Operation
1. Redistribute initial values
2. Inter-node communication
3. Redistribute received values
4. On-node communication
5. Local computation with the on-process, on-node, and off-node portions of the matrix
Note: step 4 and portions of step 5 can overlap with steps 1, 2, and 3
Figure: the steps illustrated with processes p and q on nodes n and m
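A minimal mpi4py sketch of this idea, assuming a shared-memory split yields the node-local communicator and rank 0 of each node acts as the aggregator; the variable names and the single-leader choice are illustrative, not RAPtor's implementation.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
# Processes sharing a node form their own communicator; its rank 0 is the "leader".
node_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
leader = node_comm.Get_rank() == 0

# Illustrative payload each process must send to some other node.
my_off_node_data = np.full(4, comm.Get_rank(), dtype="d")

# 1) Redistribute initial values: collect this node's outgoing data on its leader.
node_outgoing = node_comm.gather(my_off_node_data, root=0)

# 2) Inter-node communication: only one process per node touches the network.
#    (An allgather among leaders stands in for the targeted sends a real code uses.)
leader_comm = comm.Split(0 if leader else MPI.UNDEFINED, comm.Get_rank())
from_other_nodes = leader_comm.allgather(node_outgoing) if leader else None

# 3)-4) Redistribute received values to the processes on this node.
from_other_nodes = node_comm.bcast(from_other_nodes, root=0)

# 5) Local computation would now combine the on-process, on-node, and
#    off-node portions of the matrix with the corresponding vector entries.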
Case Study: Preconditioning (Algebraic Multigrid)
Maximum number of messages sent from any process on 16,384 processes
Figure (two panels): maximum number of on-node messages and of off-node messages per AMG level (levels 0–20), reference SpMV vs. TAPSpMV.
Case Study: Preconditioning (Algebraic Multigrid)
Maximum size of messages sent from any process on 16,384 processes
Figure (two panels): maximum on-node message size and off-node message size (bytes) per AMG level (levels 0–20), reference SpMV vs. TAPSpMV.
Case Study: Preconditioning (Algebraic Multigrid)
Figure (two panels): total SpMV time per AMG level (levels 0–20) and strong scaling of total time vs. number of processes (up to ~18,000), reference SpMV vs. TAPSpMV.
Node aware sparse matrix-vector multiplication, Bienz, Gropp, Olson, in review at JPDC, 2018 (arXiv).
Cost analysis on Blue Waters
• Blue Waters provided a unique setting for two aspects:
  1. Model MPI queueing times
  2. Model network contention
Figure (two panels): time (seconds) vs. number of messages communicated (1 to 10,000), for message sizes from 16 bytes to 262,144 bytes.
• The MPI Irecv message queue is costly
• Identified a cost that grows quadratically with the number of messages
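One illustrative way to expose such a cost is to fit measured times against the number of posted receives and inspect the quadratic term; the "measurements" below are synthetic placeholders, not Blue Waters data.

import numpy as np

# Fit T(q) = a0 + a1*q + a2*q^2, where q is the number of queued messages;
# a clearly nonzero a2 signals a quadratic queue-search cost.
q = np.array([1, 10, 100, 1000, 10000], dtype=float)
t = 1e-6 + 2e-9 * q + 5e-13 * q**2        # synthetic stand-in for measured times

a2, a1, a0 = np.polyfit(q, t, deg=2)
print(f"constant={a0:.2e}  linear={a1:.2e}  quadratic={a2:.2e}")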
Cost analysis on Blue Waters
• Blue Waters provided a unique setting for two aspects:
  1. Model MPI queueing times
  2. Model network contention
• Network contention is costly
• Identified a hop-based model
Figure (two panels): time (seconds) vs. number of messages communicated (1 to 10,000), for message sizes from 16 bytes to 262,144 bytes; the accompanying network diagram is labeled G0–G3.
Cost analysis on Blue Waters
• Blue Waters provided a unique setting for two aspects:
  1. Model MPI queueing times
  2. Model network contention
Figure: time per level in the AMG hierarchy (levels 0–6), comparing measured times against the Max-Rate model plus Queue Search and Contention terms.
Improving Performance Models for Irregular Point-to-Point Communication, Bienz, Gropp, Olson, in review at EuroMPI, 2018.
Summary and Ongoing Work
• Drop-in replacement for a range of sparse matrix operations (SpMV, SpMM, MIS(k), assembly operations, etc.)
• Blue Waters instrumental in testing at scale, reproducible outcomes, and accurate performance analysis
• Code base (this work): https://github.com/lukeolson/raptor
• Structured code base: https://github.com/cedar-framework/cedar
This research is part of the Blue Waters sustained petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.