Improving communication performance in dense linear...

Post on 31-Oct-2019

13 views 0 download

Transcript of Improving communication performance in dense linear...

Collective communication2.5D algorithms

Modelling exascale

Improving communication performance in denselinear algebra via topology-aware collectives

Edgar Solomonik∗, Abhinav Bhatele†, and James Demmel∗

∗ University of California, Berkeley† Lawrence Livermore National Laboratory

Supercomputing, November 2011

Edgar Solomonik Mapping dense linear algebra 1/ 29

Collective communication2.5D algorithms

Modelling exascale

Outline

Collective communicationRectangular collectives

2.5D algorithms2.5D Matrix Multiplication2.5D LU factorization

Modelling exascaleMulticast performanceMM and LU performance

Edgar Solomonik Mapping dense linear algebra 2/ 29

Collective communication2.5D algorithms

Modelling exascaleRectangular collectives

Performance of multicast (BG/P vs Cray)

128

256

512

1024

2048

4096

8192

8 64 512 4096

Ban

dwid

th (M

B/s

ec)

#nodes

1 MB multicast on BG/P, Cray XT5, and Cray XE6

BG/PXE6XT5

Edgar Solomonik Mapping dense linear algebra 3/ 29

Collective communication2.5D algorithms

Modelling exascaleRectangular collectives

Why the performance discrepancy in multicasts?

I Cray machines use binomial multicastsI Form spanning tree from a list of nodesI Route copies of message down each branchI Network contention degrades utilization on a 3D torus

I BG/P uses rectangular multicastsI Require network topology to be a k-ary n-cubeI Form 2n edge-disjoint spanning trees

I Route in different dimensional orderI Use both directions of bidirectional network

Edgar Solomonik Mapping dense linear algebra 4/ 29

Collective communication2.5D algorithms

Modelling exascaleRectangular collectives

2D rectangular multicasts trees

root2D 4X4 Torus Spanning tree 1 Spanning tree 2

Spanning tree 3 Spanning tree 4 All 4 trees combined

[Watts and Van De Geijn 95]

Edgar Solomonik Mapping dense linear algebra 5/ 29

Collective communication2.5D algorithms

Modelling exascaleRectangular collectives

Another look at that first plot

How much better are rectangularalgorithms on P = 4096 nodes?

I Binomial collectives on XE6I 1/30th of link

bandwidth

I Rectangular collectives onBG/P

I 4X the link bandwidth

I 120X improvement inefficiency!

How can we apply this?

128

256

512

1024

2048

4096

8192

8 64 512 4096

Ban

dwid

th (M

B/s

ec)

#nodes

1 MB multicast on BG/P, Cray XT5, and Cray XE6

BG/PXE6XT5

Edgar Solomonik Mapping dense linear algebra 6/ 29

Collective communication2.5D algorithms

Modelling exascale

2.5D Matrix Multiplication2.5D LU factorization

Matrix multiplication

A

BA

B

A

B

AB

Edgar Solomonik Mapping dense linear algebra 7/ 29

Collective communication2.5D algorithms

Modelling exascale

2.5D Matrix Multiplication2.5D LU factorization

2D matrix multiplication

A

BA

B

A

B

AB

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

16 CPUs (4x4)

[Cannon 69], [Van De Geijn and Watts 97]

Edgar Solomonik Mapping dense linear algebra 8/ 29

Collective communication2.5D algorithms

Modelling exascale

2.5D Matrix Multiplication2.5D LU factorization

3D matrix multiplication

A

BA

B

A

B

AB

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

64 CPUs (4x4x4)

4 copies of matrices

[Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Bernsten 89]

Edgar Solomonik Mapping dense linear algebra 9/ 29

Collective communication2.5D algorithms

Modelling exascale

2.5D Matrix Multiplication2.5D LU factorization

2.5D matrix multiplication

A

BA

B

A

B

AB

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

32 CPUs (4x4x2)

2 copies of matrices

Edgar Solomonik Mapping dense linear algebra 10/ 29

Collective communication2.5D algorithms

Modelling exascale

2.5D Matrix Multiplication2.5D LU factorization

Strong scaling matrix multiplication

0

20

40

60

80

100

256 512 1024 2048

Per

cent

age

of m

achi

ne p

eak

#nodes

2.5D MM on BG/P (n=65,536)

2.5D SUMMA2D SUMMA

ScaLAPACK PDGEMM

Edgar Solomonik Mapping dense linear algebra 11/ 29

Collective communication2.5D algorithms

Modelling exascale

2.5D Matrix Multiplication2.5D LU factorization

2.5D MM on 65,536 cores

0

20

40

60

80

100

8192 131072

Per

cent

age

of m

achi

ne p

eak

n

Matrix multiplication on 16,384 nodes of BG/P

12X faster

2.7X faster

Using 16 matrix copies

2D MM2.5D MM

Edgar Solomonik Mapping dense linear algebra 12/ 29

Collective communication2.5D algorithms

Modelling exascale

2.5D Matrix Multiplication2.5D LU factorization

Cost breakdown of MM on 65,536 cores

0

0.2

0.4

0.6

0.8

1

1.2

1.4

n=8192, 2D

n=8192, 2.5D

n=131072, 2D

n=131072, 2.5D

Exe

cutio

n tim

e no

rmal

ized

by

2D

Matrix multiplication on 16,384 nodes of BG/P

95% reduction in comm computationidle

communication

Edgar Solomonik Mapping dense linear algebra 13/ 29

Collective communication2.5D algorithms

Modelling exascale

2.5D Matrix Multiplication2.5D LU factorization

2.5D LU factorization

L₀₀

U₀₀

U₀₃

U₀₃

U₀₁

L₂₀L₃₀

L₁₀

(A)

U₀₀

U₀₀

L₀₀

L₀₀

U₀₀

L₀₀

Edgar Solomonik Mapping dense linear algebra 14/ 29

Collective communication2.5D algorithms

Modelling exascale

2.5D Matrix Multiplication2.5D LU factorization

2.5D LU factorization

L₀₀

U₀₀

U₀₃

U₀₃

U₀₁

L₂₀L₃₀

L₁₀

(A)

(B)

U₀₀

U₀₀

L₀₀

L₀₀

U₀₀

L₀₀

Edgar Solomonik Mapping dense linear algebra 15/ 29

Collective communication2.5D algorithms

Modelling exascale

2.5D Matrix Multiplication2.5D LU factorization

2.5D LU factorization

Look at how thisupdate is distributed.

A

BA

B

A

B

AB

Same 3D updatein multiplication

Edgar Solomonik Mapping dense linear algebra 16/ 29

Collective communication2.5D algorithms

Modelling exascale

2.5D Matrix Multiplication2.5D LU factorization

2.5D LU factorization

L₀₀

U₀₀

U₀₃

U₀₃

U₀₁

L₂₀L₃₀

L₁₀

(A)

(B)

U

L

(C)(D)

U₀₀

U₀₀

L₀₀

L₀₀

U₀₀

L₀₀

[Solomonik and Demmel, EuroPar ’11, Distinguished Paper]

Edgar Solomonik Mapping dense linear algebra 17/ 29

Collective communication2.5D algorithms

Modelling exascale

2.5D Matrix Multiplication2.5D LU factorization

2.5D LU on 65,536 cores

0

20

40

60

80

100

NO-pivot 2D

NO-pivot 2.5D

CA-pivot 2D

CA-pivot 2.5D

Tim

e (s

ec)

LU on 16,384 nodes of BG/P (n=131,072)

2X faster

2X faster

computeidle

communication

Edgar Solomonik Mapping dense linear algebra 18/ 29

Collective communication2.5D algorithms

Modelling exascale

2.5D Matrix Multiplication2.5D LU factorization

Rectangular (RCT) vs binomial (BNM) collectives

0

10

20

30

40

50

60

70

80

MM LU LU+PVT

Per

cent

age

of m

achi

ne p

eak

Binomial vs rectangular collectives on BG/P (n=131,072, p=16,384)

2D BNM2.5D BNM2.5D RCT

Edgar Solomonik Mapping dense linear algebra 19/ 29

Collective communication2.5D algorithms

Modelling exascale

Multicast performanceMM and LU performance

A model for rectangular multicasts

tmcast = m/Bn + 2(d + 1) · o + 3L + d · P1/d · (2o + L)

Our multicast model consists of 3 terms

1. m/Bn, the bandwidth cost

2. 2(d + 1) · o + 3L, the multicast start-up overhead

3. d · P1/d · (2o + L), the path overhead

Edgar Solomonik Mapping dense linear algebra 20/ 29

Collective communication2.5D algorithms

Modelling exascale

Multicast performanceMM and LU performance

A model for binomial multicasts

tbnm = log2(P) · (m/Bn + 2o + L)

I The root of the binomial tree sends log2(P) copies of message

I The setup overhead is overlapped with the path overhead

I We assume no contention

Edgar Solomonik Mapping dense linear algebra 21/ 29

Collective communication2.5D algorithms

Modelling exascale

Multicast performanceMM and LU performance

Model verification: one dimension

200

400

600

800

1000

1 8 64 512 4096 32768 262144

Ban

dwid

th (M

B/s

ec)

msg size (KB)

DCMF Broadcast on a ring of 8 nodes of BG/P

trect modelDCMF rectangle dput

tbnm modelDCMF binomial

Edgar Solomonik Mapping dense linear algebra 22/ 29

Collective communication2.5D algorithms

Modelling exascale

Multicast performanceMM and LU performance

Model verification: two dimensions

0

500

1000

1500

2000

1 8 64 512 4096 32768 262144

Ban

dwid

th (M

B/s

ec)

msg size (KB)

DCMF Broadcast on 64 (8x8) nodes of BG/P

trect modelDCMF rectangle dput

tbnm modelDCMF binomial

Edgar Solomonik Mapping dense linear algebra 23/ 29

Collective communication2.5D algorithms

Modelling exascale

Multicast performanceMM and LU performance

Model verification: three dimensions

0

500

1000

1500

2000

2500

3000

1 8 64 512 4096 32768 262144

Ban

dwid

th (M

B/s

ec)

msg size (KB)

DCMF Broadcast on 512 (8x8x8) nodes of BG/P

trect modelFaraj et al data

DCMF rectangle dputtbnm model

DCMF binomial

Edgar Solomonik Mapping dense linear algebra 24/ 29

Collective communication2.5D algorithms

Modelling exascale

Multicast performanceMM and LU performance

Modelling collectives at exascale (p = 262, 144)

0

20

40

60

80

100

120

140

160

180

200

0.125 1 8 64 512 4096 32768 262144

Ban

dwid

th (G

B/s

ec)

msg size (MB)

Exascale broadcast performance

trect 6Dtrect 5Dtrect 4Dtrect 3Dtrect 2Dtrect 1Dtbnm 3D

Edgar Solomonik Mapping dense linear algebra 25/ 29

Collective communication2.5D algorithms

Modelling exascale

Multicast performanceMM and LU performance

Modelling matrix multiplication at exascale

0.4

0.5

0.6

0.7

0.8

0.9

1

1 8 64

Par

alle

l effi

cien

cy

z dimension of partition

MM strong scaling at exascale (xy plane to full xyz torus)

2.5D with rectangular (c=z)2.5D with binomial (c=z)

2D with binomial

Edgar Solomonik Mapping dense linear algebra 26/ 29

Collective communication2.5D algorithms

Modelling exascale

Multicast performanceMM and LU performance

Modelling LU factorization at exascale

0

0.2

0.4

0.6

0.8

1

1 8 64

Par

alle

l effi

cien

cy

z dimension of partition

LU strong scaling at exascale (xy plane to full xyz torus)

2.5D with rectangular (c=z)2.5D with binomial (c=z)

2D LU with binomial

Edgar Solomonik Mapping dense linear algebra 27/ 29

Collective communication2.5D algorithms

Modelling exascale

Multicast performanceMM and LU performance

Conclusion

I Topology-aware schedulingI Present in IBM BG but not in Cray supercomputersI Avoids network contention/congestionI Enables optimized communication collectivesI Leads to simple communication performance models

I Future workI An automated framework for topology-aware mappingI Tensor computations mappingI Better models for network contention

Edgar Solomonik Mapping dense linear algebra 28/ 29

Collective communication2.5D algorithms

Modelling exascale

Multicast performanceMM and LU performance

Acknowledgements

I Krell CSGF DOE fellowship (DE-AC02-06CH11357)I Resources at Argonne National Lab and Lawrence Berkeley

National LabI DE-SC0003959I DE-SC0004938I DE-FC02-06-ER25786I DE-SC0001845I DE-AC02-06CH11357

I Berkeley ParLab fundingI Microsoft (Award #024263) and Intel (Award #024894)I U.C. Discovery (Award #DIG07-10227)

I Released by Lawrence Livermore National Laboratory asLLNL-PRES-514231

Edgar Solomonik Mapping dense linear algebra 29/ 29

Reductions

Backup slides

Edgar Solomonik Mapping dense linear algebra 30/ 29

Reductions

A model for rectangular reductions

tred = max[m/(8γ), 3m/β,m/Bn]+2(d+1)·o+3L+d ·P1/d ·(2o+L)

I Any multicast tree can be inverted to produce a reduction treeI The reduction operator must be applied at each node

I each node operates on 2m dataI both the memory bandwidth and computation cost can be

overlapped

Edgar Solomonik Mapping dense linear algebra 31/ 29

Reductions

Rectangular reduction performance on BG/P

50

100

150

200

250

300

350

400

8 64 512

Ban

dwid

th (M

B/s

ec)

#nodes

DCMF Reduce peak bandwidth (largest message size)

torus rectangle ringtorus binomial

torus rectangletorus short binomial

BG/P rectangular reduction performs significantly worse thanmulticast

Edgar Solomonik Mapping dense linear algebra 32/ 29

Reductions

Performance of custom line reduction

100

200

300

400

500

600

700

800

512 4096 32768

Ban

dwid

th (M

B/s

ec)

msg size (KB)

Performance of custom Reduce/Multicast on 8 nodes

MPI BroadcastCustom Ring Multicast

Custom Ring Reduce 2Custom Ring Reduce 1

MPI Reduce

Edgar Solomonik Mapping dense linear algebra 33/ 29

Reductions

A new LU latency lower bound

k₁

k₀

k₂

k₃

k₄

k

A₀₀

A₂₂

A₃₃

A₄₄

A

n

n

critical path

d-1,d-1d-1

A₁₁

flops lower bound requires d = Ω(√p) blocks/messages

bandwidth lower bound required d = Ω(√cp) blocks/messages

Edgar Solomonik Mapping dense linear algebra 34/ 29

Reductions

2.5D LU strong scaling (without pivoting)

0

20

40

60

80

100

256 512 1024 2048

Per

cent

age

of m

achi

ne p

eak

#nodes

LU without pivoting on BG/P (n=65,536)

ideal scaling2.5D LU

2D LU

Edgar Solomonik Mapping dense linear algebra 35/ 29

Reductions

Strong scaling of 2.5D LU with tournament pivoting

0

20

40

60

80

100

256 512 1024 2048

Per

cent

age

of m

achi

ne p

eak

#nodes

LU with tournament pivoting on BG/P (n=65,536)

ideal scaling2.5D LU

2D LUScaLAPACK PDGETRF

Edgar Solomonik Mapping dense linear algebra 36/ 29