Improving communication performance in dense linear...

Collective communication2.5D algorithms

Modelling exascale

Improving communication performance in denselinear algebra via topology-aware collectives

Edgar Solomonik∗, Abhinav Bhatele†, and James Demmel∗

∗ University of California, Berkeley† Lawrence Livermore National Laboratory

Supercomputing, November 2011

Edgar Solomonik Mapping dense linear algebra 1/ 29

Modelling exascale

Outline

Collective communicationRectangular collectives

2.5D algorithms2.5D Matrix Multiplication2.5D LU factorization

Modelling exascaleMulticast performanceMM and LU performance

Modelling exascaleRectangular collectives

Performance of multicast (BG/P vs Cray)

8 64 512 4096

#nodes

1 MB multicast on BG/P, Cray XT5, and Cray XE6

BG/PXE6XT5

Why the performance discrepancy in multicasts?

I Cray machines use binomial multicastsI Form spanning tree from a list of nodesI Route copies of message down each branchI Network contention degrades utilization on a 3D torus

I BG/P uses rectangular multicastsI Require network topology to be a k-ary n-cubeI Form 2n edge-disjoint spanning trees

I Route in different dimensional orderI Use both directions of bidirectional network

2D rectangular multicasts trees

root2D 4X4 Torus Spanning tree 1 Spanning tree 2

Spanning tree 3 Spanning tree 4 All 4 trees combined

[Watts and Van De Geijn 95]

Another look at that first plot

How much better are rectangularalgorithms on P = 4096 nodes?

I Binomial collectives on XE6I 1/30th of link

bandwidth

I Rectangular collectives onBG/P

I 4X the link bandwidth

I 120X improvement inefficiency!

How can we apply this?

8 64 512 4096

#nodes

1 MB multicast on BG/P, Cray XT5, and Cray XE6

BG/PXE6XT5

Modelling exascale

2.5D Matrix Multiplication2.5D LU factorization

Matrix multiplication

Modelling exascale

2D matrix multiplication

CPU CPU CPU CPU

16 CPUs (4x4)

[Cannon 69], [Van De Geijn and Watts 97]

Modelling exascale

3D matrix multiplication

CPU CPU CPU CPU

64 CPUs (4x4x4)

4 copies of matrices

[Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Bernsten 89]

Modelling exascale

2.5D matrix multiplication

CPU CPU CPU CPU

32 CPUs (4x4x2)

2 copies of matrices

Modelling exascale

Strong scaling matrix multiplication

256 512 1024 2048

#nodes

2.5D MM on BG/P (n=65,536)

2.5D SUMMA2D SUMMA

ScaLAPACK PDGEMM

Modelling exascale

2.5D MM on 65,536 cores

8192 131072

Matrix multiplication on 16,384 nodes of BG/P

12X faster

2.7X faster

Using 16 matrix copies

2D MM2.5D MM

Modelling exascale

Cost breakdown of MM on 65,536 cores

n=8192, 2D

n=8192, 2.5D

n=131072, 2D

n=131072, 2.5D

Matrix multiplication on 16,384 nodes of BG/P

95% reduction in comm computationidle

communication

Modelling exascale

2.5D LU factorization

L₀₀

U₀₀

U₀₃

U₀₁

L₂₀L₃₀

L₁₀

U₀₀

L₀₀

U₀₀

L₀₀

Modelling exascale

L₀₀

U₀₀

U₀₃

U₀₁

L₂₀L₃₀

L₁₀

U₀₀

L₀₀

U₀₀

L₀₀

Modelling exascale

Look at how thisupdate is distributed.

Same 3D updatein multiplication

Modelling exascale

L₀₀

U₀₀

U₀₃

U₀₁

L₂₀L₃₀

L₁₀

(C)(D)

U₀₀

L₀₀

U₀₀

L₀₀

[Solomonik and Demmel, EuroPar ’11, Distinguished Paper]

Modelling exascale

2.5D LU on 65,536 cores

NO-pivot 2D

NO-pivot 2.5D

CA-pivot 2D

CA-pivot 2.5D

LU on 16,384 nodes of BG/P (n=131,072)

2X faster

computeidle

communication

Modelling exascale

Rectangular (RCT) vs binomial (BNM) collectives

MM LU LU+PVT

Binomial vs rectangular collectives on BG/P (n=131,072, p=16,384)

2D BNM2.5D BNM2.5D RCT

Modelling exascale

Multicast performanceMM and LU performance

A model for rectangular multicasts

tmcast = m/Bn + 2(d + 1) · o + 3L + d · P1/d · (2o + L)

Our multicast model consists of 3 terms

1. m/Bn, the bandwidth cost

2. 2(d + 1) · o + 3L, the multicast start-up overhead

3. d · P1/d · (2o + L), the path overhead

Modelling exascale

A model for binomial multicasts

tbnm = log2(P) · (m/Bn + 2o + L)

I The root of the binomial tree sends log2(P) copies of message

I The setup overhead is overlapped with the path overhead

I We assume no contention

Modelling exascale

Model verification: one dimension

1 8 64 512 4096 32768 262144

msg size (KB)

DCMF Broadcast on a ring of 8 nodes of BG/P

trect modelDCMF rectangle dput

tbnm modelDCMF binomial

Modelling exascale

Model verification: two dimensions

1 8 64 512 4096 32768 262144

msg size (KB)

DCMF Broadcast on 64 (8x8) nodes of BG/P

trect modelDCMF rectangle dput

tbnm modelDCMF binomial

Modelling exascale

Model verification: three dimensions

1 8 64 512 4096 32768 262144

msg size (KB)

DCMF Broadcast on 512 (8x8x8) nodes of BG/P

trect modelFaraj et al data

DCMF rectangle dputtbnm model

DCMF binomial

Modelling exascale

Modelling collectives at exascale (p = 262, 144)

0.125 1 8 64 512 4096 32768 262144

msg size (MB)

Exascale broadcast performance

trect 6Dtrect 5Dtrect 4Dtrect 3Dtrect 2Dtrect 1Dtbnm 3D

Modelling exascale

Modelling matrix multiplication at exascale

1 8 64

l effi

z dimension of partition

MM strong scaling at exascale (xy plane to full xyz torus)

2.5D with rectangular (c=z)2.5D with binomial (c=z)

2D with binomial

Modelling exascale

Modelling LU factorization at exascale

1 8 64

l effi

z dimension of partition

LU strong scaling at exascale (xy plane to full xyz torus)

2.5D with rectangular (c=z)2.5D with binomial (c=z)

2D LU with binomial

Modelling exascale

Conclusion

I Topology-aware schedulingI Present in IBM BG but not in Cray supercomputersI Avoids network contention/congestionI Enables optimized communication collectivesI Leads to simple communication performance models

I Future workI An automated framework for topology-aware mappingI Tensor computations mappingI Better models for network contention

Modelling exascale

Acknowledgements

I Krell CSGF DOE fellowship (DE-AC02-06CH11357)I Resources at Argonne National Lab and Lawrence Berkeley

National LabI DE-SC0003959I DE-SC0004938I DE-FC02-06-ER25786I DE-SC0001845I DE-AC02-06CH11357

I Berkeley ParLab fundingI Microsoft (Award #024263) and Intel (Award #024894)I U.C. Discovery (Award #DIG07-10227)

I Released by Lawrence Livermore National Laboratory asLLNL-PRES-514231

Reductions

Backup slides

Reductions

A model for rectangular reductions

tred = max[m/(8γ), 3m/β,m/Bn]+2(d+1)·o+3L+d ·P1/d ·(2o+L)

I Any multicast tree can be inverted to produce a reduction treeI The reduction operator must be applied at each node

I each node operates on 2m dataI both the memory bandwidth and computation cost can be

overlapped

Reductions

Rectangular reduction performance on BG/P

8 64 512

#nodes

DCMF Reduce peak bandwidth (largest message size)

torus rectangle ringtorus binomial

torus rectangletorus short binomial

BG/P rectangular reduction performs significantly worse thanmulticast

Reductions

Performance of custom line reduction

512 4096 32768

msg size (KB)

Performance of custom Reduce/Multicast on 8 nodes

MPI BroadcastCustom Ring Multicast

Custom Ring Reduce 2Custom Ring Reduce 1

MPI Reduce

Reductions

A new LU latency lower bound

A₀₀

A₂₂

A₃₃

A₄₄

critical path

d-1,d-1d-1

A₁₁

flops lower bound requires d = Ω(√p) blocks/messages

bandwidth lower bound required d = Ω(√cp) blocks/messages

Reductions

2.5D LU strong scaling (without pivoting)

256 512 1024 2048

#nodes

LU without pivoting on BG/P (n=65,536)

ideal scaling2.5D LU

Reductions

Strong scaling of 2.5D LU with tournament pivoting

256 512 1024 2048

#nodes

LU with tournament pivoting on BG/P (n=65,536)

ideal scaling2.5D LU

2D LUScaLAPACK PDGETRF

Improving communication performance in dense linear...

Documents

Transcript of Improving communication performance in dense linear...

Improving Query Representations for Dense Retrieval with ...

Small Dense

Finding Dense Subgraphs

國立臺灣大學htlin/course/dsa13spring/doc/0305_arr… · construction and maintenance? Lin Arrays . Dense Array dense implementation of the abstract array int dense[10] o dense

Dense Medium Cyclone

Multimi Dense

Material Datasheet DENSE AND MEDIUM DENSE SOLID … · With the CEMEX Dense and Medium Dense Coursing Bricks, you can be assured of getting the best range of masonry solutions designed

Improving Scalability of CMPs with Dense ACCs Coverage

DENSE CONNECTIVE TISSUE

ADIPOSE. Areolar RETICULAR DENSE REGULAR DENSE IRREGULAR.

Loose, Dense, Specialized

PRO-DENSE - eMedia · 87SR-0410 PRO-DENSE Injectable Graft 10CC 87SR-0420 PRO-DENSE Injectable Graft 20CC 87SR-CK15 PRO-DENSE CDK 15CC Mechanical properties at 13 and ...

Nutrient Dense Grazing

Dense in Itself

3 Breakup Dense

NutsImproveDietQualityComparedtoOtherEnergy-Dense ...

Dense Social Megastructure

Dense Matrix Algorithms

Dense Connective

Improving performance of sparse matrix dense matrix ...