Finding the Happy Medium: Tradeoffs in Communication, Algorithms, Architectures and Programming Models William Gropp www.cs.illinois.edu/~wgropp


Why is this the “Architecture” Talk?

•  Algorithms and software must acknowledge realities of architecture

•  Can encourage architecture changes

♦ But must have a compelling case: the cost of trying is high, even with simulation

•  Message:

♦ Appropriate performance models can guide development

♦ Avoid unnecessary synchronization

• Often encouraged by the programming model


Recommended Reading

•  Bit reversal on uniprocessors (Alan Karp, SIAM Review, 1996)

•  Achieving high sustained performance in an unstructured mesh CFD application (W. K. Anderson, W. D. Gropp, D. K. Kaushik, D. E. Keyes, B. F. Smith, Proceedings of Supercomputing, 1999)

•  Experimental Analysis of Algorithms (Catherine McGeoch, Notices of the American Mathematical Society, March 2001)

•  Reflections on the Memory Wall (Sally McKee, ACM Conference on Computing Frontiers, 2004)


Using Extra Computation in Time Dependent Problems

• Simple example that trades computation for a component of communication time

• Mentioned because

♦ Introduces some costs

♦ Older than MPI


Trading Computation for Communication

•  In explicit methods for time-dependent PDEs, the communication of ghost points can be a significant cost

•  For a simple 2-d problem, the communication time is roughly

♦ T = 4(s + rn) (using the “diagonal trick” for 9-point stencils)

♦ Introduces both a communication cost and a synchronization cost (more on that later)

♦ Can we do better?

1-D Time Stepping

[Figure sequence over seven slides: 1-D time stepping]

Analyzing the Cost of Redundant Computation

•  Advantage of redundant computation:

♦ Communication costs:

•  k steps, 1 step at a time: 2k(s+w)

•  k steps at once: 2(s+kw)

♦ Redundant computation is roughly

•  Ak^2c, for A operations for each eval and time c for each operation

•  Thus, redundant computation is better when

♦ Ak^2c < 2(k-1)s

•  Since s is on the order of 10^3 c, significant savings are possible
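The break-even condition above can be checked numerically. The sketch below is a minimal illustration, assuming s is the message latency, w is the transfer time for one step's ghost data (the slide does not define w, so this is an interpretation), and the values chosen for A, w and c are hypothetical.

```python
def comm_time_stepwise(k, s, w):
    """Exchange ghost data after every step: 2 messages per step, k steps."""
    return 2 * k * (s + w)


def comm_time_batched(k, s, w):
    """Exchange k layers of ghost data once: 2 messages, each k times larger."""
    return 2 * (s + k * w)


def redundant_work(k, A, c):
    """Approximate extra computation on the widened ghost region: A k^2 c."""
    return A * k**2 * c


def batching_pays_off(k, s, A, c):
    """The criterion from the slide: A k^2 c < 2 (k - 1) s."""
    return redundant_work(k, A, c) < 2 * (k - 1) * s


if __name__ == "__main__":
    c = 1.0          # time per operation (arbitrary unit)
    s = 1000.0 * c   # latency, taking s ~ 10^3 c as on the slide
    w = 10.0 * c     # hypothetical transfer time for one ghost layer
    A = 10           # hypothetical operations per stencil evaluation
    for k in (2, 4, 8, 16, 32, 256):
        saved = comm_time_stepwise(k, s, w) - comm_time_batched(k, s, w)
        print(k, saved, redundant_work(k, A, c), batching_pays_off(k, s, A, c))
```

The communication saved is always 2(k-1)s, independent of w, while the redundant work grows as k^2, so the benefit disappears for large enough k.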


Relationship to Implicit Methods

•  A single time step, for a linear PDE, can be written as

♦  u^{k+1} = A u^k

•  Similarly,

♦  u^{k+2} = A u^{k+1} = A A u^k = A^2 u^k

♦  And so on

•  Thus, this approach can be used to efficiently compute

♦  Ax, A^2 x, A^3 x, …

•  In addition, this approach can provide better temporal locality and has been developed (several times) for cache-based systems

•  Why don’t all applications do this (more later)?
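As a rough illustration of the idea, the sketch below applies k explicit steps of a 1-D stencil in two ways: on the whole array, and on two subdomains that each receive k ghost cells once and then step independently. The 3-point averaging stencil, the zero boundary values and the two-subdomain split are assumptions made for the example; they are not taken from the talk.

```python
def step(u):
    """One explicit step of a 3-point averaging stencil; cells outside u count as 0."""
    padded = [0.0] + list(u) + [0.0]
    return [0.25 * padded[i - 1] + 0.5 * padded[i] + 0.25 * padded[i + 1]
            for i in range(1, len(padded) - 1)]


def k_steps_global(u, k):
    """Reference result: k steps on the full array (this is A^k u)."""
    for _ in range(k):
        u = step(u)
    return u


def k_steps_two_domains(u, k):
    """k steps on two subdomains with a single ghost exchange of width k."""
    half = len(u) // 2
    assert 0 < k <= half
    left = u[:half + k]    # owned cells plus k ghost cells from the right neighbor
    right = u[half - k:]   # k ghost cells from the left neighbor plus owned cells
    for _ in range(k):
        # No communication inside this loop. Each step invalidates one more
        # ghost cell at the artificial cut, so the owned cells stay correct.
        left = step(left)
        right = step(right)
    return left[:half] + right[k:]


if __name__ == "__main__":
    u0 = [float(i % 5) for i in range(20)]
    print(k_steps_two_domains(u0, 3) == k_steps_global(u0, 3))
```

The ghost cells are updated redundantly by both neighbors, which is the extra computation traded for k-1 avoided exchanges.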


Using Redundant Solvers

•  AMG requires a solve on the coarse grid

•  Rather than either solve in parallel (too little work for the communication) or solve in serial and distribute solution, solve redundantly (either in smaller parallel groups or serial, as in this illustration)

Redundant Solution

At some level, gather the unknowns onto every process. That level and coarser ones then require no communication:

[Figure: AMG cycle diagram; labels: smooth, form residual; restrict to level i+1; all-gather at level l; serial AMG coarse solve; prolong to level i-1; smooth]

An analysis [17] suggests that this can be of some benefit; we will examine this further.

[17] W. Gropp, “Parallel Computing and Domain Decomposition,” 1992

Gahvari (University of Illinois), Scaling AMG, November 3, 2011, 22 / 54


Redundant Solution

• Replace communication at levels ≥ l_red with Allgather

• Every process now has complete information; no further communication needed

• Performance analysis (based on Gropp & Keyes 1989) can guide selection of l_red


Redundant Solves

Redundant Solution

When applied to model problem on Hera, there is a speedup region like for additive AMG:

[Figure: Modeled Speedups for Redundant AMG on Hera. Axes: Processes (128, 1024, 3456) and Redundant Level (2 to 8); color scale 0.0 to 2.0. Cell values in extraction order: 0.13, 1.31, 1.90, 1.40, 1.03, 1.00, 0.02, 0.12, 1.54, 1.60, 1.07, 1.00, 1.00, 0.01, 0.03, 0.31, 1.40, 1.07, 1.01, 1.00]

[Figure: Actual Speedups for Redundant AMG on Hera. Same axes and color scale. Cell values in extraction order: 1.18, 1.62, 1.25, 1.40, 1.42, 1.07, 0.25, 1.27, 1.04]

Diagonal pattern of speedup region, however, still persists. LLNL is currently in the process of putting redundant solve/setup in hypre.

Gahvari (University of Illinois), Scaling AMG, November 3, 2011, 42 / 54

•  Applied to Hera at LLNL, provides significant speedup

•  Thanks to Hormozd Gahvari


Contention and Communication

• The simple model T = s + rn (and slightly more detailed LogP models) can give good guidance, but it ignores some important effects

• One example is regular mesh halo exchange


Mesh Exchange

• Exchange data on a mesh


Nonblocking Receive and Blocking Sends

•  Do i=1,n_neighbors
      ! Post all receives first so incoming data has a destination
      Call MPI_Irecv(inedge(1,i), len, MPI_REAL, nbr(i), tag, &
                     comm, requests(i), ierr)
   Enddo
   Do i=1,n_neighbors
      ! Blocking sends, issued in neighbor order
      Call MPI_Send(edge(1,i), len, MPI_REAL, nbr(i), tag, &
                    comm, ierr)
   Enddo
   Call MPI_Waitall(n_neighbors, requests, statuses, ierr)

•  Does not perform well in practice (at least on BG, SP). Why?


Understanding the Behavior: Timing Model

• Sends interleave

• Sends block (data larger than buffering will allow)

• Sends control timing

• Receives do not interfere with Sends

• Exchange can be done in 4 steps (down, right, up, left)

Mesh Exchange - Steps 1 to 6

• Exchange data on a mesh

[Figure sequence: one slide per step, six steps in total]

Timeline from IBM SP

• Note that process 1 finishes last, as predicted


Distribution of Sends


Why Six Steps?

•  Ordering of Sends introduces delays when there is contention at the receiver

•  Takes roughly twice as long as it should

•  Bandwidth is being wasted

•  Same thing would happen if using memcpy and shared memory

•  The interference of communication is why adding an MPI_Barrier (normally an unnecessary operation that reduces performance) can occasionally increase performance. But don’t add MPI_Barrier to your code, please :)


Thinking about Broadcasts

•  MPI_Bcast( buf, 100000, MPI_DOUBLE, … );

•  Use a tree-based distribution:

•  Use a pipeline: send the message in b byte pieces. This allows each subtree to begin communication after b bytes are sent

•  Improves total performance:

♦  Root process takes same time (asymptotically)

♦  Other processes wait less

•  Time to reach leaf is b log p + (n-b), rather than n log p

•  Special hardware and other algorithms can be used …
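The two leaf-arrival estimates can be compared directly. This is a small sketch of the formulas on the slide, assuming a base-2 logarithm and times measured in units of the per-byte transfer time; the message size, piece size and process count in the example are hypothetical.

```python
import math


def tree_bcast_time(n, p):
    """Leaf arrival time for a tree broadcast that forwards whole messages."""
    return n * math.log2(p)


def pipelined_bcast_time(n, b, p):
    """Leaf arrival time when the message is forwarded in b-byte pieces."""
    return b * math.log2(p) + (n - b)


if __name__ == "__main__":
    n = 100000 * 8   # 100000 doubles, as in the MPI_Bcast call on the slide
    p = 1024         # hypothetical process count
    for b in (1024, 8192, 65536, n):
        print(b, pipelined_bcast_time(n, b, p), tree_bcast_time(n, p))
```

With b = n the pipeline degenerates to the plain tree, so the two estimates agree in that case.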


Make Full Use of the Network

•  Implement MPI_Bcast(buf,n,…) as
      MPI_Scatter(buf, n/p,…, buf+rank*n/p,…)
      MPI_Allgather(buf+rank*n/p, n/p,…,buf,…)

[Figure: processes P0 to P7]


Optimal Algorithm Costs

•  Optimal cost is O(n) (O(p) terms don’t involve n) since scatter moves n data, and allgather also moves only n per process; these can use pipelining to move data as well

♦  Scatter by recursive bisection uses log p steps to move n(p-1)/p data

♦  Scatter by direct send uses p-1 steps to move n(p-1)/p data

♦  Recursive doubling allgather uses log p steps to move

•  n/p + 2n/p + 4n/p + … + (p/2)n/p = n(p-1)/p

♦  Bucket brigade allgather moves

•  n/p (p-1) times, or (p-1)n/p

•  See, e.g., van de Geijn for more details
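The data volumes above can be confirmed with a small simulation. The sketch below models a broadcast from rank 0 as a scatter followed by a bucket brigade (ring) allgather, using plain Python lists in place of MPI buffers. The function name and the requirement that p divides n are assumptions of the example.

```python
def bcast_scatter_allgather(buf, p):
    """Simulate a broadcast from rank 0 as scatter + ring allgather.

    Returns (per-rank buffers, items received by each rank in the allgather).
    """
    n = len(buf)
    assert n % p == 0, "sketch assumes p divides n"
    chunk = n // p
    # Scatter: rank r receives only its own piece of the root's buffer.
    local = [[None] * n for _ in range(p)]
    for r in range(p):
        local[r][r * chunk:(r + 1) * chunk] = buf[r * chunk:(r + 1) * chunk]
    received = [0] * p
    # Ring allgather: in step t, rank r forwards the piece that originated
    # at rank (r - t) mod p to its right neighbor.
    for t in range(p - 1):
        moves = []
        for r in range(p):
            piece = (r - t) % p
            lo, hi = piece * chunk, (piece + 1) * chunk
            moves.append(((r + 1) % p, lo, hi, local[r][lo:hi]))
        for dest, lo, hi, data in moves:
            local[dest][lo:hi] = data
            received[dest] += hi - lo
    return local, received


if __name__ == "__main__":
    buffers, received = bcast_scatter_allgather(list(range(16)), 4)
    print(buffers[3], received)
```

Each rank receives (p-1) pieces of size n/p in the allgather phase, matching the (p-1)n/p figure on the slide.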


Is it communication avoiding or minimum solution time?

• Example: non-minimal collective algorithms

• Work of Paul Sack; see “Faster topology-aware collective algorithms through non-minimal communication”, PPoPP 2012

• Lesson: minimum communication need not be optimal


Allgather

[Figure: allgather input and output; labels 1, 2, 3, 4]

Allgather: recursive doubling

Initial data, one block per process on a 4 x 4 grid:

a b c d
e f g h
i j k l
m n o p

After step 1:

ab ab cd cd
ef ef gh gh
ij ij kl kl
mn mn op op

After step 2:

abcd abcd abcd abcd
efgh efgh efgh efgh
ijkl ijkl ijkl ijkl
mnop mnop mnop mnop

After step 3:

abcdefgh abcdefgh abcdefgh abcdefgh
abcdefgh abcdefgh abcdefgh abcdefgh
ijklmnop ijklmnop ijklmnop ijklmnop
ijklmnop ijklmnop ijklmnop ijklmnop

After step 4: all 16 processes hold abcdefghijklmnop

T = (lg P)α + n(P-1)β
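The sequence above can be reproduced with a few lines of simulation. This sketch assumes ranks are numbered row by row on the grid and that in step k each rank exchanges everything it holds with the rank whose number differs in bit k; the function name is made up for the example.

```python
def recursive_doubling_allgather(blocks):
    """Simulate allgather by recursive doubling on len(blocks) ranks.

    Returns (data held by each rank, number of steps, blocks sent per rank).
    """
    p = len(blocks)
    assert p > 0 and (p & (p - 1)) == 0, "sketch assumes a power-of-two rank count"
    held = [[b] for b in blocks]
    sent = [0] * p
    dist, steps = 1, 0
    while dist < p:
        new_held = []
        for r in range(p):
            partner = r ^ dist
            sent[r] += len(held[r])             # everything held so far is sent
            lo, hi = min(r, partner), max(r, partner)
            new_held.append(held[lo] + held[hi])
        held = new_held
        dist *= 2                               # nearest partner first, farthest last
        steps += 1
    return held, steps, sent


if __name__ == "__main__":
    held, steps, sent = recursive_doubling_allgather(list("abcdefghijklmnop"))
    print("".join(held[5]), steps, sent[5])
```

Each rank sends 1 + 2 + 4 + 8 = 15 blocks in lg 16 = 4 steps, which is the (lg P)α + n(P-1)β cost when every block has size n. The largest message travels to the most distant partner, which is where congestion enters.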


Problem: Recursive-doubling

• No congestion model:

♦ T = (lg P)α + n(P-1)β

• Congestion on torus:

♦ T ≈ (lg P)α + (5/24) n P^{4/3} β

• Congestion on Clos network:

♦ T ≈ (lg P)α + (nP/µ)β

• Solution approach: move smallest amounts of data the longest distance

Allgather: recursive halving

Initial data, one block per process on a 4 x 4 grid:

a b c d
e f g h
i j k l
m n o p

After step 1:

ac bd ac bd
eg fh eg fh
ik jl ik jl
mo np mo np

After step 2:

acik bdjl acik bdjl
egmo fhnp egmo fhnp
acik bdjl acik bdjl
egmo fhnp egmo fhnp

After step 3:

acikbdjl acikbdjl acikbdjl acikbdjl
egmofhnp egmofhnp egmofhnp egmofhnp
acikbdjl acikbdjl acikbdjl acikbdjl
egmofhnp egmofhnp egmofhnp egmofhnp

After step 4: all 16 processes hold acikbdjlegmofhnp

T = (lg P)α + (7/6)nPβ
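A quick way to see when the reordering helps is to evaluate the cost expressions side by side. The sketch below reads the (7/6)nPβ term as the torus cost of recursive halving, which the slides suggest but do not state outright; setting α to 0 and β and n to 1 is an arbitrary choice that isolates the bandwidth terms.

```python
import math


def t_doubling_ideal(P, n, alpha, beta):
    """Recursive doubling with no congestion: (lg P) alpha + n (P - 1) beta."""
    return math.log2(P) * alpha + n * (P - 1) * beta


def t_doubling_torus(P, n, alpha, beta):
    """Recursive doubling with torus congestion: (lg P) alpha + (5/24) n P^(4/3) beta."""
    return math.log2(P) * alpha + (5.0 / 24.0) * n * P ** (4.0 / 3.0) * beta


def t_halving_torus(P, n, alpha, beta):
    """Recursive halving (assumed torus model): (lg P) alpha + (7/6) n P beta."""
    return math.log2(P) * alpha + (7.0 / 6.0) * n * P * beta


if __name__ == "__main__":
    for P in (64, 512, 4096, 32768):
        print(P,
              t_doubling_ideal(P, 1, 0.0, 1.0),
              t_doubling_torus(P, 1, 0.0, 1.0),
              t_halving_torus(P, 1, 0.0, 1.0))
```

Under these expressions the halving term stays within a constant factor of the congestion-free n(P-1)β, while the P^{4/3} term for doubling keeps growing faster, so halving wins once P is large enough.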


New problem: data misordered

• Solution: shuffle input data

♦ Could shuffle at end (redundant work; all processes shuffle)

♦ Could use non-contiguous data moves

♦ Shuffle data on network…

Solution: Input shuffle

Original input:

a b c d
e f g h
i j k l
m n o p

Shuffled input:

a e b f
i m j n
c g d h
k o l p

After step 1:

ab ef ab ef
ij mn ij mn
cd gh cd gh
kl op kl op

After step 3:

abcdefgh abcdefgh abcdefgh abcdefgh
ijklmnop ijklmnop ijklmnop ijklmnop
abcdefgh abcdefgh abcdefgh abcdefgh
ijklmnop ijklmnop ijklmnop ijklmnop

T = (1 + lg P)α + (7/6)nPβ, so T ≈ (lg P)α + (7/6)nPβ
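The misordering and its fix can both be reproduced in a short simulation. The sketch assumes row-major rank numbering on the 4 x 4 grid, so that the halving exchanges pair ranks differing by 2, 8, 1 and 4 in turn; that mask order is inferred from the diagrams and the helper names are hypothetical.

```python
def exchange_allgather(blocks, masks):
    """Each step, rank r swaps everything it holds with rank r XOR mask.

    The lower-ranked partner's data is placed first, as in the diagrams.
    """
    held = [[b] for b in blocks]
    for mask in masks:
        held = [held[min(r, r ^ mask)] + held[max(r, r ^ mask)]
                for r in range(len(blocks))]
    return held


def input_shuffle(blocks, masks):
    """Permute the input so that the exchange order yields sorted output."""
    p = len(blocks)
    # Run the exchange on rank labels to learn where each rank's block lands.
    order = exchange_allgather(list(range(p)), masks)[0]
    shuffled = [None] * p
    for position, rank in enumerate(order):
        shuffled[rank] = blocks[position]
    return shuffled


HALVING_MASKS = [2, 8, 1, 4]   # far partner first in each grid dimension

if __name__ == "__main__":
    blocks = list("abcdefghijklmnop")
    print("".join(exchange_allgather(blocks, HALVING_MASKS)[0]))
    shuffled = input_shuffle(blocks, HALVING_MASKS)
    print("".join(shuffled))
    print("".join(exchange_allgather(shuffled, HALVING_MASKS)[0]))
```

In an actual implementation the shuffle is one extra message per process, which is the additional α in T = (1 + lg P)α + (7/6)nPβ.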


Evaluation: Intrepid BlueGene/P at ANL

• 40k-node system

♦ Each is 4 x 850 MHz PowerPC 450

• 512+ nodes is 3d torus; fewer is 3d mesh

• XLC -O4

• 375 MB/s delivered per link

♦ 7% penalty using all 6 links both ways


Allgather performance


Notes on Allgather

• Bucket algorithm (not described here) exploits multiple communication engines on BG

• Analysis shows performance near optimal

• An alternative to the reorder-data step is an in-memory move; analysis shows similar performance, and measurements show the reorder step is faster on the tested systems


Synchronization and OS Noise

•  “Characterizing the Influence of System Noise on Large-Scale Applications by Simulation,” Torsten Hoefler, Timo Schneider, Andrew Lumsdaine

♦ Best Paper, SC10

• Next 5 slides based on this talk…


A Noisy Example – Dissemination Barrier

• Process 4 is delayed

♦ Noise propagates “wildly” (of course deterministic)
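How a single delay spreads through a dissemination barrier can be traced with a very small timing model. The sketch below assumes that in round k each rank sends to the rank 2^k positions ahead and waits for the rank 2^k positions behind, that a message takes a fixed time to arrive, and that sending itself is free. These are simplifications for illustration and not the LogGOPS model used in the cited work.

```python
def dissemination_barrier(start_times, msg_cost=1.0):
    """Completion time of each rank in a dissemination barrier.

    In the round with distance d, rank r sends to (r + d) % p and cannot
    continue until the message from (r - d) % p has arrived.
    """
    p = len(start_times)
    t = list(start_times)
    dist = 1
    while dist < p:
        t = [max(t[r], t[(r - dist) % p] + msg_cost) for r in range(p)]
        dist *= 2
    return t


if __name__ == "__main__":
    quiet = dissemination_barrier([0.0] * 8)
    noisy_start = [0.0] * 8
    noisy_start[4] = 10.0          # process 4 is delayed by 10 time units
    print(quiet)
    print(dissemination_barrier(noisy_start))
```

In this model a delay on one rank reaches every other rank within lg P rounds, so no rank leaves the barrier before the delayed rank has entered it.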


LogGOPS Simulation Framework

•  Detailed analytical modeling is hard!

•  Model-based (LogGOPS) simulator

♦  Available at: http://www.unixer.de/LogGOPSim

♦  Discrete-event simulation of MPI traces (<2% error) or collective operations (<1% error)

♦  > 10^6 events per second

•  Allows for trace-based noise injection

•  Validation

♦  Simulations reproduce measurements by Beckman and Ferreira well

• Details: Hoefler et al. LogGOPSim – Simulating Large-Scale Applications in the LogGOPS Model (Workshop on Large-Scale System and Application Performance, Best Paper)


Single Collective Operations and Noise

• 1 Byte, Dissemination, regular noise, 1000 Hz, 100 µs

[Figure legend: deterministic, 2nd quartile, median, 3rd quartile, outliers]


Saving Allreduce

•  One common suggestion is to avoid using Allreduce

♦  But algorithms with dot products are among the best known

♦  Can sometimes aggregate the data to reduce the number of separate Allreduce operations

♦  But better is to reduce the impact of the synchronization by hiding the Allreduce behind other operations (in MPI, using MPI_Iallreduce)

•  We can adapt CG to nonblocking Allreduce with some added floating point (but perhaps little time cost)


The Conjugate Gradient Algorithm

•  While (not converged)
      niters += 1;
      s = A * p;
      t = p' * s;
      alpha = gmma / t;
      x = x + alpha * p;
      r = r - alpha * s;
      if rnorm2 < tol2 ; break ; end
      z = M * r;
      gmmaNew = r' * z;
      beta = gmmaNew / gmma;
      gmma = gmmaNew;
      p = z + beta * p;
   end
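For readers who want to run the loop, here is a minimal serial sketch in plain Python that follows the pseudocode line by line. The initialization (x = 0, r = b, z = M r, p = z), the identity default for the preconditioner and the dense list-of-lists matrix are assumptions of the example; in a parallel code each dot product would be an Allreduce.

```python
def matvec(A, v):
    """Dense matrix-vector product on lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def axpy(alpha, x, y):
    """Return y + alpha * x."""
    return [yi + alpha * xi for xi, yi in zip(x, y)]


def cg(A, b, apply_M=lambda r: list(r), tol=1e-10, max_iters=100):
    """Preconditioned conjugate gradients, following the slide's loop."""
    x = [0.0] * len(b)
    r = list(b)                    # r = b - A*x with x = 0
    z = apply_M(r)
    p = list(z)
    gmma = dot(r, z)
    niters = 0
    while niters < max_iters:
        niters += 1
        s = matvec(A, p)
        t = dot(p, s)              # global reduction in a parallel code
        alpha = gmma / t
        x = axpy(alpha, p, x)
        r = axpy(-alpha, s, r)
        if dot(r, r) < tol * tol:  # rnorm2 < tol2
            break
        z = apply_M(r)
        gmma_new = dot(r, z)       # second global reduction
        beta = gmma_new / gmma
        gmma = gmma_new
        p = axpy(beta, p, z)       # p = z + beta * p
    return x, niters


if __name__ == "__main__":
    print(cg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

Both reductions sit on the critical path here: nothing else can proceed until t and gmmaNew are known.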


A Nonblocking Version of CG

•  While (not converged)
      niters += 1;
      s = Z + beta * s;     % Can begin p'*s
      S = M * s;
      t = p' * s;
      alpha = gmma / t;
      x = x + alpha * p;
      r = r - alpha * s;    % Can move this into the subsequent dot product
      if rnorm2 < tol2 ; break ; end
      z = z - alpha * S;    % Can begin r'*z here (also begin r'*r for convergence test)
      Z = A * z;
      gmmaNew = r' * z;
      beta = gmmaNew / gmma;
      gmma = gmmaNew;
      % Could move x = x + alpha p here to minimize p moves.
      p = z + beta * p;
   end
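The rearranged recurrences can be checked serially before any nonblocking communication is added. The sketch below follows the loop above in plain Python; the initialization (x = 0, r = b, z = M r, Z = A z, p = z, s = 0, beta = 0) is an assumption chosen so that the first pass gives s = A p, and M must be a linear operator for the update z = z - alpha S to equal M r. Comments mark where a nonblocking reduction could be started; no MPI calls are made.

```python
def matvec(A, v):
    """Dense matrix-vector product on lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def axpy(alpha, x, y):
    """Return y + alpha * x."""
    return [yi + alpha * xi for xi, yi in zip(x, y)]


def cg_rearranged(A, b, apply_M=lambda r: list(r), tol=1e-10, max_iters=100):
    """CG with s = A p and z = M r maintained by recurrences."""
    n = len(b)
    x = [0.0] * n
    r = list(b)
    z = apply_M(r)
    Z = matvec(A, z)
    p = list(z)
    s = [0.0] * n
    gmma = dot(r, z)
    beta = 0.0
    niters = 0
    while niters < max_iters:
        niters += 1
        s = axpy(beta, s, Z)       # s = Z + beta*s, equal to A*p
        S = apply_M(s)             # local work that can overlap the p'*s reduction
        t = dot(p, s)
        alpha = gmma / t
        x = axpy(alpha, p, x)
        r = axpy(-alpha, s, r)
        if dot(r, r) < tol * tol:
            break
        z = axpy(-alpha, S, z)     # z = z - alpha*S, equal to M*r for linear M
        Z = matvec(A, z)           # local work that can overlap the r'*z reduction
        gmma_new = dot(r, z)
        beta = gmma_new / gmma
        gmma = gmma_new
        p = axpy(beta, p, z)
    return x, niters


if __name__ == "__main__":
    print(cg_rearranged([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

Each reduction now has a preconditioner application or a matrix-vector product available to hide behind, at the price of the extra vector updates for s and z.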


Key Features

• Collective operations overlap significant local computation

• Trades extra local work for overlapped communication

♦ But may not need more memory loads, so performance cost may be comparable


Performance Analysis

• On a pure floating point basis, the nonblocking version requires 2 more DAXPY operations

•  A closer analysis shows that some operations can be merged, reducing the amount of memory motion

♦ Same amount of memory motion as “classic” CG


Processes and SMP nodes

•  HPC users typically believe that their code “owns” all of the cores all of the time

♦  The reality is that this was never true, but they did have all of the cores the same fraction of time when there was one core per node

•  We can use a simple performance model to check the assertion and then use measurements to identify the problem and suggest fixes.

•  Consider a simple Jacobi sweep on a regular mesh, with every core having the same amount of work. How are run times distributed?


Sharing an SMP

•  Having many cores available makes everyone think that they can use them to solve other problems (“no one would use all of them all of the time”)

•  However, compute-bound scientific calculations are often written as if all compute resources are owned by the application

•  Such static scheduling leads to performance loss

•  Pure dynamic scheduling adds overhead, but is better

•  Careful mixed strategies are even better

•  Recent results give 10-16% performance improvements on large, scalable systems

•  Thanks to Vivek Kale, EuroMPI’10


By doing the first portion of work using static scheduling and then doing the remainder of the work using dynamic scheduling, we achieve a consistent performance gain of nearly 8% over the traditional static scheduling method.

Performance of Task Scheduling Strategies (1 node)
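The effect of mixing static and dynamic scheduling can be illustrated with a toy model. Everything in the sketch below is an assumption made for illustration: tasks have unit cost, exactly one core loses a fixed amount of time to noise, and each dynamically claimed task pays a small dequeue overhead. The numbers it produces are not measurements and are unrelated to the results reported on the slide.

```python
import heapq


def makespan(n_tasks, n_cores, static_fraction, delay, dequeue_overhead=0.05):
    """Finish time of the slowest core when core 0 loses `delay` time units.

    A fraction of the tasks is split evenly up front (static); the rest sit
    in a shared queue and go to whichever core becomes free first (dynamic).
    """
    per_core_static = int(static_fraction * n_tasks) // n_cores
    n_dynamic = n_tasks - per_core_static * n_cores
    # Time at which each core finishes its statically assigned share.
    ready = [per_core_static + (delay if core == 0 else 0.0)
             for core in range(n_cores)]
    heapq.heapify(ready)
    for _ in range(n_dynamic):
        t = heapq.heappop(ready)                       # earliest free core
        heapq.heappush(ready, t + 1.0 + dequeue_overhead)
    return max(ready)


if __name__ == "__main__":
    for fraction in (1.0, 0.75, 0.5, 0.25, 0.0):
        print(fraction, makespan(64, 4, fraction, delay=4.0))
```

In this model the fully static schedule waits for the delayed core, the fully dynamic one pays the dequeue overhead on every task, and a mixed split lands in between the two costs.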


Noise Amplification with an Increasing Number of Nodes


Experiences

•  Paraphrasing either Lincoln or PT Barnum: You own some of the cores all of the time and all of the cores some of the time, but you don’t own all of the cores all of the time

•  Translation: a priori data decompositions that were effective on single core processors are no longer effective on multicore processors

•  We see this in recommendations to “leave one core to the OS”

♦  What about other users of cores, like … the runtime system?


Observations

•  Details of architecture impact performance

♦ Performance models can guide choices but must have enough (and only enough) detail

♦ These models need only enough accuracy to guide decisions; they do not need to be predictive

•  Synchronization is the enemy

•  Many techniques have been known for decades

♦ We should be asking why they aren’t used, and what role development environments should have


Some Final Questions

•  Is it communication avoiding or minimum solution time?

•  Is it communication avoiding or latency/communication hiding?

•  Is it synchronization reducing or better load balancing?

•  Is it the programming model, its implementation, or its use?

• How do we answer these questions?