Finding the Happy Medium: Tradeoffs in Communication, Algorithms,
Architectures and Programming Models
William Gropp www.cs.illinois.edu/~wgropp
Why is this the “Architecture” Talk?
• Algorithms and software must acknowledge realities of architecture
• Can encourage architecture changes
  ♦ But must have a compelling case: the cost of trying is high, even with simulation
• Message:
  ♦ Appropriate performance models can guide development
  ♦ Avoid unnecessary synchronization
    • Often encouraged by the programming model
Recommended Reading
• Bit reversal on uniprocessors (Alan Karp, SIAM Review, 1996)
• Achieving high sustained performance in an unstructured mesh CFD application (W. K. Anderson, W. D. Gropp, D. K. Kaushik, D. E. Keyes, B. F. Smith, Proceedings of Supercomputing, 1999)
• Experimental Analysis of Algorithms (Catherine McGeoch, Notices of the American Mathematical Society, March 2001)
• Reflections on the Memory Wall (Sally McKee, ACM Conference on Computing Frontiers, 2004)
Using Extra Computation in Time Dependent Problems
• Simple example that trades computation for a component of communication time
• Mentioned because
  ♦ Introduces some costs
  ♦ Older than MPI
Trading Computation for Communication
• In explicit methods for time-dependent PDEs, the communication of ghost points can be a significant cost
• For a simple 2-d problem, the communication time is roughly
  ♦ T = 4 (s + rn) (using the “diagonal trick” for 9-point stencils), where s is the per-message startup cost, r the per-word transfer time, and n the message length
  ♦ Introduces both a communication cost and a synchronization cost (more on that later)
  ♦ Can we do better?
1-D Time Stepping
[Figure sequence, slides 6 to 12: 1-D time stepping diagrams]
Analyzing the Cost of Redundant Computation
• Advantage of redundant computation:
  ♦ Communication costs:
    • k steps, 1 step at a time: $2k(s+w)$
    • k steps at once: $2(s+kw)$
  ♦ Redundant computation is roughly
    • $A k^2 c$, for A operations for each evaluation and time c for each operation
• Thus, redundant computation is better when
  ♦ $A k^2 c < 2(k-1)s$
• Since s is on the order of $10^3 c$, significant savings are possible
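The break-even condition can be checked numerically. The following is a minimal sketch of the slide's cost model, not a measurement; the parameter values (s = 1000c, w = 8c, A = 10) are illustrative assumptions chosen only to match the "s on the order of 10^3 c" remark.

```python
def comm_cost_stepwise(k, s, w):
    """Communication cost of k steps with one ghost exchange per step: 2k(s + w)."""
    return 2 * k * (s + w)


def comm_cost_batched(k, s, w):
    """Communication cost of k steps after a single, wider exchange: 2(s + kw)."""
    return 2 * (s + k * w)


def redundant_work(k, A, c):
    """Rough cost of recomputing ghost regions: A * k^2 * c."""
    return A * k * k * c


def net_saving(k, s, w, A, c):
    """Time saved by batching k steps; positive means batching wins."""
    return (comm_cost_stepwise(k, s, w)
            - comm_cost_batched(k, s, w)
            - redundant_work(k, A, c))


if __name__ == "__main__":
    c = 1.0          # time per operation (assumed unit)
    s = 1000.0 * c   # startup cost, assumed 10^3 c
    w = 8.0 * c      # per-step ghost data cost (assumed)
    A = 10           # operations per evaluation (assumed)
    for k in (2, 4, 16, 64, 198, 199):
        print(k, net_saving(k, s, w, A, c))
```

The w terms cancel, so the saving reduces to 2(k-1)s - A k^2 c, which is the inequality on the slide. Under these assumed values the saving turns negative once k grows large enough that the quadratic redundant work overtakes the linear latency saving.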
Relationship to Implicit Methods
• A single time step, for a linear PDE, can be written as
  ♦ $u_{k+1} = A u_k$
• Similarly,
  ♦ $u_{k+2} = A u_{k+1} = A A u_k = A^2 u_k$
  ♦ And so on
• Thus, this approach can be used to efficiently compute
  ♦ $Ax, A^2x, A^3x, \ldots$
• In addition, this approach can provide better temporal locality and has been developed (several times) for cache-based systems
• Why don’t all applications do this (more later)?
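A small 1-D example shows that one wide ghost exchange followed by k local steps gives the same values as k global steps. This is a sketch under assumptions: the 3-point averaging stencil and the names `step`, `advance_global`, and `advance_owned` are illustrative and not from the talk, and a list slice stands in for the ghost-cell message.

```python
def step(u):
    """One explicit update with a 3-point averaging stencil (assumed operator).

    The result is two cells shorter than the input because the end cells
    lack a neighbor on one side.
    """
    return [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]


def advance_global(u, k):
    """Reference: apply k steps to the whole array."""
    for _ in range(k):
        u = step(u)
    return u


def advance_owned(u, lo, hi, k):
    """Advance the owned cells u[lo:hi] by k steps after ONE ghost exchange.

    The slice stands in for receiving k ghost cells from each neighbor.
    Every step shrinks the valid region by one cell per side, so ghost
    values are recomputed redundantly instead of being re-communicated.
    """
    work = u[lo - k:hi + k]
    for _ in range(k):
        work = step(work)
    return work


if __name__ == "__main__":
    u0 = [float((7 * i) % 11) for i in range(32)]
    k, lo, hi = 3, 8, 16
    reference = advance_global(u0, k)
    # Cell j of the reference corresponds to original index j + k.
    print(advance_owned(u0, lo, hi, k) == reference[lo - k:hi - k])
```

The local result has exactly hi - lo cells, one per owned cell, and each is computed with the same arithmetic as in the global sweep.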
Using Redundant Solvers
• AMG requires a solve on the coarse grid
• Rather than either solve in parallel (too little work for the communication) or solve in serial and distribute solution, solve redundantly (either in smaller parallel groups or serial, as in this illustration)
Redundant Solution
At some level, gather the unknowns onto every process. That level and coarser ones then require no communication.
[Figure: AMG cycle diagram; labels: serial AMG coarse solve; all-gather at level l; smooth, form residual; restrict to level i+1; prolong to level i-1; smooth]
An analysis [17] suggests that this can be of some benefit; we will examine this further.
[17] W. Gropp, “Parallel Computing and Domain Decomposition,” 1992
Gahvari (University of Illinois) Scaling AMG November 3, 2011 22 / 54
Redundant Solution
• Replace communication at levels ≥ l_red with Allgather
• Every process now has complete information; no further communication needed
• Performance analysis (based on Gropp & Keyes 1989) can guide selection of l_red
Redundant Solves
Redundant Solution
When applied to the model problem on Hera, there is a speedup region like that for additive AMG.
[Figure: Modeled Speedups for Redundant AMG on Hera. x-axis: Processes (128, 1024, 3456); y-axis: Redundant Level (2 to 8); color scale 0.0 to 2.0; cell values range from 0.01 to 1.90]
[Figure: Actual Speedups for Redundant AMG on Hera. Same axes and color scale; cell values range from 0.25 to 1.62]
The diagonal pattern of the speedup region, however, still persists. LLNL is currently in the process of putting redundant solve/setup in hypre.
• Applied to Hera at LLNL, provides significant speedup
• Thanks to Hormozd Gahvari
Contention and Communication
• The simple model T = s + rn (and slightly more detailed LogP models) can give good guidance but ignores some important effects
• One example is regular mesh halo exchange
Mesh Exchange
• Exchange data on a mesh
Nonblocking Receive and Blocking Sends
    Do i=1,n_neighbors
       Call MPI_Irecv(inedge(1,i), len, MPI_REAL, nbr(i), tag, &
                      comm, requests(i), ierr)
    Enddo
    Do i=1,n_neighbors
       Call MPI_Send(edge(1,i), len, MPI_REAL, nbr(i), tag, &
                     comm, ierr)
    Enddo
    Call MPI_Waitall(n_neighbors, requests, statuses, ierr)
• Does not perform well in practice (at least on BG, SP). Why?
Understanding the Behavior: Timing Model
• Sends interleave
• Sends block (data larger than buffering will allow)
• Sends control timing
• Receives do not interfere with Sends
• Exchange can be done in 4 steps (down, right, up, left)
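The timing model above can be turned into a toy step counter. This is a sketch under assumptions that are not from the talk: each process attempts one blocking send per step in the fixed order down, right, up, left; a receiver takes in at most one message per step; and the lowest-ranked competing sender wins. The function name `exchange_steps` is hypothetical.

```python
def exchange_steps(rows, cols):
    """Count time steps for a halo exchange on a rows x cols mesh (no wraparound).

    Toy timing model (assumptions, not measurements):
      * each process sends to its neighbors in the order down, right, up, left,
        one blocking send at a time;
      * receives are pre-posted, but a process can take in only one message
        per step; the lowest-ranked competing sender wins;
      * a sender that loses retries the same send in the next step.
    Returns (number of steps, number of messages delivered).
    """
    def rank(r, c):
        return r * cols + c

    queues = {}
    for r in range(rows):
        for c in range(cols):
            order = [(r + 1, c), (r, c + 1), (r - 1, c), (r, c - 1)]
            queues[rank(r, c)] = [rank(nr, nc) for nr, nc in order
                                  if 0 <= nr < rows and 0 <= nc < cols]

    steps = delivered = 0
    while any(queues.values()):
        steps += 1
        busy = set()  # receivers that already accepted a message this step
        for sender in sorted(queues):
            if queues[sender] and queues[sender][0] not in busy:
                busy.add(queues[sender].pop(0))
                delivered += 1
    return steps, delivered


if __name__ == "__main__":
    print(exchange_steps(3, 3))
```

An interior process has four sends, so four steps is the floor; in this toy model contention at receivers can push the count above that floor. The exact count depends on the assumed send order and tie-breaking rule.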
Mesh Exchange - Steps 1 to 6
• Exchange data on a mesh
[Figure sequence, slides 22 to 27: one diagram per step of the exchange]
Timeline from IBM SP
• Note that process 1 finishes last, as predicted
Distribution of Sends
Why Six Steps?
• Ordering of Sends introduces delays when there is contention at the receiver
• Takes roughly twice as long as it should
• Bandwidth is being wasted
• Same thing would happen if using memcpy and shared memory
• The interference of communication is why adding an MPI_Barrier (normally an unnecessary operation that reduces performance) can occasionally increase performance. But don’t add MPI_Barrier to your code, please :)
Thinking about Broadcasts
• MPI_Bcast( buf, 100000, MPI_DOUBLE, … );
• Use a tree-based distribution:
• Use a pipeline: send the message in b-byte pieces. This allows each subtree to begin communication after b bytes are sent
• Improves total performance:
  ♦ Root process takes same time (asymptotically)
  ♦ Other processes wait less
• Time to reach leaf is b log p + (n-b), rather than n log p
• Special hardware and other algorithms can be used …
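The two leaf-arrival expressions can be compared directly. This sketch assumes a tree with log2(p) levels and measures time in units of one byte transfer; latency is ignored, as in the slide's expressions. The function names and the example values of p and b are illustrative assumptions.

```python
import math


def tree_time(n, p):
    """Unpipelined tree broadcast: all n bytes cross each of log2(p) levels."""
    return n * math.log2(p)


def pipelined_time(n, p, b):
    """Pipelined tree: the first b-byte piece needs b*log2(p) to reach a leaf,
    and the remaining n - b bytes stream in behind it."""
    return b * math.log2(p) + (n - b)


if __name__ == "__main__":
    n = 100000 * 8   # 100000 doubles, as in the MPI_Bcast call above
    p = 1024         # assumed process count
    for b in (1024, 4096, 65536, n):
        print(b, pipelined_time(n, p, b), tree_time(n, p))
```

Setting b = n recovers the unpipelined cost, and smaller pieces move the total toward n. A real implementation would also pay a per-piece startup cost, which this expression leaves out.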
Make Full Use of the Network
• Implement MPI_Bcast(buf,n,…) as
    MPI_Scatter(buf, n/p,…, buf+rank*n/p,…)
    MPI_Allgather(buf+rank*n/p, n/p,…,buf,…)
[Figure: processes P0 to P7]
Optimal Algorithm Costs
• Optimal cost is O(n) (O(p) terms don’t involve n) since scatter moves n data, and allgather also moves only n per process; these can use pipelining to move data as well
  ♦ Scatter by recursive bisection uses log p steps to move n(p-1)/p data
  ♦ Scatter by direct send uses p-1 steps to move n(p-1)/p data
  ♦ Recursive doubling allgather uses log p steps to move
    • n/p + 2n/p + 4n/p + … + (p/2)(n/p) = n(p-1)/p
  ♦ Bucket brigade allgather moves
    • n/p (p-1) times, or (p-1)n/p
• See, e.g., van de Geijn for more details
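The data volumes above can be checked with a small simulation of a broadcast built from a scatter followed by a recursive-doubling allgather. This is a sketch in plain Python lists rather than MPI; the function name `bcast_scatter_allgather` is hypothetical, and p is assumed to be a power of two that divides the buffer length.

```python
def bcast_scatter_allgather(buf, p):
    """Broadcast buf from process 0 to p processes as scatter + allgather.

    Assumes p is a power of two and divides len(buf). Returns the final
    buffer on every process and the number of items each process received
    during the allgather phase.
    """
    n = len(buf)
    chunk = n // p
    # Scatter: process i ends up owning piece i of the root's buffer.
    held = [{i: buf[i * chunk:(i + 1) * chunk]} for i in range(p)]
    received = [0] * p
    # Recursive doubling: swap everything held with a partner at distance
    # 1, 2, 4, ...; the amount held doubles at every step.
    dist = 1
    while dist < p:
        snapshot = [dict(h) for h in held]
        for rank in range(p):
            partner = rank ^ dist
            received[rank] += sum(len(v) for v in snapshot[partner].values())
            held[rank].update(snapshot[partner])
        dist *= 2
    result = [[x for i in range(p) for x in held[rank][i]] for rank in range(p)]
    return result, received


if __name__ == "__main__":
    result, received = bcast_scatter_allgather(list(range(64)), 8)
    print(all(r == list(range(64)) for r in result), received)
```

With n = 64 and p = 8 each process receives 8 + 16 + 32 items in the allgather, which is n(p-1)/p.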
Is it communication avoiding or minimum solution time?
• Example: non minimum collective algorithms
• Work of Paul Sack; see “Faster topology-aware collective algorithms through non-minimal communication”, PPoPP 2012
• Lesson: minimum communication need not be optimal
Allgather
[Figure: allgather input and output, blocks 1 2 3 4]
Allgather: recursive doubling
Initial data (4 x 4 process grid, one block per process):
a b c d
e f g h
i j k l
m n o p
After step 1:
ab ab cd cd
ef ef gh gh
ij ij kl kl
mn mn op op
After step 2:
abcd abcd abcd abcd
efgh efgh efgh efgh
ijkl ijkl ijkl ijkl
mnop mnop mnop mnop
After step 3:
abcdefgh abcdefgh abcdefgh abcdefgh
abcdefgh abcdefgh abcdefgh abcdefgh
ijklmnop ijklmnop ijklmnop ijklmnop
ijklmnop ijklmnop ijklmnop ijklmnop
After step 4:
abcdefghijklmnop abcdefghijklmnop abcdefghijklmnop abcdefghijklmnop
abcdefghijklmnop abcdefghijklmnop abcdefghijklmnop abcdefghijklmnop
abcdefghijklmnop abcdefghijklmnop abcdefghijklmnop abcdefghijklmnop
abcdefghijklmnop abcdefghijklmnop abcdefghijklmnop abcdefghijklmnop
$T = (\lg P)\alpha + n(P-1)\beta$
Problem: Recursive-doubling
• No congestion model:
  ♦ $T = (\lg P)\alpha + n(P-1)\beta$
• Congestion on torus:
  ♦ $T \approx (\lg P)\alpha + (5/24)\, n P^{4/3} \beta$
• Congestion on Clos network:
  ♦ $T \approx (\lg P)\alpha + (nP/\mu)\beta$
• Solution approach: move smallest amounts of data the longest distance
42
Allgather: recursive halving
42
a b c d
e f g h
i j k l
m n o p
![Page 43: Finding the Happy Medium: Tradeoffs in Communication ......17 Redundant Solves Redundant Solution When applied to model problem on Hera, there is a speedup region like for additive](https://reader034.fdocuments.us/reader034/viewer/2022042418/5f34396c3e07dc74c714ac59/html5/thumbnails/43.jpg)
43
Allgather: recursive halving
ac bd ac bd
eg fh eg fh
ik jl ik jl
mo np mo np
![Page 44: Finding the Happy Medium: Tradeoffs in Communication ......17 Redundant Solves Redundant Solution When applied to model problem on Hera, there is a speedup region like for additive](https://reader034.fdocuments.us/reader034/viewer/2022042418/5f34396c3e07dc74c714ac59/html5/thumbnails/44.jpg)
44
Allgather: recursive halving
acik bdjl acik bdjl
egmo fhnp egmo fhnp
acik bdjl acik bdjl
egmo fhnp egmo fhnp
![Page 45: Finding the Happy Medium: Tradeoffs in Communication ......17 Redundant Solves Redundant Solution When applied to model problem on Hera, there is a speedup region like for additive](https://reader034.fdocuments.us/reader034/viewer/2022042418/5f34396c3e07dc74c714ac59/html5/thumbnails/45.jpg)
45
Allgather: recursive halving
acikbdjl acikbdjl acikbdjl acikbdjl
egmofhnp egmofhnp egmofhnp egmofhnp
acikbdjl acikbdjl acikbdjl acikbdjl
egmofhnp egmofhnp egmofhnp egmofhnp
![Page 46: Finding the Happy Medium: Tradeoffs in Communication ......17 Redundant Solves Redundant Solution When applied to model problem on Hera, there is a speedup region like for additive](https://reader034.fdocuments.us/reader034/viewer/2022042418/5f34396c3e07dc74c714ac59/html5/thumbnails/46.jpg)
46
Allgather: recursive halving
acikbdjl egmofhnp
acikbdjl egmofhnp
acikbdjl egmofhnp
acikbdjl egmofhnp
acikbdjl egmofhnp
acikbdjl egmofhnp
acikbdjl egmofhnp
acikbdjl egmofhnp
acikbdjl egmofhnp
acikbdjl egmofhnp
acikbdjl egmofhnp
acikbdjl egmofhnp
acikbdjl egmofhnp
acikbdjl egmofhnp
acikbdjl egmofhnp
acikbdjl egmofhnp
T=(lg P)α + (7/6)nPβ
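The three cost expressions quoted on these slides can be compared numerically. This is a sketch, not a reproduction of the cited analysis: the values of α, β, and n are illustrative assumptions, and the (7/6) expression is assumed to be the torus figure for recursive halving, which the slides do not restate.

```python
import math


def t_doubling_ideal(n, P, alpha, beta):
    """Recursive doubling with no congestion: (lg P) alpha + n (P-1) beta."""
    return math.log2(P) * alpha + n * (P - 1) * beta


def t_doubling_torus(n, P, alpha, beta):
    """Recursive doubling with torus congestion: (lg P) alpha + (5/24) n P^(4/3) beta."""
    return math.log2(P) * alpha + (5.0 / 24.0) * n * P ** (4.0 / 3.0) * beta


def t_halving(n, P, alpha, beta):
    """Recursive halving as quoted on the slide: (lg P) alpha + (7/6) n P beta."""
    return math.log2(P) * alpha + (7.0 / 6.0) * n * P * beta


if __name__ == "__main__":
    alpha, beta, n = 5e-6, 3e-9, 1024   # assumed latency, per-byte time, bytes per process
    for P in (64, 512, 4096, 32768):
        print(P,
              t_doubling_ideal(n, P, alpha, beta),
              t_doubling_torus(n, P, alpha, beta),
              t_halving(n, P, alpha, beta))
```

The congested doubling term grows like P^(4/3) while the halving term grows like P, so under these formulas halving pulls ahead once P is large enough, while staying within a factor of about 7/6 of the congestion-free bandwidth term.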
New problem: data misordered
• Solution: shuffle input data
  ♦ Could shuffle at end (redundant work; all processes shuffle)
  ♦ Could use non-contiguous data moves
  ♦ Shuffle data on network…
Solution: Input shuffle
Original data:
a b c d
e f g h
i j k l
m n o p
After input shuffle:
a e b f
i m j n
c g d h
k o l p
After step 1:
ab ef ab ef
ij mn ij mn
cd gh cd gh
kl op kl op
After step 3:
abcdefgh abcdefgh abcdefgh abcdefgh
ijklmnop ijklmnop ijklmnop ijklmnop
abcdefgh abcdefgh abcdefgh abcdefgh
ijklmnop ijklmnop ijklmnop ijklmnop
$T = (1 + \lg P)\alpha + (7/6)\, n P \beta$
$T \approx (\lg P)\alpha + (7/6)\, n P \beta$
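The letter grids on these slides can be reproduced with a short simulation of pairwise exchanges on a 4 x 4 grid. This is a sketch; the exchange schedules and the rule that the lower-coordinate partner's data comes first are inferred from the grids shown, not stated in the talk, and the names below are hypothetical.

```python
PLAIN = ["abcd", "efgh", "ijkl", "mnop"]      # one block per process, row-major
SHUFFLED = ["aebf", "imjn", "cgdh", "kolp"]   # input shuffle as shown on the slide

# Each step is (axis, distance): exchange along a row or along a column.
DOUBLING = [("row", 1), ("row", 2), ("col", 1), ("col", 2)]
HALVING = [("row", 2), ("col", 2), ("row", 1), ("col", 1)]


def allgather(grid, steps):
    """Run pairwise exchanges on a 4 x 4 grid and return each process's data."""
    data = {(r, c): grid[r][c] for r in range(4) for c in range(4)}
    for axis, dist in steps:
        new = {}
        for (r, c), mine in data.items():
            # The partner differs in one coordinate, found by XOR with the distance.
            partner = (r, c ^ dist) if axis == "row" else (r ^ dist, c)
            theirs = data[partner]
            # Data from the lower-coordinate partner is placed first.
            new[(r, c)] = mine + theirs if (r, c) < partner else theirs + mine
        data = new
    return data


if __name__ == "__main__":
    for name, grid, steps in (("doubling", PLAIN, DOUBLING),
                              ("halving", PLAIN, HALVING),
                              ("halving + shuffle", SHUFFLED, HALVING)):
        print(name, sorted(set(allgather(grid, steps).values())))
```

Halving on the plain layout leaves every process with the same misordered string, and applying the same schedule to the shuffled layout restores alphabetical order, which is the point of shuffling the input.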
Evaluation: Intrepid BlueGene/P at ANL
• 40k-node system
  ♦ Each is 4 x 850 MHz PowerPC 450
• 512+ nodes is 3d torus; fewer is 3d mesh
• XLC -O4
• 375 MB/s delivered per link
  ♦ 7% penalty using all 6 links both ways
Allgather performance
[Figures, slides 54 and 55: allgather performance]
Notes on Allgather
• Bucket algorithm (not described here) exploits multiple communication engines on BG
• Analysis shows performance near optimal
• An alternative to the data reorder step is an in-memory move; analysis shows similar performance, and measurements show the reorder step is faster on the tested systems
Synchronization and OS Noise
• “Characterizing the Influence of System Noise on Large-Scale Applications by Simulation,” Torsten Hoefler, Timo Schneider, Andrew Lumsdaine ♦ Best Paper, SC10
• Next 5 slides based on this talk…
A Noisy Example – Dissemination Barrier
• Process 4 is delayed
  ♦ Noise propagates “wildly” (of course deterministic)
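How a single delay spreads through a dissemination barrier can be sketched with a simple recurrence. This is a toy model, far simpler than LogGOPS, under two assumptions: each message costs a fixed latency, and sending is free. The function name and the delay value are illustrative.

```python
def dissemination_barrier(start_times, latency=1.0):
    """Return the time at which each process leaves a dissemination barrier.

    In the round with distance d, process i sends to (i + d) mod p and waits
    for the message from (i - d) mod p. A process proceeds once it is ready
    itself and the incoming message has arrived.
    """
    p = len(start_times)
    t = list(start_times)
    dist = 1
    while dist < p:
        # The comprehension reads the previous round's times for every process.
        t = [max(t[i], t[(i - dist) % p] + latency) for i in range(p)]
        dist *= 2
    return t


if __name__ == "__main__":
    clean = dissemination_barrier([0.0] * 8)
    noisy = dissemination_barrier([0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0])
    print(clean)
    print(noisy)
```

With process 4 delayed, no process can leave the barrier before the delayed process arrived, and the exit times become uneven across processes even though the model is fully deterministic.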
LogGOPS Simulation Framework
• Detailed analytical modeling is hard!
• Model-based (LogGOPS) simulator
  ♦ Available at: http://www.unixer.de/LogGOPSim
  ♦ Discrete-event simulation of MPI traces (<2% error) or collective operations (<1% error)
  ♦ > $10^6$ events per second
• Allows for trace-based noise injection
• Validation
  ♦ Simulations reproduce measurements by Beckman and Ferreira well
• Details: Hoefler et al. LogGOPSim – Simulating Large-Scale Applications in the LogGOPS Model (Workshop on Large-Scale System and Application Performance, Best Paper)
Single Collective Operations and Noise
• 1 Byte, Dissemination, regular noise, 1000 Hz, 100 µs
[Figure legend: deterministic, median, 2nd quartile, 3rd quartile, outliers]
Saving Allreduce
• One common suggestion is to avoid using Allreduce
  ♦ But algorithms with dot products are among the best known
  ♦ Can sometimes aggregate the data to reduce the number of separate Allreduce operations
  ♦ But better is to reduce the impact of the synchronization by hiding the Allreduce behind other operations (in MPI, using MPI_Iallreduce)
• We can adapt CG to nonblocking Allreduce with some added floating point (but perhaps little time cost)
The Conjugate Gradient Algorithm
    While (not converged)
        niters += 1;
        s = A * p;
        t = p' * s;
        alpha = gmma / t;
        x = x + alpha * p;
        r = r - alpha * s;
        if rnorm2 < tol2 ; break ; end
        z = M * r;
        gmmaNew = r' * z;
        beta = gmmaNew / gmma;
        gmma = gmmaNew;
        p = z + beta * p;
    end
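For reference, the loop above can be made runnable. This is a minimal serial sketch, not the author's code: the initialization (x = 0, r = b, z = M r, p = z, gmma = r'z), the definition rnorm2 = r'r, and the helper names are assumptions added so that it executes. Each call to `dot` marks where a distributed code would need an Allreduce.

```python
def matvec(A, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    """Dot product; in a distributed code this is where an Allreduce occurs."""
    return sum(a * b for a, b in zip(u, v))


def pcg(A, b, apply_M, tol=1e-10, max_iters=1000):
    """Preconditioned CG following the loop on the slide."""
    x = [0.0] * len(b)
    r = list(b)              # r = b - A*x with x = 0 (assumed start)
    z = apply_M(r)
    p = list(z)
    gmma = dot(r, z)
    tol2 = tol * tol
    niters = 0
    while niters < max_iters:
        niters += 1
        s = matvec(A, p)
        t = dot(p, s)
        alpha = gmma / t
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * si for ri, si in zip(r, s)]
        rnorm2 = dot(r, r)   # squared residual norm (assumed definition)
        if rnorm2 < tol2:
            break
        z = apply_M(r)
        gmma_new = dot(r, z)
        beta = gmma_new / gmma
        gmma = gmma_new
        p = [zi + beta * pi for zi, pi in zip(z, p)]
    return x, niters


if __name__ == "__main__":
    n = 8
    # 1-D Laplacian (2 on the diagonal, -1 off it) as a small SPD test matrix.
    A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
          for j in range(n)] for i in range(n)]
    x_true = [float(i + 1) for i in range(n)]
    b = matvec(A, x_true)
    x, iters = pcg(A, b, lambda r: [ri / 2.0 for ri in r])  # Jacobi preconditioner
    print(iters, max(abs(xi - ti) for xi, ti in zip(x, x_true)))
```

Every iteration has two or three dot products (p's, r'z, and r'r for the test), and each one is a global synchronization point in a parallel run, which is what the nonblocking variant tries to hide.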
A Nonblocking Version of CG
    While (not converged)
        niters += 1;
        s = Z + beta * s;
        % Can begin p'*s
        S = M * s;
        t = p' * s;
        alpha = gmma / t;
        x = x + alpha * p;
        r = r - alpha * s;
        % Can move this into the subsequent dot product
        if rnorm2 < tol2 ; break ; end
        z = z - alpha * S;
        % Can begin r'*z here (also begin r'*r for convergence test)
        Z = A * z;
        gmmaNew = r' * z;
        beta = gmmaNew / gmma;
        gmma = gmmaNew;
        % Could move x = x + alpha p here to minimize p moves.
        p = z + beta * p;
    end
![Page 64: Finding the Happy Medium: Tradeoffs in Communication ......17 Redundant Solves Redundant Solution When applied to model problem on Hera, there is a speedup region like for additive](https://reader034.fdocuments.us/reader034/viewer/2022042418/5f34396c3e07dc74c714ac59/html5/thumbnails/64.jpg)
64
Key Features
• Collective operations overlap significant local computation
• Trades extra local work for overlapped communication
♦ But may not need more memory loads, so performance cost may be comparable
![Page 65: Finding the Happy Medium: Tradeoffs in Communication ......17 Redundant Solves Redundant Solution When applied to model problem on Hera, there is a speedup region like for additive](https://reader034.fdocuments.us/reader034/viewer/2022042418/5f34396c3e07dc74c714ac59/html5/thumbnails/65.jpg)
65
Performance Analysis
• On a pure floating-point basis, the nonblocking version requires 2 more DAXPY operations
• A closer analysis shows that some operations can be merged, reducing the amount of memory motion
♦ Same amount of memory motion as “classic” CG
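The slides do not say which operations were merged, so the following Python sketch shows only one plausible fusion, chosen here for illustration. The updates of `r` and `z` and the local parts of the dot products `r'*z` and `r'*r` are combined into a single sweep, so each vector element is loaded once rather than once per operation. The function names are hypothetical.

```python
def update_then_dot(r, z, s, S, alpha):
    """Unfused: two vector updates and two dot products, each a separate sweep."""
    r = [ri - alpha * si for ri, si in zip(r, s)]
    z = [zi - alpha * Si for zi, Si in zip(z, S)]
    rz = sum(ri * zi for ri, zi in zip(r, z))
    rr = sum(ri * ri for ri in r)
    return r, z, rz, rr

def fused_update_dot(r, z, s, S, alpha):
    """Fused: one sweep updates r and z in place and accumulates both dot products."""
    rz = 0.0
    rr = 0.0
    for i in range(len(r)):
        ri = r[i] - alpha * s[i]
        zi = z[i] - alpha * S[i]
        r[i] = ri
        z[i] = zi
        rz += ri * zi
        rr += ri * ri
    return r, z, rz, rr

# Illustrative values; the fused version modifies its list arguments in place.
unfused = update_then_dot([4.0, 2.0], [1.0, 3.0], [2.0, 2.0], [1.0, 2.0], 0.5)
fused = fused_update_dot([4.0, 2.0], [1.0, 3.0], [2.0, 2.0], [1.0, 2.0], 0.5)
print(unfused)
print(fused)
```

In a compiled code, the fused loop reads `r`, `z`, `s`, and `S` once each, which is how the extra floating-point work can come at little extra memory traffic.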
![Page 66: Finding the Happy Medium: Tradeoffs in Communication ......17 Redundant Solves Redundant Solution When applied to model problem on Hera, there is a speedup region like for additive](https://reader034.fdocuments.us/reader034/viewer/2022042418/5f34396c3e07dc74c714ac59/html5/thumbnails/66.jpg)
66
Processes and SMP nodes
• HPC users typically believe that their code “owns” all of the cores all of the time
♦ The reality is that this was never true, but they did have all of the cores for the same fraction of time when there was one core per node
• We can use a simple performance model to check the assertion and then use measurements to identify the problem and suggest fixes.
• Consider a simple Jacobi sweep on a regular mesh, with every core having the same amount of work. How are run times distributed?
![Page 67: Finding the Happy Medium: Tradeoffs in Communication ......17 Redundant Solves Redundant Solution When applied to model problem on Hera, there is a speedup region like for additive](https://reader034.fdocuments.us/reader034/viewer/2022042418/5f34396c3e07dc74c714ac59/html5/thumbnails/67.jpg)
67
Sharing an SMP
• Having many cores available makes everyone think that they can use them to solve other problems (“no one would use all of them all of the time”)
• However, compute-bound scientific calculations are often written as if all compute resources are owned by the application
• Such static scheduling leads to performance loss
• Pure dynamic scheduling adds overhead, but is better
• Careful mixed strategies are even better
• Recent results give 10-16% performance improvements on large, scalable systems
• Thanks to Vivek Kale, EuroMPI’10
![Page 68: Finding the Happy Medium: Tradeoffs in Communication ......17 Redundant Solves Redundant Solution When applied to model problem on Hera, there is a speedup region like for additive](https://reader034.fdocuments.us/reader034/viewer/2022042418/5f34396c3e07dc74c714ac59/html5/thumbnails/68.jpg)
68
Performance of Task Scheduling Strategies (1 node)
By doing the first portion of work using static scheduling and then doing the remainder of the work using dynamic scheduling, we achieve a consistent performance gain of nearly 8% over the traditional static scheduling method.
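The three strategies can be compared with a toy model. This Python sketch is not the experiment behind the figure: the per-task costs, the single slowed core, the per-task dequeue `overhead`, and the function names are all assumptions made for illustration, and the numbers it produces are not meant to reproduce the measured gains. Each core first runs `static_per_core` tasks with no overhead, then the remaining tasks are handed out one at a time to whichever core becomes free first.

```python
import heapq

def dynamic_makespan(start_times, cost, n_tasks, overhead):
    """Greedy self-scheduling: the core that is free earliest takes the next task."""
    heap = [(t, core) for core, t in enumerate(start_times)]
    heapq.heapify(heap)
    finish = list(start_times)
    for _ in range(n_tasks):
        t, core = heapq.heappop(heap)
        t += cost[core] + overhead   # task time on this core plus queue overhead
        finish[core] = t
        heapq.heappush(heap, (t, core))
    return max(finish)

def makespan(cost, n_tasks, static_per_core, overhead):
    """Static phase (fixed share per core, no overhead) followed by a dynamic phase."""
    start = [static_per_core * c for c in cost]
    remaining = n_tasks - static_per_core * len(cost)
    return dynamic_makespan(start, cost, remaining, overhead)

# Four cores; the last one is 50% slower because it is shared (assumed values).
cost = [1.0, 1.0, 1.0, 1.5]
n_tasks, overhead = 400, 0.05

t_static = makespan(cost, n_tasks, 100, overhead)   # everything assigned up front
t_dynamic = makespan(cost, n_tasks, 0, overhead)    # everything from the queue
t_mixed = makespan(cost, n_tasks, 70, overhead)     # 70% static, 30% dynamic
print(t_static, t_dynamic, t_mixed)
```

With these assumed parameters the static schedule is limited by the slow core, pure dynamic scheduling pays the queue overhead on every task, and the mixed schedule finishes earliest. Different costs or overheads can change the ordering.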
![Page 69: Finding the Happy Medium: Tradeoffs in Communication ......17 Redundant Solves Redundant Solution When applied to model problem on Hera, there is a speedup region like for additive](https://reader034.fdocuments.us/reader034/viewer/2022042418/5f34396c3e07dc74c714ac59/html5/thumbnails/69.jpg)
Noise Amplification with an Increasing Number of Nodes
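A simple probabilistic model gives a feel for why noise grows with scale. This Python sketch is an assumption made here, not the model behind the figure. It supposes each core is independently delayed by a fixed amount `delay` with probability `p_core` in each step, and that a step ends only when the slowest core finishes. The names are hypothetical.

```python
def prob_step_delayed(p_core, n_cores):
    """Probability that at least one of n_cores independent cores is delayed."""
    return 1.0 - (1.0 - p_core) ** n_cores

def expected_step_time(t_work, delay, p_core, n_cores):
    """Expected time of a synchronized step under the fixed-delay assumption."""
    return t_work + delay * prob_step_delayed(p_core, n_cores)

# A 1% chance of delay per core per step (assumed value).
for n in (1, 16, 256, 4096):
    print(n, prob_step_delayed(0.01, n), expected_step_time(1.0, 0.2, 0.01, n))
```

Under these assumptions, a delay that is rare on any single core becomes nearly certain to hit some core in every step once the core count is large, so each synchronized step pays the delay.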
![Page 70: Finding the Happy Medium: Tradeoffs in Communication ......17 Redundant Solves Redundant Solution When applied to model problem on Hera, there is a speedup region like for additive](https://reader034.fdocuments.us/reader034/viewer/2022042418/5f34396c3e07dc74c714ac59/html5/thumbnails/70.jpg)
70
Experiences
• Paraphrasing either Lincoln or PT Barnum: You own some of the cores all of the time and all of the cores some of the time, but you don’t own all of the cores all of the time
• Translation: a priori data decompositions that were effective on single core processors are no longer effective on multicore processors
• We see this in recommendations to “leave one core to the OS”
♦ What about other users of cores, like … the runtime system?
![Page 71: Finding the Happy Medium: Tradeoffs in Communication ......17 Redundant Solves Redundant Solution When applied to model problem on Hera, there is a speedup region like for additive](https://reader034.fdocuments.us/reader034/viewer/2022042418/5f34396c3e07dc74c714ac59/html5/thumbnails/71.jpg)
71
Observations
• Details of architecture impact performance
♦ Performance models can guide choices but must have enough (and only enough) detail
♦ These models need only enough accuracy to guide decisions; they do not need to be predictive
• Synchronization is the enemy
• Many techniques have been known for decades
♦ We should be asking why they aren’t used, and what role development environments should have
![Page 72: Finding the Happy Medium: Tradeoffs in Communication ......17 Redundant Solves Redundant Solution When applied to model problem on Hera, there is a speedup region like for additive](https://reader034.fdocuments.us/reader034/viewer/2022042418/5f34396c3e07dc74c714ac59/html5/thumbnails/72.jpg)
72
Some Final Questions
• Is it communication avoiding or minimum solution time?
• Is it communication avoiding or latency/communication hiding?
• Is it synchronization reducing or better load balancing?
• Is it the programming model, its implementation, or its use?
• How do we answer these questions?