![Page 1: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/1.jpg)
Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links
Ernie Chan
![Page 2: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/2.jpg)
Authors
Ernie Chan, Robert van de Geijn
Department of Computer Sciences
The University of Texas at Austin
William Gropp, Rajeev Thakur
Mathematics and Computer Science Division
Argonne National Laboratory
![Page 3: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/3.jpg)
Testbed Architecture
IBM Blue Gene/L: 3D torus point-to-point interconnect network
One rack: 1024 dual-processor nodes, two 8 x 8 x 8 midplanes
Special feature to send simultaneously: use multiple calls to MPI_Isend
![Page 4: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/4.jpg)
Outline
Testbed Architecture
Model of Parallel Computation
Sending Simultaneously
Collective Communication
Generalized Algorithms
Performance Results
Conclusion
![Page 5: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/5.jpg)
Model of Parallel Computation
Target architectures: distributed-memory parallel architectures
Indexing: p computational nodes, indexed 0 … p - 1
Logically fully connected: a node can send directly to any other node
![Page 6: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/6.jpg)
Model of Parallel Computation
Topology: N-dimensional torus
[Figure: torus with 16 nodes, labeled 0 through 15]
![Page 7: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/7.jpg)
Model of Parallel Computation
Old model of communicating between nodes: unidirectional sending or receiving
![Page 8: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/8.jpg)
Model of Parallel Computation
Old model of communicating between nodes: simultaneous sending and receiving
![Page 9: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/9.jpg)
Model of Parallel Computation
Old model of communicating between nodes: bidirectional exchange
![Page 10: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/10.jpg)
Model of Parallel Computation
Communicating between nodes: a node can send to or receive from 2N other nodes simultaneously along its 2N different links
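To make the 2N-links statement concrete, the sketch below enumerates the nearest neighbors of a node on an N-dimensional torus. The function name `torus_neighbors` and the coordinate representation are illustrative assumptions, not part of the slides; the 8 x 8 x 8 shape is the midplane size quoted on the testbed slide.

```python
from typing import List, Sequence, Tuple


def torus_neighbors(coord: Sequence[int], dims: Sequence[int]) -> List[Tuple[int, ...]]:
    """Return the 2N nearest neighbors of `coord` on an N-dimensional torus.

    Hypothetical helper for illustration. The neighbors are distinct only
    when every dimension has at least 3 nodes; smaller dimensions make the
    -1 and +1 neighbors coincide.
    """
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, +1):
            nbr = list(coord)
            nbr[axis] = (coord[axis] + step) % size  # wrap around the torus
            neighbors.append(tuple(nbr))
    return neighbors


# One Blue Gene/L midplane is an 8 x 8 x 8 torus, so N = 3 and 2N = 6.
links = torus_neighbors((0, 0, 0), (8, 8, 8))
print(len(links), links)
```

With N = 3 this gives six neighbors, which matches the later benchmark of sending simultaneously with 1 to 6 neighbors.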
![Page 11: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/11.jpg)
Model of Parallel Computation
Communicating between nodes: a node cannot perform a bidirectional exchange on any link while sending or receiving simultaneously with multiple nodes
![Page 12: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/12.jpg)
Model of Parallel Computation
Cost of Communication
α + nβ
α: startup time (latency)
n: number of bytes to communicate
β: transmission time per byte (the bandwidth term)
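As a quick illustration of the α + nβ model, the sketch below evaluates the cost of one message. The function name and the numeric values of `ALPHA` and `BETA` are placeholders chosen for the example, not measurements from Blue Gene/L.

```python
def message_cost(n_bytes: float, alpha: float, beta: float) -> float:
    """Cost of sending one message of n_bytes under the alpha + n * beta model."""
    return alpha + n_bytes * beta


# Hypothetical machine parameters, for illustration only.
ALPHA = 3e-6  # startup time in seconds (assumed)
BETA = 1e-8   # transmission time per byte in seconds (assumed)

for n in (8, 4 * 1024 * 1024):
    print(f"{n:>8} bytes: {message_cost(n, ALPHA, BETA):.6e} s")
```

For short messages the α term dominates, while for long messages the nβ term does, which is why the deck later separates short-vector from long-vector algorithms.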
![Page 14: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/14.jpg)
Sending Simultaneously
Old Cost of Communication with Sends to Multiple Nodes
Cost to send to m separate nodes:
(α + nβ) m
![Page 15: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/15.jpg)
Sending Simultaneously
New Cost of Communication with Simultaneous Sends
(α + nβ) m
can be replaced with
(α + nβ) + (α + nβ) (m - 1)
![Page 16: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/16.jpg)
Sending Simultaneously
New Cost of Communication with Simultaneous Sends
(α + nβ) m
can be replaced with
(α + nβ) + (α + nβ) (m - 1) τ
First term, (α + nβ): cost of one send
Second term, (α + nβ) (m - 1) τ: cost of extra sends
![Page 17: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/17.jpg)
Sending Simultaneously
New Cost of Communication with Simultaneous Sends
(α + nβ) m
can be replaced with
(α + nβ) + (α + nβ) (m - 1) τ
First term, (α + nβ): cost of one send
Second term, (α + nβ) (m - 1) τ: cost of extra sends
0 ≤ τ ≤ 1
![Page 18: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/18.jpg)
Sending Simultaneously
Benchmarking Sending Simultaneously
Log-log timing graphs
Midplane: 512 nodes
Sending simultaneously with 1 to 6 neighbors
Message sizes from 8 bytes to 4 MB
![Page 19: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/19.jpg)
Sending Simultaneously
![Page 20: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/20.jpg)
Sending Simultaneously
Cost of Communication with Simultaneous Sends
(α + nβ) (1 + (m - 1) τ)
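The sketch below compares the old cost (α + nβ) m with the simultaneous-send cost (α + nβ) (1 + (m - 1) τ). The function names are illustrative. `estimate_tau` is not from the slides: it is an assumed rearrangement of the same formula, showing one way τ could be recovered from two timings.

```python
def sequential_cost(n: float, m: int, alpha: float, beta: float) -> float:
    """Old model: m sends issued one after another."""
    return (alpha + n * beta) * m


def simultaneous_cost(n: float, m: int, alpha: float, beta: float, tau: float) -> float:
    """New model: one full-cost send plus (m - 1) extra sends scaled by tau."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must satisfy 0 <= tau <= 1")
    return (alpha + n * beta) * (1 + (m - 1) * tau)


def estimate_tau(t_one: float, t_m: float, m: int) -> float:
    """Solve t_m = t_one * (1 + (m - 1) * tau) for tau (assumed rearrangement)."""
    if m < 2:
        raise ValueError("need at least two simultaneous sends")
    return (t_m / t_one - 1) / (m - 1)


# tau = 1 reproduces the old cost; tau = 0 makes extra sends free.
print(simultaneous_cost(4, 6, 2.0, 0.5, tau=1.0), sequential_cost(4, 6, 2.0, 0.5))
print(simultaneous_cost(4, 6, 2.0, 0.5, tau=0.0))
```

The two extremes bracket the model: τ = 1 means the links give no overlap at all, and τ = 0 means sending to m nodes costs the same as sending to one.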
![Page 21: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/21.jpg)
Sending Simultaneously
![Page 22: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/22.jpg)
Sending Simultaneously
![Page 24: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/24.jpg)
Collective Communication
Broadcast (Bcast): motivating example
[Figure: node contents before and after the broadcast]
![Page 26: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/26.jpg)
Generalized Algorithms
Short-vector algorithms: Minimum-Spanning Tree
Long-vector algorithms: Bucket Algorithm
![Page 27: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/27.jpg)
Generalized Algorithms
Minimum-Spanning Tree
![Page 28: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/28.jpg)
Generalized Algorithms
Minimum-Spanning Tree
Divide p nodes into N+1 partitions
![Page 29: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/29.jpg)
Generalized Algorithms
Minimum-Spanning Tree
Disjoint partitions on N-dimensional mesh
[Figure: mesh with 16 nodes, labeled 0 through 15]
![Page 30: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/30.jpg)
Generalized Algorithms
Minimum-Spanning Tree
Divide dimensions by a decrementing counter from N+1
[Figure: mesh with 16 nodes, labeled 0 through 15]
![Page 31: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/31.jpg)
Generalized Algorithms
Minimum-Spanning Tree
Now divide into 2N+1 partitions
[Figure: mesh with 16 nodes, labeled 0 through 15]
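The partitioning idea can be sketched as a small simulation. This is a simplified, assumed version that splits a one-dimensional index range into k contiguous partitions at every step; it does not reproduce the slides' mapping of partitions onto torus dimensions. The function name `mst_bcast` and the schedule format are hypothetical. Setting k = 2N + 1 follows the slide above, and k = 2 gives the usual two-way minimum-spanning-tree broadcast.

```python
from typing import List, Tuple


def mst_bcast(p: int, k: int, root: int = 0) -> List[List[Tuple[int, int]]]:
    """Simulate a k-way spanning-tree broadcast over nodes 0 .. p - 1.

    Returns one list of (source, destination) sends per communication step.
    Simplified sketch: partitions are contiguous index ranges, not torus blocks.
    """
    if k < 2 or not 0 <= root < p:
        raise ValueError("need k >= 2 and 0 <= root < p")
    schedule = []
    active = [(0, p, root)]  # (lo, hi, holder): range [lo, hi) with one data holder
    while any(hi - lo > 1 for lo, hi, _ in active):
        sends, next_active = [], []
        for lo, hi, holder in active:
            size = hi - lo
            parts = min(k, size)
            # Split [lo, hi) into `parts` nonempty, nearly equal partitions.
            bounds = [lo + (size * i) // parts for i in range(parts + 1)]
            for a, b in zip(bounds, bounds[1:]):
                if a <= holder < b:
                    next_active.append((a, b, holder))  # holder keeps its partition
                else:
                    sends.append((holder, a))  # first node of the partition receives
                    next_active.append((a, b, a))
        schedule.append(sends)
        active = next_active
    return schedule


# 16 nodes, N = 2: compare the two-way tree (k = 2) with k = 2N + 1 = 5.
for k in (2, 5):
    print(k, len(mst_bcast(16, k)), "steps")
```

In each step a holder sends to at most k - 1 nodes at once, so under the simultaneous-send model above each step would cost about (α + nβ) (1 + (k - 2) τ). That per-step figure is a derived estimate, not a number from the slides.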
![Page 33: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/33.jpg)
Performance Results
Single point-to-point communication
![Page 34: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/34.jpg)
Performance Results
my-bcast-MST
![Page 36: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/36.jpg)
Conclusion
IBM Blue Gene/L supports sending simultaneously
Benchmarking along with model checking verifies this claim
New generalized algorithms show clear performance gains
![Page 37: Ernie Chan](https://reader036.fdocuments.us/reader036/viewer/2022062309/56813537550346895d9c9d7b/html5/thumbnails/37.jpg)
Conclusion
Future Directions
Room for optimization to reduce implementation overhead
What if not using MPI_COMM_WORLD?
Possible new algorithm for Bucket Algorithm
Questions? [email protected]