Transcript of "Collective Communication Optimizations" (source: web.cse.ohio-state.edu/~panda.2/788/slides/5a_5c_collective.pdf)
Collective Communication Optimizations

Efficient Shared Memory and RDMA based Design for MPI_Allgather over InfiniBand
Amith R. Mamidala, Abhinav Vishnu and Dhabaleswar K. Panda

Scaling Alltoall Collective on Multi-core Systems
Rahul Kumar, Amith Mamidala and Dhabaleswar K. Panda

Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters
Krishna Kandalla, Hari Subramoni, Gopal Santhanaraman, Matthew Koop and Dhabaleswar K. Panda

Presented by: Md. Wasi-ur-Rahman
Efficient Shared Memory and RDMA based Design for MPI_Allgather over InfiniBand
Introduction
• Motivated by use in multi-core clusters
• Recent advances in multi-core architectures have enabled higher process density per node
• MPI is the most popular programming model for parallel applications
• MPI_Allgather is one of the most extensively used collective operations in MPI
• InfiniBand is widely deployed to support communication in large clusters
• RDMA provides the most efficient and scalable performance features
• An efficient MPI_Allgather is highly desirable for all MPI applications
MPI_Allgather: Recursive Doubling Algorithm
Recursive Algorithm (Contd.)
Recursive Doubling with multiple process/node
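The recursive doubling exchange described above can be sketched as a small simulation (an illustrative Python sketch, not the MVAPICH implementation): in each of the log2(p) rounds, rank i exchanges everything it currently holds with rank i XOR 2^k, so the data held per rank doubles each round.

```python
def recursive_doubling_allgather(vectors):
    """Simulate recursive-doubling allgather for a power-of-two number of
    processes. vectors[i] is the block contributed by rank i."""
    p = len(vectors)
    assert p & (p - 1) == 0, "recursive doubling assumes p is a power of two"
    # Each rank starts with only its own block, keyed by its own rank.
    buffers = [{i: vectors[i]} for i in range(p)]
    dist = 1
    while dist < p:
        # Round k: rank i exchanges its whole buffer with partner i XOR dist.
        snapshot = buffers
        buffers = [dict(b) for b in buffers]
        for rank in range(p):
            partner = rank ^ dist
            buffers[rank].update(snapshot[partner])
        dist *= 2
    # After log2(p) rounds, every rank holds all p blocks in rank order.
    return [[b[i] for i in range(p)] for b in buffers]

result = recursive_doubling_allgather([[r] for r in range(8)])
# Every rank ends up with [[0], [1], ..., [7]] after 3 rounds.
```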
Problems with this approach
1. No buffer sharing
2. No control over scheduling
3. No overlapping possible between network communication and data copying
Problem Statement
• How can the extra copy cost be avoided?
• Can data copying be overlapped with ongoing network operations?
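One common answer to the overlap question is double buffering: a sender thread (standing in for the NIC) transmits chunk k while the host copies chunk k+1. A minimal illustrative sketch, assuming simple `copy` and `send` callbacks (this is a generic pattern, not the paper's actual design):

```python
import queue
import threading

def pipelined_send(chunks, copy, send):
    """Toy double-buffering pipeline: a sender thread transmits chunk k
    while the main thread prepares (copies) chunk k+1."""
    q = queue.Queue(maxsize=1)  # at most one prepared chunk in flight
    sent = []

    def sender():
        while True:
            item = q.get()
            if item is None:  # sentinel: no more chunks
                return
            sent.append(send(item))

    t = threading.Thread(target=sender)
    t.start()
    for c in chunks:
        q.put(copy(c))  # copying chunk k+1 overlaps sending chunk k
    q.put(None)
    t.join()
    return sent

out = pipelined_send([1, 2, 3], copy=lambda c: c, send=lambda c: c * 10)
# out == [10, 20, 30]
```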
Algorithm
Performance Evaluation
Two comparisons:
1. The original algorithm vs. the new design
2. The non-overlapping vs. the overlapping version of the new design
Experimental Results
Experimental Results (Contd.)
Overlap Benefits
Conclusion & Future Work
• Implemented a common buffer for inter- and intra-node communication
• Incorporated the design into MVAPICH
• Future work: apply to MPI_Allgather algorithms for odd numbers of processes
• Application-level study
• Evaluation at higher core counts
Scaling Alltoall Collective on Multi-core Systems
Offload Architecture
• Network processing is offloaded to the network interface
• The NIC can send messages on its own, relieving the CPU
Onload Architecture
• In an onload architecture, the CPU is involved in communication in addition to performing computation
• No overlap between communication and computation is possible
Bi-directional Bandwidth : InfiniPath (Onload)
Bi-directional Bandwidth : ConnectX
Bi-directional Bandwidth : InfiniHost III (offload)
This may be due to congestion at the network interface when many connections use the network simultaneously
Motivation
Receive-side distribution is more costly than send-side aggregation
Problem Statement
• Can shared memory help avoid network transactions?
• Can AlltoAll performance be improved for multi-core clusters?
Leader Based Algorithm for AlltoAll
AlltoAll Leader based Algorithm
• With two cores per node, each core performs twice as many inter-node communications
• Latency almost doubles as the number of cores per node increases
Proposed Design
Step 1: Intra-node communication
Step 2: AlltoAll inter-node communication within each group
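The two steps can be illustrated with a toy simulation (a hypothetical Python sketch with per-node message aggregation at a leader rank; not the actual MVAPICH code). The key point is that the network carries one aggregated bundle per pair of nodes instead of one message per pair of cores:

```python
def leader_alltoall(data, cores_per_node):
    """Illustrative leader-based alltoall. data[src][dst] is the message
    from rank src to rank dst; ranks are grouped into nodes of
    cores_per_node, and the lowest rank on each node acts as leader."""
    p = len(data)
    n_nodes = p // cores_per_node
    node_of = lambda r: r // cores_per_node
    # Step 1 (intra-node): each rank hands its messages to its node leader,
    # aggregated per destination node via shared memory.
    agg = [[[] for _ in range(n_nodes)] for _ in range(n_nodes)]
    for src in range(p):
        for dst in range(p):
            agg[node_of(src)][node_of(dst)].append((src, dst, data[src][dst]))
    # Step 2 (inter-node): leaders perform an alltoall of the aggregated
    # bundles -- one network transfer per node pair.
    recv = [[] for _ in range(n_nodes)]
    for s in range(n_nodes):
        for d in range(n_nodes):
            recv[d].extend(agg[s][d])
    # Receive-side distribution from each leader to its cores (the costly
    # step highlighted in the motivation slide).
    out = [[None] * p for _ in range(p)]
    for node_msgs in recv:
        for src, dst, m in node_msgs:
            out[dst][src] = m
    return out

data = [[(s, d) for d in range(4)] for s in range(4)]
out = leader_alltoall(data, cores_per_node=2)
# out[dst][src] now holds the message data[src][dst]
```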
Performance Results: AlltoAll (InfiniPath)
InfiniPath with 512 byte message
AlltoAll: InfiniHost III
AlltoAll: ConnectX
CPMD Application
• CPMD makes extensive use of AlltoAll
• The proposed algorithm shows better performance on a 128-core system
• This demonstrates scalability: as the system size increases, the new algorithm performs better
Conclusion & Future Work
• The proposed design reduces MPI_Alltoall time by 55%
• Speeds up CPMD by 33%
• Future work: evaluate on 10GigE systems
• Extend this work to other collectives
Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters
Allgather
• Each process broadcasts a vector to every other process
• Algorithms used:
  - Recursive Doubling (small messages): t_comm = ts * log(p) + tw * (p - 1) * m
  - Ring (large messages): t_comm = (ts + tw * m) * (p - 1)
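Plugging hypothetical network parameters into the two cost models shows why recursive doubling wins for small messages (the `ts` and `tw` values below are illustrative assumptions, not measurements; note also that both models share the same bandwidth term `tw * (p - 1) * m`, so the ring's advantage for large messages comes from nearest-neighbor contention effects this simple model omits):

```python
import math

# Cost models from the slide: ts = startup latency, tw = per-byte time,
# p = number of processes, m = message size per process.
def t_recursive_doubling(p, m, ts, tw):
    return ts * math.log2(p) + tw * (p - 1) * m

def t_ring(p, m, ts, tw):
    return (ts + tw * m) * (p - 1)

# Hypothetical parameters: 2 us startup, 0.001 us per byte, 64 processes.
ts, tw, p = 2.0, 0.001, 64
small_rd = t_recursive_doubling(p, 16, ts, tw)   # ~13.0 us: 6 startups
small_ring = t_ring(p, 16, ts, tw)               # ~127.0 us: 63 startups
# For 16-byte messages the startup term dominates, so paying ts only
# log2(p) times instead of (p - 1) times is the decisive saving.
```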
Scaling on Multi-cores
Problem Statement
• Can an algorithm be designed that is both multi-core and NUMA aware, achieving better performance and scalability as core counts and system sizes increase?
Single Leader – Performance
Proposed Multi-Leader Scheme
Multi-Leader Scheme – Step 1
Multi-Leader Scheme – Step 2
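The two steps of the multi-leader scheme can be sketched as follows (an illustrative Python simulation assuming the first rank of each group acts as its leader; not the paper's implementation):

```python
def multi_leader_allgather(vectors, ranks_per_leader):
    """Illustrative multi-leader allgather. Ranks are partitioned into
    groups of ranks_per_leader; one leader per group handles the network."""
    p = len(vectors)
    n_groups = p // ranks_per_leader
    # Step 1: intra-node gather -- each leader collects its group's
    # vectors through shared memory.
    gathered = [
        [vectors[g * ranks_per_leader + i] for i in range(ranks_per_leader)]
        for g in range(n_groups)
    ]
    # Step 2: inter-node allgather among leaders only, after which each
    # leader shares the full result with its group via shared memory.
    full = [v for group in gathered for v in group]
    return [list(full) for _ in range(p)]

res = multi_leader_allgather([[r] for r in range(8)], ranks_per_leader=4)
# Every rank receives [[0], [1], ..., [7]]
```

Using multiple leaders per node (rather than one) spreads the network and memory traffic across sockets, which is what makes the scheme NUMA aware.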
Performance Results
Multi-Leader pt2pt vs shmem
Performance in large scale multi-cores
Proposed Unified Scheme
Conclusion & Future Work
• The proposed multi-leader scheme shows improved scalability and reduced memory contention
• Future work: devise an algorithm that chooses the number of leaders optimally in every scenario
• Study real-world applications
• Examine the benefits of kernel-based zero-copy intra-node exchanges for large messages
Thank You