![Page 1: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/1.jpg)
A uGNI-Based Asynchronous Message-driven Runtime System for Cray Supercomputers with Gemini Interconnect
Yanhua Sun, Gengbin Zheng, Laxmikant (Sanjay) Kale
Parallel Programming Lab, University of Illinois at Urbana-Champaign
Ryan Olson, Cray Inc
Terry R. Jones, Oak Ridge National Lab
26th IEEE International Parallel & Distributed Processing Symposium
![Page 4: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/4.jpg)
Motivation
Modern interconnects are complex
Multiple programming models/languages are developed
How to attain good performance for applications in alternative models on different interconnects?
Charm++ programming model on Gemini Interconnect
![Page 5: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/5.jpg)
Outline
Overview of Charm++, Gemini and uGNI
Design of uGNI-based Charm++
Optimizations to improve communication
Micro-benchmark and application results
![Page 6: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/6.jpg)
Charm++ Software Architecture
Charm++ is an object-based over-decomposition programming model
Adaptive intelligent runtime
Dynamic load balancing
Fault tolerance
Scales to 300K cores
Portable
Runs on MPI
![Page 8: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/8.jpg)
Gemini Interconnect
Low latency (700 ns)
High bandwidth (8 GBytes/sec)
Scale to 100,000 nodes
Hardware support for one-sided communication
Fast Memory Access (FMA)
Block Transfer Engine (BTE)
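As a rough illustration of what the latency and bandwidth figures above imply, the sketch below combines them into a simple two-term cost estimate. The model itself (fixed latency plus size over bandwidth), the function names, and reading "GBytes" as 10^9 bytes are assumptions for illustration; the slides do not give such a formula.

```cpp
#include <cassert>
#include <cstddef>

// Figures quoted on the slide. Combining them into a two-term model
// is an assumption, not something the slides state.
constexpr double kLatencySeconds = 700e-9;     // 700 ns
constexpr double kBandwidthBytesPerSec = 8e9;  // 8 GBytes/sec, assuming 1 GB = 1e9 bytes

// Estimated one-way transfer time: fixed latency plus size over bandwidth.
constexpr double estimate_transfer_seconds(std::size_t bytes) {
    return kLatencySeconds + static_cast<double>(bytes) / kBandwidthBytesPerSec;
}

// Bandwidth actually achieved for one message of the given size under this model.
inline double effective_bandwidth(std::size_t bytes) {
    assert(bytes > 0);
    return static_cast<double>(bytes) / estimate_transfer_seconds(bytes);
}
```

Under this model, small messages are dominated by the latency term and large messages approach the peak bandwidth, which is why small and large messages are usually treated differently.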
![Page 9: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/9.jpg)
uGNI
User-level Generic Network Interface
Memory registration/de-registration
Post FMA/BTE transactions
Completion Queues
![Page 10: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/10.jpg)
Design of uGNI-based Charm++
Small messages (less than 1024 bytes)
Sent directly through SMSG with data_tag
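The size-based choice described above can be sketched as follows. Only the 1024-byte cutoff and the SMSG path come from the slide; the names `SendPath`, `choose_send_path`, and `send_path_name` are hypothetical, and `LargeMessage` is a placeholder because this slide does not describe how larger messages travel.

```cpp
#include <cassert>
#include <cstddef>

// Only the 1024-byte cutoff comes from the slide; every name here is hypothetical.
constexpr std::size_t kSmsgCutoffBytes = 1024;

enum class SendPath {
    Smsg,         // sent directly through SMSG with a data tag
    LargeMessage  // placeholder: the slide does not describe this path
};

// Messages strictly smaller than the cutoff take the SMSG path.
constexpr SendPath choose_send_path(std::size_t message_bytes) {
    return message_bytes < kSmsgCutoffBytes ? SendPath::Smsg
                                            : SendPath::LargeMessage;
}

// Human-readable label, handy for logging which path a message took.
inline const char* send_path_name(SendPath path) {
    assert(path == SendPath::Smsg || path == SendPath::LargeMessage);
    return path == SendPath::Smsg ? "SMSG" : "large-message";
}
```

For example, `choose_send_path(512)` selects the SMSG path, while a message of exactly 1024 bytes falls on the other side of the cutoff.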
![Page 11: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/11.jpg)
Baseline Pingpong Performance
![Page 12: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/12.jpg)
Persistent Messages
Communication with fixed pattern
Communication processors
Data size
Re-use memory
Avoid memory allocation
Avoid the first handshake message
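A toy model of the idea above, counting the per-message costs that a persistent channel avoids. This is not the Charm++ persistent-message API: `Stats`, `baseline_send`, and `PersistentChannel` are hypothetical names, and the assumption that setup costs exactly one allocation and one handshake is made only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy cost counters (hypothetical, for illustration only).
struct Stats {
    int allocations = 0;
    int handshakes = 0;
};

// Baseline: every send allocates a receive buffer and pays a handshake first.
inline void baseline_send(const std::vector<char>& msg, Stats& stats) {
    std::vector<char> recv_buf(msg.size());  // fresh allocation per message
    ++stats.allocations;
    ++stats.handshakes;                      // control message before the data
    std::copy(msg.begin(), msg.end(), recv_buf.begin());
}

// Persistent channel: fixed peer and maximum size, so the buffer is set up once.
class PersistentChannel {
public:
    PersistentChannel(std::size_t max_bytes, Stats& stats)
        : recv_buf_(max_bytes), stats_(stats) {
        ++stats_.allocations;  // one-time buffer allocation
        ++stats_.handshakes;   // one-time setup exchange (assumed)
    }

    // Re-uses the pre-allocated buffer: no allocation, no handshake.
    void send(const std::vector<char>& msg) {
        assert(msg.size() <= recv_buf_.size());
        std::copy(msg.begin(), msg.end(), recv_buf_.begin());
    }

    const std::vector<char>& received() const { return recv_buf_; }

private:
    std::vector<char> recv_buf_;
    Stats& stats_;
};
```

Sending the same message ten times costs ten allocations and ten handshakes in the baseline model, but only the single setup allocation and handshake with the persistent channel.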
![Page 13: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/13.jpg)
Persistent Messages
Baseline design to transfer data
Transfer persistent messages
![Page 14: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/14.jpg)
Persistent Messages Performance
![Page 16: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/16.jpg)
Memory Pool
Memory registration/de-registration costs a lot
Charm++ controls all memory allocation/de-allocation
Pre-allocate/register big chunks of memory
Allocation/de-allocation is from the memory pool
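A minimal sketch of the memory pool idea above: one large chunk is allocated and registered once, and later allocations are served from it without touching the network interface again. `FakeNic` is a stand-in that only counts registration calls, and `MemoryPool` with fixed-size blocks is a simplification chosen for illustration, not the actual Charm++ implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the costly NIC registration; it only counts calls (hypothetical).
struct FakeNic {
    int registrations = 0;
    void register_memory(void*, std::size_t) { ++registrations; }
};

// Fixed-size-block pool carved out of one pre-registered chunk.
class MemoryPool {
public:
    MemoryPool(FakeNic& nic, std::size_t block_bytes, std::size_t block_count)
        : block_bytes_(block_bytes), chunk_(block_bytes * block_count) {
        assert(block_bytes > 0 && block_count > 0);
        nic.register_memory(chunk_.data(), chunk_.size());  // once, for the whole chunk
        for (std::size_t i = 0; i < block_count; ++i)
            free_.push_back(chunk_.data() + i * block_bytes_);
    }

    // Hands out an already-registered block, or nullptr if the pool is exhausted.
    char* allocate() {
        if (free_.empty()) return nullptr;
        char* block = free_.back();
        free_.pop_back();
        return block;
    }

    // Returns a block to the pool; no de-registration is needed.
    void release(char* block) {
        assert(block >= chunk_.data() && block < chunk_.data() + chunk_.size());
        free_.push_back(block);
    }

    std::size_t free_blocks() const { return free_.size(); }

private:
    std::size_t block_bytes_;
    std::vector<char> chunk_;
    std::vector<char*> free_;
};
```

However many blocks are allocated and released, the registration count stays at one, which is the cost the pool is meant to amortize.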
![Page 17: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/17.jpg)
Performance of Memory Pool
![Page 18: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/18.jpg)
Performance – Message Latency
![Page 19: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/19.jpg)
Performance - Bandwidth
![Page 20: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/20.jpg)
NQueens (fine-grained)
![Page 21: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/21.jpg)
NAMD 100M-atom on Titan
Chart annotations: 32%, 70% efficiency, 17%
![Page 22: Yanhua Sun , Gengbin Zheng , Laximant(Sanjay ) Kale Parallel Programming Lab](https://reader030.fdocuments.us/reader030/viewer/2022032805/5681331d550346895d99e640/html5/thumbnails/22.jpg)
Conclusion
Gemini Interconnect, Charm++
Optimizations
Persistent messages
Memory pool
Micro-benchmark and application results
http://charm.cs.uiuc.edu/software