Charm++ Motivations and Basic Ideas
Laxmikant (Sanjay) Kale http://charm.cs.illinois.edu
Parallel Programming Laboratory Department of Computer Science
University of Illinois at Urbana Champaign
8/6/15 ATPESC 1
Challenges in Parallel Programming
• Applications are getting more sophisticated
  – Adaptive refinements
  – Multi-scale, multi-module, multi-physics
  – E.g., load imbalance emerges as a huge problem for some apps
    • Exacerbated by strong scaling needs from apps
• Future challenge: hardware variability
  – Static/dynamic
  – Heterogeneity: processor types, process variation, ...
  – Power/Temperature/Energy
  – Component failure
• To deal with these, we must seek
  – Not full automation
  – Not full burden on app developers
  – But: a good division of labor between the system and app developers
What is Charm++?
• Charm++ is a generalized approach to writing parallel programs
  – An alternative to the likes of MPI, UPC, GA, etc.
  – But not to sequential languages such as C, C++, and Fortran
• Represents:
  – The style of writing parallel programs
  – The runtime system
  – And the entire ecosystem that surrounds it
• Three design principles:
  – Overdecomposition, Migratability, Asynchrony
Overdecomposition
• Decompose the work units & data units into many more pieces than execution units
  – Cores/Nodes/...
• Not so hard: we do decomposition anyway
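The idea can be sketched in plain C++ (this is not Charm++ code). The work is cut into many more chunks than there are execution units, and a mapping function decides which unit hosts each chunk. The names `blockMap` and `chunksPerUnit` are hypothetical, and block mapping is only one assumed placement policy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map chunk `chunk` (0-based) of `numChunks` onto one of `numUnits`
// execution units, keeping neighbouring chunks together (block mapping).
// Hypothetical helper for illustration; not part of Charm++.
std::size_t blockMap(std::size_t chunk, std::size_t numChunks, std::size_t numUnits) {
    assert(numUnits > 0 && chunk < numChunks);
    return chunk * numUnits / numChunks;
}

// Count how many chunks each execution unit receives under blockMap.
std::vector<std::size_t> chunksPerUnit(std::size_t numChunks, std::size_t numUnits) {
    std::vector<std::size_t> counts(numUnits, 0);
    for (std::size_t c = 0; c < numChunks; ++c) {
        ++counts[blockMap(c, numChunks, numUnits)];
    }
    return counts;
}
```

With 64 chunks, calling `chunksPerUnit(64, 4)` and `chunksPerUnit(64, 8)` uses the same decomposition for two machine sizes; only the mapping changes.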
Migratability
• Allow these work and data units to be migratable at runtime
  – i.e., the programmer or the runtime can move them
• Consequences for the app developer
  – Communication must now be addressed to logical units with global names, not to physical processors
  – But this is a good thing
• Consequences for the RTS
  – Must keep track of where each unit is
  – Naming and location management
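A toy location table in plain C++ illustrates what "naming and location management" asks of a runtime. Senders only ever use the logical name, and a migration changes the table rather than the sender's code. The class `LocationManager` and its methods are hypothetical and far simpler than a real distributed implementation.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy location manager: maps a unit's global logical name to the
// processor that currently hosts it. Hypothetical sketch, not the Charm++ RTS.
class LocationManager {
public:
    // Record that `name` lives on processor `pe` (initial placement).
    void place(const std::string& name, int pe) { home_[name] = pe; }

    // Move an existing unit to another processor; senders keep using the name.
    void migrate(const std::string& name, int newPe) {
        assert(home_.count(name) == 1);
        home_[name] = newPe;
    }

    // Resolve a logical name to the current processor, or -1 if unknown.
    int lookup(const std::string& name) const {
        auto it = home_.find(name);
        return it == home_.end() ? -1 : it->second;
    }

private:
    std::map<std::string, int> home_;
};
```

A single in-memory map is an assumption made for brevity; the slides do not describe how the RTS stores or distributes this information.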
Asynchrony: Message-Driven Execution
• Now:
  – You have multiple units on each processor
  – They address each other via logical names
• Need for scheduling:
  – What sequence should the work units execute in?
  – One answer: let the programmer sequence them
    • Seen in current codes, e.g. some AMR frameworks
  – Message-driven execution:
    • Let the work unit that happens to have data (a "message") available for it execute next
    • Let the RTS select among ready work units
    • Programmer should not specify what executes next, but can influence it via priorities
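The scheduling rule above can be sketched as a small per-processor loop in plain C++. Whichever message is ready gets picked, its handler runs to completion, and a priority only influences the choice. `Message` and `Scheduler` are hypothetical names, and the convention that a smaller priority value runs first is an assumption of this sketch.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// A message: the work it triggers plus a priority hint (smaller runs first).
struct Message {
    int priority;
    std::string target;            // logical name of the receiving work unit
    std::function<void()> handler; // code to run when the message is picked
};

// Toy per-processor scheduler (hypothetical; not the Charm++ scheduler).
class Scheduler {
public:
    void enqueue(Message m) { ready_.push(std::move(m)); }

    // Repeatedly pick a ready message and run its handler to completion.
    // Returns the targets in the order they were executed.
    std::vector<std::string> run() {
        std::vector<std::string> order;
        while (!ready_.empty()) {
            Message m = ready_.top();
            ready_.pop();
            order.push_back(m.target);
            m.handler(); // a handler may enqueue further messages
        }
        return order;
    }

private:
    // Orders the heap so the smallest priority value is on top.
    struct ByPriority {
        bool operator()(const Message& a, const Message& b) const {
            return a.priority > b.priority;
        }
    };
    std::priority_queue<Message, std::vector<Message>, ByPriority> ready_;
};
```

The caller never states which unit runs next; it only enqueues messages. Messages with equal priority may run in either order in this sketch.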
Realization of this model in Charm++
• Overdecomposed entities: chares
  – Chares are C++ objects
  – With methods designated as "entry" methods
    • Which can be invoked asynchronously by remote chares
  – Chares are organized into indexed collections
    • Each collection may have its own indexing scheme
      – 1D, ..., 7D
      – Sparse
      – Bitvector or string as an index
  – Chares communicate via asynchronous method invocations
    • A[i].foo(...); A is the name of a collection, i is the index of the particular chare.
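To make "asynchronous method invocation on an indexed collection" concrete, here is a plain C++ sketch in which a call on element `i` is only queued and takes effect later, when something delivers it. `Counter`, `CounterArray`, `sendAdd`, and `deliverAll` are hypothetical names; real Charm++ proxies are generated from the interface file and are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// A trivial chare-like object with one method treated as an entry method.
struct Counter {
    int value = 0;
    void add(int x) { value += x; }
};

// Toy 1D indexed collection whose invocations are queued, not run inline.
// Hypothetical stand-in for a proxy; not the Charm++ API.
class CounterArray {
public:
    explicit CounterArray(std::size_t n) : elems_(n) {}

    // Asynchronous-style call: record the request and return immediately.
    void sendAdd(std::size_t i, int x) {
        assert(i < elems_.size());
        pending_.push_back([this, i, x] { elems_[i].add(x); });
    }

    // Deliver every queued invocation (the role a scheduler would play).
    void deliverAll() {
        while (!pending_.empty()) {
            std::function<void()> call = std::move(pending_.front());
            pending_.pop_front();
            call();
        }
    }

    int valueAt(std::size_t i) const { return elems_.at(i).value; }

private:
    std::vector<Counter> elems_;
    std::deque<std::function<void()>> pending_;
};
```

After `sendAdd(2, 5)` the element is still unchanged; the update becomes visible only once `deliverAll()` has run, which mirrors the caller not waiting for the callee.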
Overdecomposed Objects
[Figure: objects labeled A–H and 0–9; label: Parallel Address Space]
Message-driven
[Figure: objects labeled A–H; label: Parallel Address Space; invocations shown: E.m1(), G.m2(), H.m2(), E.m3(), F.m4(), B.m2()]
• Certain member functions of certain classes are globally visible
• Invocation of a member function may lead to communication
Message-driven Execution
[Figure: Processor 0 and Processor 1, each with a Scheduler and a Message Queue; invocation shown: A[..].foo(...)]
[Figure: Processors 0, 1, 2, and 3, each with a Scheduler and a Message Queue]
Empowering the RTS
• The Adaptive RTS can:
  – Dynamically balance loads
  – Optimize communication:
    • Spread over time, async collectives
  – Automatic latency tolerance
  – Prefetch data with almost perfect predictability
[Figure labels: Asynchrony, Overdecomposition, Migratability; Adaptive Runtime System; Introspection, Adaptivity]
Benefits in Charm++
[Figure labels: Over-decomposition; Migratability; Message-driven execution; Introspective and adaptive runtime system; Scalable Tools; Automatic overlap of Communication and Computation; Emulation for Performance Prediction; Fault Tolerance; Dynamic load balancing (topology-aware, scalable); Temperature/Power/Energy Optimizations; Perfect prefetch; Compositionality]
Utility for Multi-cores, Many-cores, Accelerators:
• Objects connote and promote locality
• Message-driven execution
  – A strong principle of prediction for data and code use
  – Much stronger than principle of locality
    • Can use to scale memory wall:
    • Prefetching of needed data:
      – into scratch pad memories, for example
[Figure: Processor 1 with a Scheduler and a Message Queue]
Impact on communication
• Current use of communication network:
  – Compute-communicate cycles in typical MPI apps
  – So, the network is used for a fraction of time,
  – and is on the critical path
• So, current communication networks are over-engineered by necessity
[Figure: timelines for P1 and P2 in a BSP-based application]
Impact on communication
• With overdecomposition
  – Communication is spread over an iteration
  – Also, adaptive overlap of communication and computation
[Figure: timelines for P1 and P2; caption: Overdecomposition enables overlap]
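A back-of-the-envelope model, written as plain C++, shows why spreading communication over the iteration can help. It assumes (this is not from the slides) that a processor's work is split into `k` equal objects and that each object's communication can proceed while the next object computes. The function name `iterationTime` and the pipeline model itself are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Idealized time for one iteration on one processor.
//   compute: total computation time, comm: total communication time,
//   k: number of objects the work is split into (k = 1 means no overlap).
// Assumed pipeline: object j's communication overlaps object j+1's compute.
double iterationTime(double compute, double comm, int k) {
    assert(k >= 1);
    const double c = compute / k; // per-object compute
    const double m = comm / k;    // per-object communication
    return c + (k - 1) * std::max(c, m) + m;
}
```

With `k = 1` the model reduces to compute followed by communicate, as in the compute-communicate cycle. For larger `k` only the first compute slice and the last communication slice remain exposed, under the stated assumptions.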
Decomposition Challenges
• Current method is to decompose to processors
  – But this has many problems
  – Deciding which processor does what work in detail is difficult at large scale
• Decomposition should be independent of number of processors
  – enabled by object-based decomposition
• Adaptive scheduling of the objects on available resources by the RTS
Decomposition Independent of numCores
• Rocket simulation example under traditional MPI
[Figure: processors 1, 2, ..., P, each shown with one Solid and one Fluid partition]
• With migratable objects:
[Figure: objects Solid1, Solid2, Solid3, ..., Solidn and Fluid1, Fluid2, ..., Fluidm]
  – Benefit: load balance, communication optimizations, modularity
Compositionality
• It is important to support parallel composition
  – For multi-module, multi-physics, multi-paradigm applications...
• What I mean by parallel composition
  – B || C where B, C are independently developed modules
  – B is a parallel module by itself, and so is C
  – Programmers who wrote B were unaware of C
  – No dependency between B and C
• This is not supported well by MPI
  – Developers support it by breaking abstraction boundaries
    • E.g., wildcard recvs in module A to process messages for module B
  – Nor by OpenMP implementations:
Without message-driven execution (and virtualization), you get either: Space-division
[Figure: modules B and C plotted against Time]
OR: Sequentialization
[Figure: modules B and C plotted against Time]
Parallel Composition: A1; (B || C); A2
Recall: Different modules, written in different languages/paradigms, can overlap in time and on processors, without the programmer having to worry about this explicitly
So, What is Charm++?
• Charm++ is a way of parallel programming based on
  – Objects
  – Overdecomposition
  – Message
  – Asynchrony
  – Migratability
  – Runtime system
• Charm++ Basics
• Structured Dagger Notation
• Designing Charm++ programs, with application case studies
Hello World Example

hello.ci file

mainmodule hello {
  mainchare Main {
    entry Main(CkArgMsg *m);
  };
};

hello.cpp file

#include <stdio.h>
#include "hello.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg* m) {
    ckout << "Hello World!" << endl;
    CkExit();
  };
};

#include "hello.def.h"
PPL (UIUC) Parallel Migratable Objects 2 / 71
Hello World with Chares

hello.ci file

mainmodule hello {
  mainchare Main {
    entry Main(CkArgMsg *m);
  };
  chare Singleton {
    entry Singleton();
  };
};

hello.cpp file

#include <stdio.h>
#include "hello.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg* m) {
    CProxy_Singleton::ckNew();
  };
};

class Singleton : public CBase_Singleton {
public:
  Singleton() {
    ckout << "Hello World!" << endl;
    CkExit();
  };
};

#include "hello.def.h"
Compiling a Charm++ Program
Building Charm++
git clone http://charm.cs.uiuc.edu/gerrit/charm
./build <TARGET> <ARCH> <OPTS>
TARGET = Charm++, AMPI, bgampi, LIBS etc.
ARCH = net-linux-x86_64, multicore-darwin-x86_64, pamilrts-bluegeneq etc.
OPTS = --with-production, --enable-tracing, xlc, smp, -j8 etc.
http://charm.cs.illinois.edu/manuals/html/charm++/A.html
Hello World Example
Compiling
  – charmc hello.ci
  – charmc -c hello.C
  – charmc -o hello hello.o
Running
  – ./charmrun +p7 ./hello
  – The +p7 tells the system to use seven cores
Charm++ File structure
C++ objects (including Charm++ objects)
  – Defined in regular .h and .C files
Chare objects, entry methods (asynchronous methods)
  – Defined in .ci file
  – Implemented in the .C file
Charm Interface: Modules
Charm++ programs are organized as a collection of modules
Each module has one or more chares
The module that contains the mainchare is declared as the mainmodule
Each module, when compiled, generates two files: MyModule.decl.h and MyModule.def.h
.ci file
[main]module MyModule {
  //... chare definitions ...
};
Charm Interface: Chares
Chares are parallel objects that are managed by the RTS
Each chare has a set of entry methods, which are asynchronous methods that may be invoked remotely
The following code, when compiled, generates a C++ class CBase_MyChare that encapsulates the RTS object
This generated class is extended and implemented in the .C file
.ci file
[main]chare MyChare {
  //... entry method definitions ...
};
.C file
class MyChare : public CBase_MyChare {
  //... entry method implementations ...
};
Charm Interface: Entry Methods
Entry methods are C++ methods that can be remotely and asynchronously invoked by another chare
.ci file:
entry MyChare(); /* constructor entry method */
entry void foo();
entry void bar(int param);
.C file:
MyChare::MyChare() { /*... constructor code ...*/ }
void MyChare::foo() { /*... code to execute ...*/ }
void MyChare::bar(int param) { /*... code to execute ...*/ }
Charm Interface: mainchare
Execution begins with the mainchare’s constructor
The mainchare’s constructor takes a pointer to the system-defined class CkArgMsg
CkArgMsg contains argv and argc
The mainchare typically creates some additional chares
Creating a Chare
A chare declared as chare MyChare {...}; can be instantiated by the following call:
CProxy_MyChare::ckNew(... constructor arguments ...);
To communicate with this chare in the future, a proxy to it must be retained
CProxy_MyChare proxy = CProxy_MyChare::ckNew(... constructor arguments ...);
Chare Proxies
A chare’s own proxy can be obtained through a special variable thisProxy
Chare proxies can also be passed so chares can learn about others
In this snippet, MyChare learns about a chare instance main, and then invokes a method on it:
.ci file
entry void foobar2(CProxy_Main main);
.C file
void MyChare::foobar2(CProxy_Main main) {
  main.foo();
}
Charm Termination
There is a special system call CkExit() that terminates the parallel execution on all processors (but it is called on one processor) and performs the requisite cleanup
The traditional exit() is insufficient because it only terminates one process, not the entire parallel job (and will cause a hang)
CkExit() should be called when you can safely terminate the application (you may want to synchronize before calling this)
Chare Creation Example: .ci file
mainmodule MyModule {
  mainchare Main {
    entry Main(CkArgMsg *m);
  };

  chare Simple {
    entry Simple(int x, double y);
  };
};
Chare Creation Example: .C file
#include <stdio.h>
#include "MyModule.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg* m) {
    ckout << "Hello World!" << endl;
    if (m->argc > 1) ckout << " Hello " << m->argv[1] << "!!!" << endl;
    double pi = 3.1415;
    CProxy_Simple::ckNew(12, pi);
  };
};

class Simple : public CBase_Simple {
public:
  Simple(int x, double y) {
    ckout << "Hello from a simple chare running on " << CkMyPe() << endl;
    ckout << "Area of a circle of radius" << x << " is " << y*x*x << endl;
    CkExit();
  }
};

#include "MyModule.def.h"
Asynchronous Methods
Entry methods are invoked by performing a C++ method call on a chare’s proxy
CProxy_MyChare proxy = CProxy_MyChare::ckNew(... constructor arguments ...);
proxy.foo();
proxy.bar(5);
The foo and bar methods will then be executed with the arguments, wherever the created chare, MyChare, happens to live
The policy is one-at-a-time scheduling (that is, one entry method on one chare executes on a processor at a time)
Asynchronous Methods
Method invocation is not ordered (between chares, entry methods on one chare, etc.)!
For example, if a chare executes this code:
CProxy_MyChare proxy = CProxy_MyChare::ckNew();
proxy.foo();
proxy.bar(5);
These prints may occur in any order
void MyChare::foo() {
  ckout << "foo executes" << endl;
}
void MyChare::bar(int param) {
  ckout << "bar executes with " << param << endl;
}
Asynchronous Methods
For example, if a chare invokes the same entry method twice:
proxy.bar(7);
proxy.bar(5);
These may be delivered in any order
void MyChare::bar(int param) {
  ckout << "bar executes with " << param << endl;
}
Output
bar executes with 5
bar executes with 7
OR
bar executes with 7
bar executes with 5
Asynchronous Example: .ci file
mainmodule MyModule {
  mainchare Main {
    entry Main(CkArgMsg *m);
  };
  chare Simple {
    entry Simple(double y);
    entry void findArea(int radius, bool done);
  };
};
Asynchronous Example: .C file
Does this program execute correctly?
struct Main : public CBase_Main {
  Main(CkArgMsg* m) {
    double pi = 3.1415;
    CProxy_Simple sim = CProxy_Simple::ckNew(pi);
    for (int i = 1; i < 10; i++) sim.findArea(i, false);
    sim.findArea(10, true);
  };
};

struct Simple : public CBase_Simple {
  float y;
  Simple(double pi) {
    y = pi;
    ckout << "Hello from a simple chare running on " << CkMyPe() << endl;
  }
  void findArea(int r, bool done) {
    ckout << "Area of a circle of radius" << r << " is " << y*r*r << endl;
    if (done) CkExit();
  }
};
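Since entry method invocations may be delivered in any order, the call carrying `done == true` could arrive before some of the other nine, and the job would then exit early. One possible remedy, which is an assumption here and not something the slides prescribe, is to count handled requests and finish only after the expected number. The plain C++ sketch below simulates that with a shuffled delivery order; `AreaWorker` and `deliver` are hypothetical names and no Charm++ calls are made.

```cpp
#include <cassert>
#include <vector>

// Stand-in for the Simple chare: it finishes only after `expected` requests,
// regardless of the order in which they arrive.
class AreaWorker {
public:
    AreaWorker(double pi, int expected) : pi_(pi), expected_(expected) {}

    // Handle one request. Returns true when this was the final one,
    // which is where a real program could call CkExit().
    bool findArea(int radius) {
        areas_.push_back(pi_ * radius * radius);
        ++seen_;
        return seen_ == expected_;
    }

    int seen() const { return seen_; }
    const std::vector<double>& areas() const { return areas_; }

private:
    double pi_;
    int expected_;
    int seen_ = 0;
    std::vector<double> areas_;
};

// Deliver requests in the given (possibly shuffled) order and report
// how many were processed before the worker signalled completion.
int deliver(AreaWorker& w, const std::vector<int>& radii) {
    for (int r : radii) {
        if (w.findArea(r)) break;
    }
    return w.seen();
}
```

Even when radius 10 is delivered first, all ten requests are handled before completion is signalled, because the exit condition depends on the count and not on which message happened to arrive last.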
Data types and entry methods
You can pass basic C++ types to entry methods (int, char, bool, etc.)
C++ STL data structures can be passed by including pup_stl.h
Arrays of basic data types can also be passed like this:
.ci file:
entry void foobar(int length, int data[length]);
.C file:
void MyChare::foobar(int length, int* data) {
  // ... foobar code ...
}
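The `int data[length]` form tells the system how many elements travel with the call. The plain C++ sketch below illustrates the general idea of copying a length-prefixed array into a flat buffer and reading it back on the receiving side. The buffer layout and the names `packIntArray` and `unpackIntArray` are assumptions for illustration; the slides do not describe how Charm++ lays out its messages.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copy (length, data[0..length)) into a flat byte buffer,
// roughly as a message carrying the arguments might.
std::vector<char> packIntArray(int length, const int* data) {
    assert(length >= 0);
    const std::size_t n = static_cast<std::size_t>(length);
    std::vector<char> buf(sizeof(int) + n * sizeof(int));
    std::memcpy(buf.data(), &length, sizeof(int));
    if (n > 0) std::memcpy(buf.data() + sizeof(int), data, n * sizeof(int));
    return buf;
}

// Rebuild the array on the receiving side from the flat buffer.
std::vector<int> unpackIntArray(const std::vector<char>& buf) {
    assert(buf.size() >= sizeof(int));
    int length = 0;
    std::memcpy(&length, buf.data(), sizeof(int));
    const std::size_t n = static_cast<std::size_t>(length);
    assert(buf.size() == sizeof(int) + n * sizeof(int));
    std::vector<int> out(n);
    if (n > 0) std::memcpy(out.data(), buf.data() + sizeof(int), n * sizeof(int));
    return out;
}
```

Because the elements are copied at pack time, later changes to the sender's array do not affect what the receiver unpacks in this sketch.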
Collections of Objects: Concepts
Objects can be grouped into indexed collections
Basic examples
  – Matrix block
  – Chunk of unstructured mesh
  – Portion of distributed data structure
  – Volume of simulation space
Advanced examples
  – Abstract portions of computation
  – Interactions among basic objects or underlying entities
Collections of Objects
Structured: 1D, 2D, . . . , 6D
Unstructured: Anything hashable
Dense
Sparse
Static - all created at once
Dynamic - elements come and go
Chare Array: Hello Example
  mainmodule arr {
    mainchare Main {
      entry Main(CkArgMsg*);
    }
    array [1D] hello {
      entry hello(int);
      entry void printHello();
    }
  }
![Page 54: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/54.jpg)
Chare Array: Hello Example
  #include "arr.decl.h"

  struct Main : CBase_Main {
    Main(CkArgMsg* msg) {
      int arraySize = atoi(msg->argv[1]);
      CProxy_hello p = CProxy_hello::ckNew(arraySize, arraySize);
      p[0].printHello();
    }
  };

  struct hello : CBase_hello {
    hello(int n) : arraySize(n) { }
    hello(CkMigrateMessage*) { }
    void printHello() {
      CkPrintf("PE[%d]: hello from p[%d]\n", CkMyPe(), thisIndex);
      if (thisIndex == arraySize - 1) CkExit();
      else thisProxy[thisIndex + 1].printHello();
    }
  private:
    int arraySize;
  };

  #include "arr.def.h"
![Page 55: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/55.jpg)
Hello World Array Projections Timeline View
Add -tracemode projections to link line to enable tracing
Run Projections tool to load trace log files and visualize performance
arrayHello on BG/Q 16 Nodes, mode c16, 1024 elements (4 per process)
![Page 56: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/56.jpg)
Declaring a Chare Array
.ci file:
  array [1d] foo {
    entry foo(); // constructor
    // ... entry methods ...
  }
  array [2d] bar {
    entry bar(); // constructor
    // ... entry methods ...
  }
.C file:
  struct foo : public CBase_foo {
    foo() { }
    foo(CkMigrateMessage*) { }
    // ... entry methods ...
  };
  struct bar : public CBase_bar {
    bar() { }
    bar(CkMigrateMessage*) { }
    // ... entry methods ...
  };
![Page 57: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/57.jpg)
Constructing a Chare Array
Constructed much like a regular chare
The size of each dimension is passed to the constructor
  void someMethod() {
    CProxy_foo::ckNew(10);
    CProxy_bar::ckNew(5, 5);
  }
The proxy may be retained:
  CProxy_foo myFoo = CProxy_foo::ckNew(10);
The proxy represents the entire array, and may be indexed to obtain a proxy to an individual element in the array:
  myFoo[4].invokeEntry();
![Page 58: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/58.jpg)
thisIndex
1d: thisIndex is the index of the current chare array element
2d: thisIndex.x and thisIndex.y are the indices of the current chare array element
.ci file:
  array [1d] foo {
    entry foo();
  }
.C file:
  struct foo : public CBase_foo {
    foo() {
      CkPrintf("array index = %d", thisIndex);
    }
  };
![Page 59: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/59.jpg)
Collections of Objects: Runtime Service
System knows how to ‘find’ objects efficiently: (collection, index) → processor
Applications can specify a mapping, or use simple runtime-provided options (e.g. blocked, round-robin)
Distribution can be static, or dynamic!
Key abstraction: application logic doesn’t change, even though performance might
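The two runtime-provided options named on this slide can be illustrated with a stand-alone C++ sketch. This is not the Charm++ mapping interface: blockMap and roundRobinMap are hypothetical helper names, and the blocked variant assumes ceiling-sized blocks.

```cpp
#include <cassert>

// Hypothetical helpers (not part of Charm++): map a 1D element index
// to a processor number in [0, numPes).

// Blocked: consecutive indices share a processor. The block size is
// assumed to be ceil(numElements / numPes).
int blockMap(int index, int numElements, int numPes) {
    int blockSize = (numElements + numPes - 1) / numPes;
    return index / blockSize;
}

// Round-robin: consecutive indices land on consecutive processors.
int roundRobinMap(int index, int numPes) {
    return index % numPes;
}
```

Code that addresses element i looks the same under either function; only the placement, and therefore possibly the performance, differs.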
![Page 60: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/60.jpg)
Collections of Objects: Runtime Service
Can develop and test logic in objects separately from their distribution
Separation in time: make it work, then make it fast
Division of labor: domain specialist writes object code, computationalist writes mapping
Portability: different mappings for different systems, scales, or configurations
Shared progress: improved mapping techniques can benefit existing code
![Page 61: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/61.jpg)
Collections of Objects
[Figure: elements of three collections, A[0], A[1], A[2]; B[0], B[3]; and C[0,0], C[0,2], C[1,0], C[1,2], C[1,4], shown first as logical collections and then distributed across Processors 1 to 4, with a Location Manager and a Scheduler on the processors]
![Page 62: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/62.jpg)
Collective Communication Operations
Point-to-point operations involve only two objects
Collective operations involve a collection of objects
Broadcast: calls a method in each object of the array
Reduction: collects a contribution from each object of the array
A spanning tree is used to send/receive data
[Figure: spanning tree with root A, children B and C, and leaves D, E, F, G]
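How a reduction can flow up a binary tree like the one in the figure (A at the root, B and C below it) is sketched here in plain C++. The level-order numbering, in which node i has children 2i+1 and 2i+2, is an assumption made for this sketch; the slides do not say what shape or branching factor the runtime actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Parent of a node under level-order numbering (assumed for this
// sketch); the root, node 0, has no parent and reports -1.
int parentOf(int node) {
    return node == 0 ? -1 : (node - 1) / 2;
}

// Sum the contributions of nodes 0..n-1 up the tree: each node adds
// its own value to the partial sums received from its two children.
long treeSum(const std::vector<long>& contrib, std::size_t node = 0) {
    if (node >= contrib.size()) return 0;
    return contrib[node]
         + treeSum(contrib, 2 * node + 1)
         + treeSum(contrib, 2 * node + 2);
}
```

With A..G numbered 0..6, D and E report to B, F and G report to C, and only B and C send anything to the root.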
![Page 63: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/63.jpg)
Broadcast
A message to each object in a collection
The chare array proxy object is used to perform a broadcast
It looks like a function call to the proxy object
From the main chare:
  CProxy_Hello helloArray = CProxy_Hello::ckNew(helloArraySize);
  helloArray.foo();
From a chare array element that is a member of the same array:
  thisProxy.foo()
From any chare that has a proxy p to the chare array:
  p.foo()
![Page 64: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/64.jpg)
Reduction
Combines a set of values: sum, max, aggregate
Usually reduces the set of values to a single value
Combination of values requires an operator
The operator must be commutative and associative
Each object calls contribute in a reduction
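Why the operator has to be commutative and associative can be seen by combining the same values in two different groupings. The sketch below is plain C++, not Charm++ code; foldLeft and foldTree are hypothetical names. Addition gives the same answer either way, while subtraction does not, so subtraction would be unsafe as a reduction operator when the combining order is up to the runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Op = std::function<int(int, int)>;

// Combine the values strictly left to right. Requires a non-empty vector.
int foldLeft(const std::vector<int>& v, const Op& op) {
    int acc = v[0];
    for (std::size_t i = 1; i < v.size(); ++i) acc = op(acc, v[i]);
    return acc;
}

// Combine the values in [lo, hi) pairwise, the way a tree of partial
// results would. Requires lo < hi.
int foldTree(const std::vector<int>& v, std::size_t lo, std::size_t hi,
             const Op& op) {
    if (hi - lo == 1) return v[lo];
    std::size_t mid = lo + (hi - lo) / 2;
    return op(foldTree(v, lo, mid, op), foldTree(v, mid, hi, op));
}
```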
![Page 65: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/65.jpg)
Reduction: Example
  mainmodule reduction {
    mainchare Main {
      entry Main(CkArgMsg* msg);
      entry [reductiontarget] void done(int value);
    };
    array [1D] Elem {
      entry Elem(CProxy_Main mProxy);
    };
  }
![Page 66: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/66.jpg)
Reduction: Example
  #include "reduction.decl.h"

  const int numElements = 49;

  class Main : public CBase_Main {
  public:
    Main(CkArgMsg* msg) { CProxy_Elem::ckNew(thisProxy, numElements); }
    void done(int value) {
      CkAssert(value == numElements * (numElements - 1) / 2);
      CkPrintf("value: %d\n", value);
      CkExit();
    }
  };

  class Elem : public CBase_Elem {
  public:
    Elem(CProxy_Main mProxy) {
      int val = thisIndex;
      CkCallback cb(CkReductionTarget(Main, done), mProxy);
      contribute(sizeof(int), &val, CkReduction::sum_int, cb);
    }
    Elem(CkMigrateMessage*) { }
  };

  #include "reduction.def.h"

Output:
  value: 1176
  Program finished.
![Page 67: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/67.jpg)
Chares are reactive
• The way we described Charm++ so far, a chare is a reactive entity:
  – If it gets this method invocation, it does this action
  – If it gets that method invocation, then it does that action
  – But what does it do?
  – In typical programs, chares have a life-cycle
• How to express the life-cycle of a chare in code?
  – Only when it exists
    * i.e. some chares may be truly reactive, and the programmer does not know the life cycle
  – But when it exists, its form is:
    * Computations depend on remote method invocations, and completion of other local computations
    * A DAG (Directed Acyclic Graph)!
![Page 68: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/68.jpg)
Structured Dagger (sdag) The when construct
• sdag code is written in the .ci file
• It is like a script, with a simple language
• Important: the when construct
  – Declares the actions to perform when a method invocation is received
  – In sequence, it acts like a blocking receive

  entry void someMethod() {
    when entryMethod1(parameters) { block1 }
    when entryMethod2(parameters) { block2 }
    block3
  };

Implicit sequencing: the constructs execute in the order written.
![Page 69: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/69.jpg)
Structured Dagger The serial construct
• The serial construct
  – A sequential block of C++ code in the .ci file
  – The keyword serial means that the code block will be executed without interruption/preemption
  – Syntax: serial <optionalString> { /* C++ code */ }
  – The <optionalString> is just a tag for performance analysis
  – Serial blocks can access all members of the class they belong to

  entry void method1(parameters) {
    when E(a) serial {
      thisProxy.invokeMethod(10, a);
      callSomeFunction();
    }
    …
  };

  entry void method2(parameters) {
    …
    serial "setValue" {
      value = 10;
    }
  };
![Page 70: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/70.jpg)
Structured Dagger The when construct
• Sequentially execute:
  1. /* block1 */
  2. Wait for entryMethod1 to arrive; if it has not, return control to the Charm++ scheduler; otherwise, execute /* block2 */
  3. Wait for entryMethod2 to arrive; if it has not, return control to the Charm++ scheduler; otherwise, execute /* block3 */

  entry void someMethod() {
    serial { /* block1 */ }
    when entryMethod1(parameters) serial { /* block2 */ }
    when entryMethod2(parameters) serial { /* block3 */ }
  };
![Page 71: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/71.jpg)
Structured Dagger The when construct
• You can combine waiting for multiple method invocations
• Execute “code-block” when M1 and M2 arrive
• You have access to param1, param2, param3 in the code-block

  when M1(int param1, int param2), M2(bool param3) { code block }
![Page 72: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/72.jpg)
Structured Dagger Boilerplate
• Structured Dagger can be used in any entry method (except for a constructor)
• For any class that has Structured Dagger in it you must insert:
  – The Structured Dagger macro: [ClassName]_SDAG_CODE
![Page 73: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/73.jpg)
Structured Dagger Boilerplate
The .ci file:
  [mainchare,chare,array,..] MyFoo {
    …
    entry void method(parameters) {
      // … structured dagger code here …
    };
    …
  }
The .cpp file:
  class MyFoo : public CBase_MyFoo {
    MyFoo_SDAG_CODE /* insert SDAG macro */
  public:
    MyFoo() { }
  };
![Page 74: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/74.jpg)
Structured Dagger The when construct: refnum
• The when clause can wait on a certain reference number
• If a reference number is specified for a when, the first parameter for the when must be the reference number
• Semantics: the when will “block” until a message arrives with that reference number

  when method1[100](int ref, bool param1) /* sdag block */
  …
  serial {
    proxy.method1(200, false); /* will not be delivered to the when */
    proxy.method1(100, true); /* will be delivered to the when */
  }
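The matching rule can be modeled in plain C++ as a buffer of pending messages from which a waiting when takes only the one carrying its reference number. This is a toy model for illustration, not how the SDAG runtime is implemented; Msg and takeMatching are hypothetical names.

```cpp
#include <cassert>
#include <deque>

// A buffered invocation of method1(int ref, bool param1).
struct Msg {
    int ref;
    bool param1;
};

// Remove the first buffered message whose reference number equals
// `wanted` and copy it into `out`. Messages with other reference
// numbers stay buffered. Returns false when nothing matches, which
// corresponds to the when continuing to wait.
bool takeMatching(std::deque<Msg>& buffer, int wanted, Msg& out) {
    for (auto it = buffer.begin(); it != buffer.end(); ++it) {
        if (it->ref == wanted) {
            out = *it;
            buffer.erase(it);
            return true;
        }
    }
    return false;
}
```

In the slide's example, the message sent with 200 remains in the buffer while the one sent with 100 is delivered to the when.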
![Page 75: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/75.jpg)
Structured Dagger The if-then-else construct
• The if-then-else construct:
  – Same as the typical C if-then-else semantics and syntax

  if (thisIndex.x == 10) {
    when method1[block](int ref, bool someVal) /* code block1 */
  } else {
    when method2(int payload) serial {
      //... some C++ code
    }
  }
![Page 76: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/76.jpg)
Structured Dagger The for construct
• The for construct:
  – Defines a sequenced for loop (like a sequential C for loop)
  – Once the body for the ith iteration completes, the (i+1)th iteration is started

  for (iter = 0; iter < maxIter; ++iter) {
    when recvLeft[iter](int num, int len, double data[len])
      serial { computeKernel(LEFT, data); }
    when recvRight[iter](int num, int len, double data[len])
      serial { computeKernel(RIGHT, data); }
  }

• iter must be defined in the class as a member

  class Foo : public CBase_Foo {
  public:
    int iter;
  };
![Page 77: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/77.jpg)
Structured Dagger The while construct
• The while construct:
  – Defines a sequenced while loop (like a sequential C while loop)

  while (i < numNeighbors) {
    when recvData(int len, double data[len]) {
      serial { /* do something */ }
      when method1() /* block1 */
      when method2() /* block2 */
    }
    serial { i++; }
  }
![Page 78: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/78.jpg)
Structured Dagger The overlap construct
• The overlap construct:
  – By default, Structured Dagger constructs are executed in a sequence
  – overlap allows multiple independent constructs to execute in any order
  – Any constructs in the body of an overlap can happen in any order
  – An overlap finishes when all the statements in it are executed
  – Syntax: overlap { /* sdag constructs */ }

What are the possible execution sequences?

  serial { /* block1 */ }
  overlap {
    serial { /* block2 */ }
    when entryMethod1[100](int ref_num, bool param1) /* block3 */
    when entryMethod2(char myChar) /* block4 */
  }
  serial { /* block5 */ }
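One way to answer the question on this slide is to enumerate the orderings the stated rule permits: block1 first, the three members of the overlap in any order, block5 last. The plain C++ sketch below does that, treating each block as atomic (an assumption). Which of these orderings actually occurs in a run depends on when the messages for entryMethod1 (with reference number 100) and entryMethod2 arrive, so a given execution may never exhibit some of them.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// List every block ordering allowed by "block1, then {block2, block3,
// block4} in any order, then block5", as strings such as "1 2 3 4 5".
std::vector<std::string> overlapOrderings() {
    // Start from sorted order so next_permutation visits all permutations.
    std::vector<int> inner = {2, 3, 4};
    std::vector<std::string> result;
    do {
        std::string seq = "1";
        for (int b : inner) seq += " " + std::to_string(b);
        seq += " 5";
        result.push_back(seq);
    } while (std::next_permutation(inner.begin(), inner.end()));
    return result;
}
```

The function returns 3! = 6 orderings, from "1 2 3 4 5" to "1 4 3 2 5".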
![Page 79: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/79.jpg)
Illustration of a long “overlap”
• Overlap can be used to regain some asynchrony within a chare
  – But it is constrained
  – More disciplined programming, with fewer race conditions
![Page 80: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/80.jpg)
Structured Dagger The forall construct
• The forall construct:
  – Has “do-all” semantics: iterations may execute in any order
  – Syntax: forall [<ident>] (<min> : <max>, <stride>) <body>
  – The range from <min> to <max> is inclusive

  forall [block] (0 : numBlocks - 1, 1) {
    when method1[block](int ref, bool someVal) /* code block1 */
  }

• Assume block is declared in the class as public: int block;
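Because the forall range is inclusive at both ends, it is easy to be off by one. The plain C++ sketch below lists the indices that forall [i] (lo : hi, stride) would cover; forallIndices is a hypothetical helper and a positive stride is assumed.

```cpp
#include <cassert>
#include <vector>

// Indices covered by forall [i] (lo : hi, stride), with hi inclusive.
// The order of the returned vector is only for readability; forall
// itself imposes no order on its iterations. Assumes stride > 0.
std::vector<int> forallIndices(int lo, int hi, int stride) {
    std::vector<int> out;
    for (int i = lo; i <= hi; i += stride) out.push_back(i);
    return out;
}
```

For the slide's range (0 : numBlocks - 1, 1) this yields exactly numBlocks iterations.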
![Page 81: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/81.jpg)
5-point Stencil
1-D decomposition: each chare object owns a strip
Need to exchange top and bottom boundaries
![Page 82: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/82.jpg)
Jacobi: .ci file
  mainmodule jacobi1d {
    readonly CProxy_Main mainProxy;
    readonly int blockDimX;
    readonly int numChares;
    mainchare Main {
      entry Main(CkArgMsg* m);
    };
    array [1D] Jacobi {
      entry Jacobi(void);
      entry void recvGhosts(int iter, int dir, int size, double gh[size]);
      entry [reductiontarget] void isConverged(bool result);
      entry void run() {
        // ... main loop (next slide) ...
      };
    };
  };
![Page 83: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/83.jpg)
  while (!converged) {
    serial "send_to_neighbors" {
      iter++;
      top = (thisIndex+1)%numChares;
      bottom = …;
      thisProxy(top).recvGhosts(iter, BOTTOM, arrayDimY, &value[1][1]);
      thisProxy(bottom).recvGhosts(iter, TOP, arrayDimY, &value[blockDimX][1]);
    }
    for (imsg = 0; imsg < neighbors; imsg++)
      when recvGhosts[iter](int iter, int dir, int size, double gh[size])
        serial "update_boundary" {
          int row = (dir == TOP) ? 0 : blockDimX+1;
          for (int j=0; j<size; j++) value[row][j+1] = gh[j];
        }
    serial "do_work" {
      conv = check_and_compute(); // conv: a boolean indicating local convergence
      CkCallback cb = CkCallback(CkReductionTarget(Jacobi, isConverged), thisProxy);
      contribute(sizeof(bool), &conv, CkReduction::logical_and, cb);
    }
    when isConverged(bool result) serial "check_converge" {
      converged = result;
      if (result && thisIndex == 0) CkExit();
    }
  }
![Page 84: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/84.jpg)
  while (!converged) {
    serial "send_to_neighbors" {
      iter++;
      top = (thisIndex+1)%numChares;
      bottom = …;
      thisProxy(top).recvGhosts(iter, BOTTOM, arrayDimY, &value[1][1]);
      thisProxy(bottom).recvGhosts(iter, TOP, arrayDimY, &value[blockDimX][1]);
    }
    for (imsg = 0; imsg < neighbors; imsg++)
      when recvGhosts[iter](int iter, int dir, int size, double gh[size])
        serial "update_boundary" {
          int row = (dir == TOP) ? 0 : blockDimX+1;
          for (int j=0; j<size; j++) value[row][j+1] = gh[j];
        }
    serial "do_work" {
      conv = check_and_compute(); // conv: a boolean indicating local convergence
      CkCallback cb = CkCallback(CkReductionTarget(Jacobi, isConverged), thisProxy);
      contribute(sizeof(bool), &conv, CkReduction::logical_and, cb);
    }
    when isConverged(bool result) serial "check_converge" {
      converged = result;
      if (result && thisIndex == 0) CkExit();
    }
    if (iter % LBPERIOD == 0) {
      serial "start_lb" { AtSync(); }
      when ResumeFromSync() {}
    }
  }
![Page 85: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/85.jpg)
Grainsize
• Charm++ philosophy:
  – Let the programmer decompose their work and data into coarse-grained entities
• It is important to understand what I mean by coarse-grained entities
  – You don’t write sequential programs that some system will auto-decompose
  – You don’t write programs where there is one object for each float
  – You consciously choose a grainsize, BUT choose it independent of the number of processors
    • Or parameterize it, so you can tune later
![Page 86: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/86.jpg)
Crack Propagation
Decomposition into 16 chunks (left) and 128 chunks, 8 for each PE (right). The middle area contains cohesive elements. Both decompositions obtained using Metis. Pictures: S. Breitenfeld, and P. Geubelle
This is 2D, circa 2002, but it shows over-decomposition for unstructured meshes.
![Page 87: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/87.jpg)
Grainsize example: NAMD
• High-performing examples (objects are the work-data units in Charm++):
• On Blue Waters, 100M atom simulation
  – 128K cores (4K nodes), 5,510,202 objects
• Edison, Apoa1 (92K atoms)
  – 4K cores, 33,124 objects
• Hopper, STMV, 1M atoms
  – 15,360 cores, 430,612 objects
![Page 88: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/88.jpg)
Grainsize: Weather Forecasting in BRAMS
• BRAMS: Brazilian weather code (based on RAMS)
• AMPI version (Eduardo Rodrigues, with Mendes, J. Panetta, ..)
Instead of using 64 work units on 64 cores, used 1024 on 64
![Page 89: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/89.jpg)
Working definition of grainsize: amount of computation per remote interaction
Choose grainsize to be just large enough to amortize the overhead
![Page 90: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/90.jpg)
Grainsize in a common setting
[Plot: time per step (sec), y-axis ticks 1, 2, 4, versus number of points per chare, x-axis ticks 128M, 32M, 8M, 2M, 512K, 64K, 16K, 4K]
Jacobi3D running on JYC using 64 cores on 2 nodes
2048x2048x2048 (total problem size)
2 MB/chare, 256 objects per core
![Page 91: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/91.jpg)
Rules of thumb for grainsize
• Make it as small as possible, as long as it amortizes the overhead
• More specifically, ensure:
  – Average grainsize is greater than k·v (say 10v)
  – No single grain should be allowed to be too large
    • Must be smaller than T/p, but actually we can express it as
    • Must be smaller than k·m·v (say 100v)
• Important corollary:
  – You can be at close to optimal grainsize without having to think about P, the number of processors
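These rules can be turned into a quick check. The symbols v, k and m are not defined in this part of the slides, so the sketch assumes v is the overhead paid per grain and uses the slide's "say" factors of 10 and 100 as defaults. Under that assumption a grain of g units of work spends v/(g+v) of its time in overhead, which is 1/11 (about 9%) at g = 10v. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Fraction of time lost to overhead when a grain performs `work`
// units of computation and costs `overhead` units to schedule and
// communicate (the interpretation of v assumed in this sketch).
double overheadFraction(double work, double overhead) {
    return overhead / (work + overhead);
}

// Check both rules of thumb: the average grain is at least
// lowFactor * v, and no single grain exceeds highFactor * v.
bool grainsizeOk(const std::vector<double>& grains, double v,
                 double lowFactor = 10.0, double highFactor = 100.0) {
    if (grains.empty()) return false;
    double sum = 0.0;
    double largest = 0.0;
    for (double g : grains) {
        sum += g;
        if (g > largest) largest = g;
    }
    double average = sum / static_cast<double>(grains.size());
    return average >= lowFactor * v && largest <= highFactor * v;
}
```

Neither function takes the number of processors as an argument, which is the corollary stated on the slide.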
![Page 92: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/92.jpg)
Charm++ Applications as case studies
Only brief overview today
![Page 93: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/93.jpg)
NAMD: Biomolecular Simulations
• Collaboration with K. Schulten
• With over 50,000 registered users
• Scaled to most top US supercomputers
• In production use on supercomputers and clusters and desktops
• Gordon Bell award in 2002
Recent success: Determination of the structure of HIV capsid by researchers including Prof Schulten
![Page 94: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/94.jpg)
Molecular Dynamics: NAMD
• Collection of [charged] atoms
  – With bonds
  – Newtonian mechanics
  – Thousands to millions of atoms
• At each time-step
  – Calculate forces on each atom
    • Bonds
    • Non-bonded: electrostatic and van der Waals
      – Short-distance: every timestep
      – Long-distance: using PME (3D FFT)
      – Multiple Time Stepping: PME every 4 timesteps
  – Calculate velocities
  – Advance positions
Challenge: femtosecond time-step, millions needed!
![Page 95: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/95.jpg)
Hybrid Decomposition
Object-Based Parallelization for MD: Force Decomp. + Spatial Decomp.
• We have many objects to load balance:
  – Each diamond can be assigned to any proc.
  – Number of diamonds (3D): 14·Number of Cells
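The factor 14 can be reproduced by counting, under assumptions the slide does not spell out: each cell interacts with itself and with its 26 neighbours in a 3D grid, each pair of neighbouring cells shares a single interaction object, and boundaries are periodic so that every cell has a full set of neighbours. That gives 1 + 26/2 = 14 objects per cell, which the plain C++ sketch below counts explicitly.

```cpp
#include <cassert>

// Count the interaction objects attributed to one cell: one for the
// cell's self-interaction, plus one per neighbouring pair. Each pair
// is counted only from the cell that sees the other at a
// lexicographically positive offset (dx, dy, dz), so no pair is
// counted twice across the grid.
int objectsPerCell() {
    int count = 1;  // self-interaction
    for (int dx = -1; dx <= 1; ++dx) {
        for (int dy = -1; dy <= 1; ++dy) {
            for (int dz = -1; dz <= 1; ++dz) {
                bool positive =
                    dx > 0 || (dx == 0 && (dy > 0 || (dy == 0 && dz > 0)));
                if (positive) ++count;
            }
        }
    }
    return count;
}
```

Because these objects outnumber the cells 14 to 1 and each can be placed on any processor, the load balancer has many more units to work with than a purely spatial decomposition would give it.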
![Page 96: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/96.jpg)
Parallelization using Charm++
![Page 97: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/97.jpg)
Sturdy design!
• This design,
  – done in 1995 or so, running on a 12-node HP cluster
• Has survived
  – With minor refinements
• Until today
  – Scaling to 500,000+ cores on Blue Waters!
  – 300,000 cores of Jaguar, or BlueGene/P
1993
![Page 98: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/98.jpg)
Shallow valleys, high peaks, nicely overlapped PME
green: communication
Red: integration Blue/Purple: electrostatics
turquoise: angle/dihedral
Orange: PME
94% efficiency
Apo-A1, on BlueGene/L, 1024 procs
Time intervals on X axis, activity added across processors on Y axis
Projections: Charm++ Performance Analysis Tool
![Page 99: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/99.jpg)
NAMD strong scaling on Titan Cray XK7, Blue Waters Cray XE6, and Mira IBM Blue Gene/Q for 21M and 224M atom benchmarks
[Plot: NAMD on Petascale Machines (2fs timestep with PME). Performance (ns per day, axis from 0.25 to 32) versus number of nodes (256 to 16384), with curves for the 21M-atom and 224M-atom benchmarks on Titan XK7, Blue Waters XE6, and Mira Blue Gene/Q]
![Page 100: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/100.jpg)
ChaNGa: Parallel Gravity
• Collaborative project (NSF)
– With Tom Quinn, Univ. of Washington
• Gravity, gas dynamics
• Barnes-Hut tree codes
– Oct tree is natural decomp
– Geometry has better aspect ratios, so you “open” up fewer nodes
– But is not used because it leads to bad load balance
– Assumption: one-to-one map between sub-trees and PEs
– Binary trees are considered better load balanced
With Charm++: Use Oct-Tree, and let Charm++ map subtrees to processors
Evolution of Universe and Galaxy Formation
![Page 101: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/101.jpg)
ChaNGa: Cosmology Simulation
• Tree: Represents particle distribution
• TreePiece: object/chares containing particles
Collaboration with Tom Quinn UW
![Page 102: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/102.jpg)
ChaNGa: Optimized Performance
• Asynchronous, highly overlapped phases
• Requests for remote data overlapped with local computations
![Page 103: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/103.jpg)
ChaNGa: a recent result
![Page 104: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/104.jpg)
EpiSimdemics
• Simulation of spread of contagion
– Code by Madhav Marathe, Keith Bisset, .. Virginia Tech
– Original was in MPI
• Converted to Charm++
– Benefits: asynchronous reductions improved performance considerably
![Page 105: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/105.jpg)
Simulating contagion over dynamic networks
EpiSimdemics [1]
Agent-based
Realistic population data
Intervention [2]
Co-evolving network, behavior and policy [2]
Transition by interaction: P = 1 - exp(t·log(1 - I·S))
– t: duration of co-presence
– I: infectivity
– S: susceptivity
[Figure: social contact network connecting persons P1 to P4 through locations L1 and L2; an uninfected person (S) and an infectious person (I) are co-present at a location for time t; local transition between states.]
[1] C. Barrett et al., “EpiSimdemics: An Efficient Algorithm for Simulating the Spread of Infectious Disease over Large Realistic Social Networks,” SC08.
[2] K. Bisset et al., “Modeling Interaction Between Individuals, Social Networks and Public Policy to Support Public Health Epidemiology,” WSC09.
Virginia Tech Network Dynamics & Simulation Science Lab April 30, 2014 3 / 26
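The transition-by-interaction formula on this slide can be evaluated directly. The sketch below is only an illustration of the arithmetic, not code from EpiSimdemics: the function name, the argument checks, the time unit for t, and the sample values are all assumptions made for the example.

```python
import math

def infection_probability(t, infectivity, susceptivity):
    """Return P = 1 - exp(t * log(1 - I*S)) for one co-presence interval.

    t: duration of co-presence (the time unit is an assumption of this sketch)
    infectivity: I, for the infectious person
    susceptivity: S, for the uninfected person
    """
    per_unit_time = infectivity * susceptivity
    if t < 0:
        raise ValueError("t must be non-negative")
    if not 0.0 <= per_unit_time < 1.0:
        raise ValueError("I*S must lie in [0, 1) so the logarithm is defined")
    # log1p(-x) evaluates log(1 - x) accurately when x is small.
    return 1.0 - math.exp(t * math.log1p(-per_unit_time))

# Illustrative values only: longer co-presence raises the probability.
short_contact = infection_probability(t=1, infectivity=0.5, susceptivity=0.2)
long_contact = infection_probability(t=10, infectivity=0.5, susceptivity=0.2)
print(short_contact, long_contact)
```

Because exp(t·log(1 - I·S)) equals (1 - I·S)^t, the expression can be read as one minus the chance of escaping infection in every unit of co-presence time.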
![Page 106: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/106.jpg)
Strong scaling performance with the largest data set
[Figure: three strong-scaling plots of simulation time per day (s), ticks from 0.1 to 100, against core count. (1) Strong Scaling (BlueWaters | XE6): 256 to 352K core-modules; series: RR-splitLoc, noBuf; RR, mbuf; RR-splitLoc, mbuf. (2) Strong Scaling (Vulcan | BG/Q): 1K to 128K cores; series: RR, mbuf; RR, TRAM; RR-splitLoc, mbuf; RR-splitLoc, noBuf; RR-splitLoc, TRAM. (3) Strong Scaling (Xeon, Infiniband), RR-splitLoc: 256 to 15K cores; series: Sierra, TRAM; Cab, TRAM; Shadowfax, mbuf.]
Contiguous US population data
XE6: the largest scale (352K cores)
BG/Q: good scaling up to 128K cores
Strong scaling helps timely reaction to pandemic
![Page 107: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/107.jpg)
OpenAtom Car-Parrinello Molecular Dynamics
NSF ITR 2001-2007, IBM, DOE, NSF
Molecular Clusters: Nanowires:
Semiconductor Surfaces: 3D-Solids/Liquids:
Recent NSF SSI-SI2 grant with G. Martyna (IBM) and Sohrab Ismail-Beigi
Using Charm++ virtualization, we can efficiently scale small (32 molecule) systems to thousands of processors
![Page 108: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/108.jpg)
Decomposition and Computation Flow
![Page 109: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/109.jpg)
Topology Aware Mapping of Objects
![Page 110: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/110.jpg)
Improvements by topology-aware mapping of computation to processors
The simulation in the left panel maps computational work to processors taking the network connectivity into account, while the simulation in the right panel does not. The “black” (idle) time that processors spend waiting for computational work to arrive is significantly reduced at left. (256 waters, 70R, on BG/L, 4096 cores)
Punchline: Overdecomposition into Migratable Objects created the degree of freedom needed for flexible mapping
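To make the benefit of flexible mapping concrete, here is a toy comparison of two placements of communicating objects on a torus network. It is a sketch under stated assumptions and is not OpenAtom's or Charm++'s mapping code: the 8x8 torus, the nearest-neighbor communication pattern, one object per processor, and all function names are hypothetical choices for illustration.

```python
import random

def torus_hops(a, b, dims):
    """Shortest hop count between two coordinates on a wraparound mesh."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def neighbor_edges(dims):
    """Each object (i, j) talks to its 'down' and 'right' logical neighbors."""
    rows, cols = dims
    edges = []
    for i in range(rows):
        for j in range(cols):
            edges.append(((i, j), ((i + 1) % rows, j)))
            edges.append(((i, j), (i, (j + 1) % cols)))
    return edges

def total_hops(placement, edges, dims):
    """Sum of network hops over all communicating object pairs."""
    return sum(torus_hops(placement[u], placement[v], dims) for u, v in edges)

def aware_placement(dims):
    """Topology-aware: logical neighbors land on physically adjacent processors."""
    rows, cols = dims
    return {(i, j): (i, j) for i in range(rows) for j in range(cols)}

def oblivious_placement(dims, seed=0):
    """Topology-oblivious: objects are scattered over processors at random."""
    rows, cols = dims
    objects = [(i, j) for i in range(rows) for j in range(cols)]
    processors = list(objects)
    random.Random(seed).shuffle(processors)
    return dict(zip(objects, processors))

DIMS = (8, 8)  # assumed torus size for the example
EDGES = neighbor_edges(DIMS)
aware_cost = total_hops(aware_placement(DIMS), EDGES, DIMS)
oblivious_cost = total_hops(oblivious_placement(DIMS), EDGES, DIMS)
print("aware:", aware_cost, "oblivious:", oblivious_cost)
```

In this toy model every message in the aware placement travels exactly one hop, so its cost equals the number of communicating pairs, and no one-object-per-processor placement can do better. The runtime can only choose such a placement because the work is expressed as objects that it is free to place.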
![Page 111: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/111.jpg)
OpenAtom Performance Sampler
[Figure: OpenAtom running WATER 256M 70Ry on various platforms. X axis: No. of cores (512 to 16K); Y axis: Timestep (secs/step), ticks from 1 to 32; series: Blue Gene/L, Blue Gene/P, Cray XT3.]
Ongoing work on: K-points
![Page 112: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/112.jpg)
MiniApps

| Mini-App | Features | Machine | Max cores |
| --- | --- | --- | --- |
| AMR | Overdecomposition, Custom array index, Message priorities, Load Balancing, Checkpoint restart | BG/Q | 131,072 |
| LeanMD | Overdecomposition, Load Balancing, Checkpoint restart, Power awareness | BG/P / BG/Q | 131,072 / 32,768 |
| Barnes-Hut (n-body) | Overdecomposition, Message priorities, Load Balancing | Blue Waters | 16,384 |
| LULESH 2.02 | AMPI, Over-decomposition, Load Balancing | Hopper | 8,000 |
| PDES | Overdecomposition, Message priorities, TRAM | Stampede | 4,096 |
![Page 113: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/113.jpg)
More MiniApps

| Mini-App | Features | Machine | Max cores |
| --- | --- | --- | --- |
| 1D FFT | Interoperable with MPI | BG/P / BG/Q | 65,536 / 16,384 |
| Random Access | TRAM | BG/P / BG/Q | 131,072 / 16,384 |
| Dense LU | SDAG | XT5 | 8,192 |
| Sparse Triangular Solver | SDAG | BG/P | 512 |
| GTC | SDAG | BG/Q | 1,024 |
| SPH | | Blue Waters | - |
![Page 114: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/114.jpg)
A recently published book surveys seven major applications developed using Charm++
More info on Charm++: http://charm.cs.illinois.edu Including the miniApps
![Page 115: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/115.jpg)
Where are Exascale Issues?
• I didn’t bring up exascale at all so far..
– Overdecomposition, migratability, asynchrony were needed on yesterday’s machines too
– And the app community has been using them
– But: on *some* of the applications, and maybe without a common general-purpose RTS
• The same concepts help at exascale
– Not just help, they are necessary, and adequate
– As long as the RTS capabilities are improved
• We have to apply overdecomposition to all (most) apps
![Page 116: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/116.jpg)
Relevance to Exascale
Intelligent, introspective, Adaptive Runtime Systems, developed for handling application’s dynamic variability, already have features that can deal with challenges posed by exascale hardware
![Page 117: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/117.jpg)
Fault Tolerance in Charm++/AMPI
• Four approaches available:
– Disk-based checkpoint/restart
– In-memory double checkpoint with automatic restart
– Proactive object migration
– Message-logging: scalable fault tolerance
• Common features:
– Easy checkpoint: migrate-to-disk
– Based on dynamic runtime capabilities
– Use of object migration
– Can be used in concert with load-balancing schemes
Demo at Tech Marketplace
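The in-memory double checkpoint idea can be illustrated with a small simulation: each processor keeps one copy of its checkpoint locally and a second copy in the memory of a buddy, so the loss of a single processor never destroys both copies. This is a toy model under assumptions, not the Charm++ implementation or its API: the `ToyPE` class, the "next rank" buddy rule, and the recovery steps are hypothetical.

```python
import copy

class ToyPE:
    """A simulated processor with live state and two checkpoint slots."""
    def __init__(self, rank, state):
        self.rank = rank
        self.state = state       # live application data
        self.own_ckpt = None     # local copy of this PE's checkpoint
        self.buddy_ckpt = None   # copy of the checkpoint of the PE we back up

def buddy_of(rank, n):
    """Assumed buddy rule: PE r stores its second copy on PE (r + 1) mod n."""
    return (rank + 1) % n

def checkpoint(pes):
    """Every PE saves its state locally and in its buddy's memory."""
    n = len(pes)
    for pe in pes:
        pe.own_ckpt = copy.deepcopy(pe.state)
        pes[buddy_of(pe.rank, n)].buddy_ckpt = copy.deepcopy(pe.state)

def fail_and_restart(pes, failed_rank):
    """Lose one PE's memory, then roll every PE back to the last checkpoint."""
    n = len(pes)
    replacement = ToyPE(failed_rank, None)  # fresh PE with empty memory
    pes[failed_rank] = replacement
    # Survivors roll back using their local copies.
    for pe in pes:
        if pe.rank != failed_rank:
            pe.state = copy.deepcopy(pe.own_ckpt)
    # The replacement recovers its state from the copy held by its buddy.
    holder = pes[buddy_of(failed_rank, n)]
    replacement.state = copy.deepcopy(holder.buddy_ckpt)
    replacement.own_ckpt = copy.deepcopy(holder.buddy_ckpt)
    # It also re-acquires the second copy it used to hold for its predecessor.
    predecessor = pes[(failed_rank - 1) % n]
    replacement.buddy_ckpt = copy.deepcopy(predecessor.own_ckpt)

# Usage: checkpoint at step 0, advance, lose PE 2, and recover.
pes = [ToyPE(r, {"step": 0, "data": [r]}) for r in range(4)]
checkpoint(pes)
for pe in pes:
    pe.state["step"] = 5
fail_and_restart(pes, failed_rank=2)
print([pe.state for pe in pes])
```

After recovery every processor again holds two checkpoint copies, so the system can tolerate a further single failure without touching disk.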
![Page 118: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/118.jpg)
Saving Cooling Energy
• Easy: increase A/C setting
– But: some cores may get too hot
• So, reduce frequency if temperature is high (DVFS)
– Independently for each chip
• But, this creates a load imbalance!
• No problem, we can handle that:
– Migrate objects away from the slowed-down processors
– Balance load using an existing strategy
– Strategies take speed of processors into account
• Implemented in experimental version
– SC 2011 paper, IEEE TC paper
• Several new power/energy-related strategies
– PASA ‘12: Exploiting differential sensitivities of code segments to frequency change
Demo at Tech Marketplace
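A strategy that takes processor speed into account can be sketched as a greedy assignment that places each object where it would finish earliest. This is an illustrative sketch, not one of the Charm++ load balancers: the function names, the uniform object loads, and the speed values (one chip slowed to half speed by DVFS) are assumptions.

```python
def speed_aware_greedy(object_loads, pe_speeds):
    """Assign objects, heaviest first, to the PE that would finish them earliest.

    object_loads: work per object, in arbitrary units
    pe_speeds: relative speed per PE (1.0 = nominal frequency)
    Returns (placement, assigned_work): object index -> PE, and work per PE.
    """
    assigned_work = [0.0] * len(pe_speeds)
    placement = {}
    heaviest_first = sorted(enumerate(object_loads), key=lambda pair: -pair[1])
    for obj, load in heaviest_first:
        # Finish time on a PE is its total work divided by its speed.
        best = min(
            range(len(pe_speeds)),
            key=lambda pe: (assigned_work[pe] + load) / pe_speeds[pe],
        )
        assigned_work[best] += load
        placement[obj] = best
    return placement, assigned_work

def makespan(assigned_work, pe_speeds):
    """Time at which the slowest PE finishes its share."""
    return max(work / speed for work, speed in zip(assigned_work, pe_speeds))

# 16 equal objects on 4 PEs; PE 3 was slowed to half speed to cool it down.
loads = [1.0] * 16
speeds = [1.0, 1.0, 1.0, 0.5]

even_split = [4.0, 4.0, 4.0, 4.0]   # placement chosen before the slowdown
before = makespan(even_split, speeds)

placement, rebalanced = speed_aware_greedy(loads, speeds)
after = makespan(rebalanced, speeds)
print("before:", before, "after:", after, "work per PE:", rebalanced)
```

With an even split the slowed processor holds everyone back; after rebalancing it receives less work than the full-speed processors, which is the effect of migrating objects away from slowed-down chips.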
![Page 119: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/119.jpg)
PARM: Power Aware Resource Manager
• Charm++ RTS facilitates malleable jobs
• PARM can improve throughput under a fixed power budget using:
– Overprovisioning (adding more nodes than a conventional data center)
– RAPL (capping power consumption of nodes)
– Job malleability and moldability
[Figure: Power Aware Resource Manager (PARM) block diagram. Labels: Triggers; Job Arrives; Job Ends/Terminates; Scheduler; Schedule Jobs (LP); Update Queue; Execution framework; Launch Jobs/Shrink-Expand; Ensure Power Cap; Profiler; Strong Scaling Power Aware Model; Job Characteristics Database.]
![Page 120: Charm++’ Mo*va*ons’and’Basic’Ideas’press3.mcs.anl.gov/atpesc/files/2016/02/Kale_2015all.pdf · Charm++’ Mo*va*ons’and’Basic’Ideas’ Laxmikant(Sanjay))Kale) h3p://charm.cs.illinois.edu)](https://reader034.fdocuments.us/reader034/viewer/2022051921/600f034123039e08c748898e/html5/thumbnails/120.jpg)
Summary
• Charm++ embodies an adaptive, introspective runtime system
• Many applications have been developed using it
– NAMD, ChaNGa, EpiSimdemics, OpenAtom, …
– Many miniApps, and third-party apps
• Adaptivity developed for apps is useful for addressing exascale challenges
– Resilience, power/temperature optimizations, ..
More info on Charm++: http://charm.cs.illinois.edu Including the miniApps
Overdecomposition Asynchrony Migratability