MPI Sessions: a proposal to the MPI Forum
How to make MPI Awesome: MPI Sessions
Follow-on to Jeff's crazy thoughts discussed in Bordeaux
Random group of people who have been talking about this stuff: Wesley Bland, Ryan Grant, Dan Holmes, Kathryn Mohror, Martin Schulz, Anthony Skjellum, Jeff Squyres
What we want
• Any thread (e.g., library) can use MPI any time it wants
• But still be able to totally clean up MPI if/when desired
• New parameters to initialize the MPI API
(Figure: a single MPI process containing 12 libraries, Library 1 through Library 12, each independently calling MPI_Init(…).)
Before MPI-3.1, this could be erroneous
int my_thread1_main(void *context) {
  MPI_Initialized(&flag);
  // …
}

int my_thread2_main(void *context) {
  MPI_Initialized(&flag);
  // …
}

int main(int argc, char **argv) {
  MPI_Init_thread(…, MPI_THREAD_FUNNELED, …);
  pthread_create(…, my_thread1_main, NULL);
  pthread_create(…, my_thread2_main, NULL);
  // …
}

These might run at the same time (!)
The MPI-3.1 solution
• MPI_INITIALIZED (and friends) are allowed to be called at any time
  – …even by multiple threads
  – …regardless of MPI_THREAD_* level
• This is a simple, easy-to-explain solution
  – And probably what most applications do, anyway
• But many other paths were investigated
MPI-3.1 MPI_INIT / FINALIZE limitations
• Cannot init MPI from different entities within a process without a priori knowledge / coordination
  – I.e.: MPI-3.1 (intentionally) still did not solve the underlying problem
MPI Process

// Library 1 (thread)
MPI_Initialized(&flag);
if (!flag) MPI_Init(…);

// Library 2 (thread)
MPI_Initialized(&flag);
if (!flag) MPI_Init(…);

THIS IS INSUFFICIENT / POTENTIALLY ERRONEOUS
(More of) What we want
• Fix MPI-3.1 limitations:
  – Cannot init MPI from different entities within a process without a priori knowledge / coordination
  – Cannot initialize MPI more than once
  – Cannot set error behavior of MPI initialization
  – Cannot re-initialize MPI after it has been finalized
All these things overlap:
• Still be able to finalize MPI
• Any thread can use MPI any time
• Re-initialize MPI
• Affect MPI initialization error behavior
How do we get those things?
KEEP CALM
AND
LISTEN TO THE ENTIRE PROPOSAL
New concept: “session”
• A local handle to the MPI library
  – Implementation intent: lightweight / uses very few resources
  – Can also cache some local state
• Can have multiple sessions in an MPI process
  – MPI_Session_init(…, &session);
  – MPI_Session_finalize(…, &session);
MPI Session

(Figure: an MPI process in which an ocean library and an atmosphere library each call MPI_SESSION_INIT(…) against the same underlying MPI library, yielding an "ocean session" and an "atmosphere session": unique handles to the underlying MPI library.)
Initialize / finalize a session
• MPI_Session_init(
    IN  MPI_Info info,
    IN  MPI_Errhandler errhandler,
    OUT MPI_Session *session)
• MPI_Session_finalize(
    INOUT MPI_Session *session)
• Parameters described in next slides…
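As a minimal sketch of the lifecycle, using the proposal's API as written on this slide (the use of MPI_INFO_NULL and MPI_ERRORS_RETURN as arguments is an illustrative assumption; the exact parameter semantics are still under discussion):

    MPI_Session session;
    // Initialize a session with default info and an error handler
    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);
    // … use the session to build groups and communicators …
    // Finalize; may block until all objects derived
    // from this session have been destroyed
    MPI_Session_finalize(&session);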
Session init params
• Info: for future expansion
• Errhandler: to be invoked if MPI_SESSION_INIT errors
  – Likely need a new type of errhandler
    • …or a generic errhandler
    • The FT working group is discussing exactly this topic
MPI Session

(Figure: an MPI process in which the ocean library's session is configured so errors return, while the atmosphere library's session is configured so errors abort. Each session has unique errhandlers, info, local state, etc.)
Great. I have a session. Now what?
Fair warning
• The MPI runtime has long since been a bastard stepchild
  – Barely acknowledged in the standard
  – Mainly in the form of non-normative suggestions
• It's time to change that
Overview
• General scheme:
  – Query the underlying run-time system
    • Get a "set" of processes
  – Determine the processes you want
    • Create an MPI_Group
  – Create a communicator with just those processes
    • Create an MPI_Comm

MPI_Session → query runtime for set of processes → MPI_Group → MPI_Comm
Runtime concepts
• Expose 2 concepts to MPI from the runtime:
  1. Static sets of processes
  2. Each set caches (key,value) string tuples

These slides only discuss static sets (unchanged for the life of the process). However, there are several useful scenarios that involve dynamic membership of sets over time. More discussion needs to occur for these scenarios. For the purposes of these slides, just consider static sets.
Static sets of processes
• Sets are identified by string name
• Two sets are mandated
  – "mpi://WORLD"
  – "mpi://SELF"
• Other sets can be defined by the system:
  – "location://rack/19"
  – "network://leaf-switch/37"
  – "arch://x86_64"
  – "job://12942"
  – … etc.
• Processes can be in more than one set

These names are implementation-dependent.
Examples of sets

(Figures: four MPI processes, 0 through 3, shown as members of various sets.)
• mpi://WORLD contains all four processes.
• Sets such as arch://x86_64 and job://12942 can overlap with mpi://WORLD.
• mpi://SELF: each process is in its own mpi://SELF set.
• location://rack/self resolves per-process; e.g., some processes are in location://rack/17 while others are in location://rack/23.
• User-defined sets such as user://ocean and user://atmosphere, e.g.:

mpiexec \
  --np 2 --set user://ocean ocean.exe : \
  --np 2 --set user://atmosphere atmosphere.exe
Querying the run-time
• MPI_Session_get_names(
    IN  MPI_Session session,
    OUT char **set_names)
• Returns argv-style list of \0-terminated names
  – Must be freed by caller
Example list of set names returned:
mpi://WORLD
mpi://SELF
arch://x86_64
location://rack/17
job://12942
user://ocean
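A sketch of how a caller might consume that argv-style result (MPI_Session_get_names is the proposal's API; the slides only say the result "must be freed by caller", so the exact freeing convention shown here, each string plus the array, is an assumption):

    char **names;
    MPI_Session_get_names(session, &names);
    // argv-style: iterate until the NULL terminator
    for (int i = 0; names[i] != NULL; ++i) {
        printf("set: %s\n", names[i]);
        free(names[i]);   // assumed: caller frees each name…
    }
    free(names);          // …and the array itself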
Values in sets
• Each set has an associated MPI_Info object
• One mandated key in each info:
  – "size": number of processes in this set
• Runtime may also provide other keys
  – Implementation-dependent
Querying the run-time
• MPI_Session_get_info(
    IN  MPI_Session session,
    IN  const char *set_name,
    OUT MPI_Info *info)
• Use existing MPI_Info functions to retrieve (key,value) tuples

Example:
MPI_Info info;
MPI_Session_get_info(session, "mpi://WORLD", &info);
char size_str[MPI_MAX_INFO_VAL];
MPI_Info_get(info, "size", …, size_str, …);
int size = atoi(size_str);
Ummmm… great. What's the point of that?
Make MPI_Groups!
• MPI_Group_create_from_session(
    IN  MPI_Session session,
    IN  const char *set_name,
    OUT MPI_Group *group);

Advice to implementers: this MPI_Group can still be a lightweight object (even if there are a large number of processes in it).
Example:
// Make a group of procs from "location://rack/self"
MPI_Create_group_from_session_name(session, "location://rack/self", &group);

// Use just the even procs
MPI_Group_size(group, &size);
ranges[0][0] = 0;         // first
ranges[0][1] = size - 1;  // last (inclusive; rank "size" would be out of range)
ranges[0][2] = 2;         // stride
MPI_Group_range_incl(group, 1, ranges, &group_of_evens);
Make a communicator from that group
• MPI_Create_comm_from_group(
    IN  MPI_Group group,
    IN  const char *tag,   // for matching (see next slide)
    IN  MPI_Info info,
    IN  MPI_Errhandler errhandler,
    OUT MPI_Comm *comm)

Note: this is different than the existing function
MPI_Comm_create_group(oldcomm, group, (int) tag, &newcomm)
Might need a better name for this new function…?

String tag is used to match concurrent creations by different entities.
(Figure: three MPI processes, each containing an ocean library and an atmosphere library. Across the processes, the ocean libraries call MPI_Create_comm_from_group(…, tag = "gov.anl.ocean", …) while the atmosphere libraries call MPI_Create_comm_from_group(…, tag = "gov.llnl.atmosphere", …); the distinct string tags keep the two concurrent creations separate.)
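A sketch of what the two libraries might each execute independently in every process (tags and set name are from these slides; the variable names and the info/errhandler arguments are illustrative placeholders):

    // In the ocean library, in every process:
    MPI_Create_group_from_session_name(ocean_session, "mpi://WORLD", &ocean_group);
    MPI_Create_comm_from_group(ocean_group, "gov.anl.ocean",
                               info, errhandler, &ocean_comm);

    // In the atmosphere library, possibly running concurrently:
    MPI_Create_group_from_session_name(atmos_session, "mpi://WORLD", &atmos_group);
    MPI_Create_comm_from_group(atmos_group, "gov.llnl.atmosphere",
                               info, errhandler, &atmos_comm);

    // The distinct tags ensure each library's concurrent creation
    // matches with itself, not with the other library's.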
Make any kind of communicator
• MPI_Create_cart_comm_from_group(
    IN  MPI_Group group,
    IN  const char *tag,
    IN  MPI_Info info,
    IN  MPI_Errhandler errhandler,
    IN  int ndims,
    IN  const int dims[],
    IN  const int periods[],
    IN  int reorder,
    OUT MPI_Comm *comm)
Make any kind of communicator
• MPI_Create_graph_comm_from_group(…)
• MPI_Create_dist_graph_comm_from_group(…)
• MPI_Create_dist_graph_adjacent_comm_from_group(…)
Run-time static sets across different sessions in the same process
• Making communicators from the same static set will always result in the same local rank
  – Even if created from different sessions
See example in the next slide…
Run-time static sets across different sessions in the same process
// Session, group, and communicator 1
MPI_Create_group_from_session_name(session_1, "mpi://WORLD", &group1);
MPI_Create_comm_from_group(group1, "ocean", …, &comm1);
MPI_Comm_rank(comm1, &rank1);

// Session, group, and communicator 2
MPI_Create_group_from_session_name(session_2, "mpi://WORLD", &group2);
MPI_Create_comm_from_group(group2, "atmosphere", …, &comm2);
MPI_Comm_rank(comm2, &rank2);

// Ranks are guaranteed to be the same
assert(rank1 == rank2);
Law of Least Astonishment
Mixing requests from different sessions: disallowed
// Session, group, and communicator 1
MPI_Create_group_from_session_name(session_1, "mpi://WORLD", &group1);
MPI_Create_comm_from_group(group1, "ocean", …, &comm1);
MPI_Isend(…, &req[0]);

// Session, group, and communicator 2
MPI_Create_group_from_session_name(session_2, "mpi://WORLD", &group2);
MPI_Create_comm_from_group(group2, "atmosphere", …, &comm2);
MPI_Isend(…, &req[1]);

// Mixing requests from different
// sessions is disallowed
MPI_Waitall(2, req, …);

Rationale: this is difficult to optimize, particularly if a session maps to hardware resources.
MPI_Session_finalize
• Analogous to MPI_FINALIZE
  – Can block waiting for the destruction of the objects derived from that session
    • Communicators, Windows, Files, … etc.
  – Each session that is initialized must be finalized
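Putting the pieces together, a complete session lifecycle might look like this (a sketch against the proposal's API as shown in these slides; error checking is omitted, and the tag "my.app.tag" and the info/errhandler choices are illustrative assumptions):

    MPI_Session session;
    MPI_Group group;
    MPI_Comm comm;
    int rank;

    // session -> group -> communicator
    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);
    MPI_Create_group_from_session_name(session, "mpi://WORLD", &group);
    MPI_Create_comm_from_group(group, "my.app.tag",
                               MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm);
    MPI_Comm_rank(comm, &rank);

    // … communicate on comm …

    // Destroy derived objects, then finalize the session
    MPI_Group_free(&group);
    MPI_Comm_free(&comm);
    MPI_Session_finalize(&session);  // may block until derived objects are gone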
Well, that all sounds great.
…but who calls MPI_INIT?
And what session does MPI_COMM_WORLD / MPI_COMM_SELF belong to?
New concept: no longer require MPI_INIT / MPI_FINALIZE
• WHAT?!
• When will MPI initialize itself?
• How will MPI finalize itself?
  – It is still (very) desirable to allow MPI to clean itself up so that MPI processes can be "valgrind clean" when they exit
Split MPI APIs into two sets

Performance doesn't matter (as much):
• Functions that create / query / destroy:
  – MPI_Comm
  – MPI_File
  – MPI_Win
  – MPI_Info
  – MPI_Op
  – MPI_Errhandler
  – MPI_Datatype
  – MPI_Group
  – MPI_Session
  – Attributes
  – Processes
• MPI_T

Performance absolutely matters:
• Point to point
• Collectives
• I/O
• RMA
• Test/Wait
• Handle language xfer
• The creation / query / destruction functions ensure that MPI is initialized (and/or finalized)
• The performance-critical functions still can't be used unless MPI is initialized
• The creation / query / destruction functions init / finalize MPI transparently
• The performance-critical functions can't be called without a handle created by the creation functions
MPI_COMM_WORLD and MPI_COMM_SELF are notable exceptions.
…I'll address this shortly.
Example:
int main() {
  // Create a datatype – initializes MPI
  MPI_Type_contiguous(2, MPI_INT, &mytype);

The creation of the first user-defined MPI object initializes MPI.
Initialization can be a local action!
Example:
int main() {
  // Create a datatype – initializes MPI
  MPI_Type_contiguous(2, MPI_INT, &mytype);
  // Free the datatype – finalizes MPI
  MPI_Type_free(&mytype);
  // Valgrind clean
  return 0;
}

The destruction of the last user-defined MPI object finalizes / cleans up MPI. This is guaranteed. There are some corner cases described on the following slides.
Example:
int main() {
  // Create a datatype – initializes MPI
  MPI_Type_contiguous(2, MPI_INT, &mytype);
  // Free the datatype – finalizes MPI
  MPI_Type_free(&mytype);

  // Re-initialize MPI!
  MPI_Type_dup(MPI_INT, &mytype);
  return 0;
}

We can also re-initialize MPI! (it's transparent to the user – so why not?)
(Sometimes) Not an error to exit the process with MPI still initialized.
The overall theme
• Just use MPI functions whenever you want
  – MPI will initialize as it needs to
  – Initialization essentially becomes an implementation detail
• Finalization will occur whenever all user-defined handles are destroyed
Wait a minute – What about MPI_COMM_WORLD?
int main() {
  // Can't I do this?
  MPI_Send(…, MPI_COMM_WORLD);

This would be calling a "performance matters" function before a "performance doesn't matter" function. I.e., MPI has not initialized yet.
Wait a minute – What about MPI_COMM_WORLD?
int main() {
  // This is valid
  MPI_Init(NULL, NULL);
  MPI_Send(…, MPI_COMM_WORLD);

Re-define MPI_INIT and MPI_FINALIZE: constructor and destructor for MPI_COMM_WORLD and MPI_COMM_SELF.
INIT and FINALIZE

int main() {
  MPI_Init(NULL, NULL);
  MPI_Send(…, MPI_COMM_WORLD);
  MPI_Finalize();
}
INIT and FINALIZE continue to exist for two reasons:
1. Backwards compatibility
2. Convenience

So let's keep them as close to MPI-3.1 as possible:
• If you call INIT, you have to call FINALIZE
• You can only call INIT / FINALIZE once
• INITIALIZED / FINALIZED only refer to INIT / FINALIZE (not sessions)
If you want different behavior, use sessions
INIT and FINALIZE
• INIT/FINALIZE create an implicit session
  – You cannot extract an MPI_Session handle for the implicit session created by MPI_INIT[_THREAD]
• Yes, you can use INIT/FINALIZE in the same MPI process as other sessions
Backwards compatibility:INITIALIZED and FINALIZED behavior
int main() {
  MPI_Initialized(&flag);  assert(flag == false);
  MPI_Finalized(&flag);    assert(flag == false);

  MPI_Session_create(…, &session1);
  MPI_Initialized(&flag);  assert(flag == false);
  MPI_Finalized(&flag);    assert(flag == false);

  MPI_Init(NULL, NULL);
  MPI_Initialized(&flag);  assert(flag == true);
  MPI_Finalized(&flag);    assert(flag == false);

  MPI_Session_free(…, &session1);
  MPI_Initialized(&flag);  assert(flag == true);
  MPI_Finalized(&flag);    assert(flag == false);

  MPI_Session_create(…, &session2);
  MPI_Initialized(&flag);  assert(flag == true);
  MPI_Finalized(&flag);    assert(flag == false);

  MPI_Finalize();
  MPI_Initialized(&flag);  assert(flag == true);
  MPI_Finalized(&flag);    assert(flag == true);

  MPI_Session_free(…, &session2);
  MPI_Initialized(&flag);  assert(flag == true);
  MPI_Finalized(&flag);    assert(flag == true);
}
Short version: INITIALIZED, FINALIZED, and IS_THREAD_MAIN all still refer to INIT / FINALIZE.
FIN
(for the main part of the proposal)
Items that still need more discussion
Issues that still need more discussion
• Dynamic runtime sets
  – Temporal
  – Membership
• Covered in other proposals:
  – Thread concurrent vs. non-concurrent
  – Generic error handlers
Issues that still need more discussion
• If COMM_WORLD|SELF are not available by default:
  – Do we need new destruction hooks to replace SELF attribute callbacks on FINALIZE?
  – What is the default error handler behavior for functions without comm/file/win?
• Do we need syntactic sugar to get a comm from mpi://WORLD?
• How do tools hook into MPI initialization and finalization?
Session queries
• Query session handle equality
  – MPI_Session_query(handle1, handle1_type, handle2, handle2_type, bool *are_they_equal)
  – Not 100% sure we need this…?
Session thread support
• Associate thread level support with sessions
• Three options:
1. Similar to MPI-3.1: “first” initialization picks thread level
2. Let each session pick its own thread level (via info key in SESSION_CREATE)
3. Just make MPI always be THREAD_MULTIPLE