Page 1:

Managing Heterogeneous MPI Application Interoperation and Execution.
From PVMPI to SNIPE-based MPI_Connect()

Graham E. Fagg*, Kevin S. London, Jack J. Dongarra and Shirley V. Browne.

University of Tennessee and Oak Ridge National Laboratory

* contact at [email protected]

Page 2:

Project Targets

• Allow intercommunication between different MPI implementations or instances of the same implementation on different machines

• Provide heterogeneous intercommunicating MPI applications with access to some of the MPI-2 process control and parallel I/O features

• Allow use of optimized vendor MPI implementations while still permitting distributed heterogeneous parallel computing in a transparent manner.

Page 3:

MPI-1 Communicators

• All processes in an MPI-1 application belong to a global communicator called MPI_COMM_WORLD

• All other communicators are derived from this global communicator.

• Communication can only occur within a communicator.
– Safe communication
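As a minimal sketch (standard MPI-1 calls only; not from the original slides, variable names are illustrative) of deriving a communicator from MPI_COMM_WORLD:

/* Every process starts in MPI_COMM_WORLD; derive a new communicator
   from it by splitting on rank parity (illustrative choice).         */
int rank;
MPI_Comm derived;

MPI_Comm_rank (MPI_COMM_WORLD, &rank);
MPI_Comm_split (MPI_COMM_WORLD, rank % 2, rank, &derived);

/* Messages sent on `derived` cannot be matched by receives posted on
   MPI_COMM_WORLD: communication stays within one communicator.       */
MPI_Comm_free (&derived);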

Page 4:

MPI Internals: Processes

• All process groups are derived from the membership of the MPI_COMM_WORLD communicator.

– i.e. no external processes.

• MPI-1 process membership is static, not dynamic as in PVM or LAM 6.X

– simplified consistency reasoning

– fast communication (fixed addressing) even across complex topologies.

– interfaces well to simple run-time systems as found on many MPPs.
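A short sketch (standard MPI-1 group calls; the two-process group is an illustrative choice) of how every process group is carved out of MPI_COMM_WORLD's membership:

/* Derive a group containing only world ranks 0 and 1; no process
   outside MPI_COMM_WORLD can be named.                              */
MPI_Group world_group, pair_group;
MPI_Comm  pair_comm;
int pair_ranks[2] = { 0, 1 };

MPI_Comm_group (MPI_COMM_WORLD, &world_group);
MPI_Group_incl (world_group, 2, pair_ranks, &pair_group);

/* Collective over MPI_COMM_WORLD; ranks outside the new group are
   handed back MPI_COMM_NULL.                                        */
MPI_Comm_create (MPI_COMM_WORLD, pair_group, &pair_comm);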

Page 5:

MPI-1 Application

[Diagram: an MPI-1 application's MPI_COMM_WORLD, with a Derived_Communicator drawn as a subset of its processes.]

Page 6:

Disadvantages of MPI-1

• Static process model
– If a process fails, all communicators it belongs to become invalid, i.e. no fault tolerance.
– Dynamic resources either cause applications to fail due to loss of nodes, or make applications inefficient as they cannot take advantage of new nodes by starting/spawning additional processes.

– When using a dedicated MPP MPI implementation you cannot usually use off-machine or even off-partition nodes.

Page 7:

MPI-2

• Problem areas and needed additional features identified in MPI-1 are being addressed by the MPI-2 forum.

• These include:
– inter-language operation
– dynamic process control / management
– parallel I/O
– extended collective operations

• Support for inter-implementation communication/control was considered but not adopted
– see other projects such as NIST IMPI, PACX and PLUS

Page 8:

User requirements for Inter-Operation

• Dynamic connections between tasks on different platforms/systems/sites

• Transparent inter-operation once it has started

• Single style of API, i.e. MPI only!

• Access to virtual machine / resource management when required, not mandatory use

• Support for complex features across implementations, such as user-derived data types etc.

Page 9:

MPI_Connect / PVMPI2

• API in the MPI-1 style

• Inter-application communication using standard point-to-point MPI-1 function calls

– Supporting all variations of send and receive

– All MPI data types including user derived

• Naming service functions similar in semantics and functionality to current MPI inter-communicator functions
– ease of use for programmers experienced only in MPI and not in other message-passing systems such as PVM / Nexus or TCP socket programming!

Page 10:

Process Identification

• Process groups are identified by a character name. Individual processes are addressed by a set of tuples.

• In MPI:
– {communicator, rank}
– {process group, rank}

• In MPI_Connect/PVMPI2:
– {name, rank}
– {name, instance}

• Instance and rank are identical and range from 0..N-1, where N is the number of processes.

Page 11:

Registration, i.e. naming MPI applications

• Process groups register their name with a global naming service, which returns a system handle used for future operations on this name / communicator pair.
– int MPI_Conn_register (char *name, MPI_Comm local_comm, int *handle);
– call MPI_CONN_REGISTER (name, local_comm, handle, ierr)

• Processes can remove their name from the naming service with:
– int MPI_Conn_remove (int handle);

• A process may have multiple names associated with it.

• Names can be registered, removed and reregistered multiple times without restriction.
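A fragment (built from the prototypes above; the name "SolverA" and the placement of the calls are illustrative) showing the register / remove life-cycle:

int handle;

/* Make this application visible to others under a global name. */
MPI_Conn_register ("SolverA", MPI_COMM_WORLD, &handle);

/* ... other MPI applications can now form inter-communicators to
       "SolverA" by that name ...                                  */

/* Withdraw the name; it may be registered again later without
   restriction.                                                    */
MPI_Conn_remove (handle);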

Page 12:

MPI-1 Inter-communicators

• An inter-communicator is used for point-to-point communication between disjoint groups of processes.

• Inter-communicators are formed using MPI_Intercomm_create(), which operates upon two existing non-overlapping intra-communicators and a bridge communicator.

• MPI_Connect/PVMPI could not use this mechanism, as there is no MPI bridge communicator between groups formed from separate MPI applications: their MPI_COMM_WORLDs do not overlap.
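For contrast, a minimal sketch (standard MPI-1, illustrative values) of MPI_Intercomm_create inside a single application, where MPI_COMM_WORLD itself can serve as the bridge:

/* Split MPI_COMM_WORLD into two halves, then join them with an
   inter-communicator; MPI_COMM_WORLD is the bridge communicator,
   which is exactly what is missing between separate applications.  */
int rank, size;
MPI_Comm half, inter;

MPI_Comm_rank (MPI_COMM_WORLD, &rank);
MPI_Comm_size (MPI_COMM_WORLD, &size);
MPI_Comm_split (MPI_COMM_WORLD, rank < size / 2, rank, &half);

/* Local leader is rank 0 of each half; the remote leader is named
   by its rank in the bridge communicator.                           */
int remote_leader = (rank < size / 2) ? size / 2 : 0;
MPI_Intercomm_create (half, 0, MPI_COMM_WORLD, remote_leader,
                      99 /* tag */, &inter);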

Page 13:

Forming Inter-communicators

• MPI_Connect/PVMPI2 forms its inter-communicators with a modified MPI_Intercomm_create call.

• The bridging communication is performed automatically and the user only has to specify the remote group's registered name.
– int MPI_Conn_intercomm_create (int local_handle, char *remote_group_name, MPI_Comm *new_inter_comm);
– call MPI_CONN_INTERCOMM_CREATE (localhandle, remotename, newcomm, ierr)

Page 14:

Inter-communicators

• Once an inter-communicator has been formed it can be used almost exactly as any other MPI inter-communicator:
– All point-to-point operations
– Communicator comparisons and duplication
– Remote group information

• Resources released by MPI_Comm_free()
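A small fragment (standard MPI calls; `ocean_comm` refers to the inter-communicator formed in the air-model example on the following slides) illustrating the remote-group queries:

int is_inter, local_size, remote_size;

MPI_Comm_test_inter (ocean_comm, &is_inter);       /* 1 for an inter-comm      */
MPI_Comm_size (ocean_comm, &local_size);           /* size of the local group  */
MPI_Comm_remote_size (ocean_comm, &remote_size);   /* size of the remote group */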

Page 15:

Simple example: Air Model

/* air model */
MPI_Init (&argc, &argv);
MPI_Conn_register ("AirModel", MPI_COMM_WORLD, &air_handle);
MPI_Conn_intercomm_create (air_handle, "OceanModel", &ocean_comm);
MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
while (!done) {
  /* do work using intra-comms */
  /* swap values with other model */
  MPI_Send (databuf, cnt, MPI_DOUBLE, myrank, tag, ocean_comm);
  MPI_Recv (databuf, cnt, MPI_DOUBLE, myrank, tag, ocean_comm, &status);
} /* end while done work */
MPI_Conn_remove (air_handle);
MPI_Comm_free (&ocean_comm);
MPI_Finalize ();

Page 16:

Ocean model

/* ocean model */
MPI_Init (&argc, &argv);
MPI_Conn_register ("OceanModel", MPI_COMM_WORLD, &ocean_handle);
MPI_Conn_intercomm_create (ocean_handle, "AirModel", &air_comm);
MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
while (!done) {
  /* do work using intra-comms */
  /* swap values with other model */
  MPI_Recv (databuf, cnt, MPI_DOUBLE, myrank, tag, air_comm, &status);
  MPI_Send (databuf, cnt, MPI_DOUBLE, myrank, tag, air_comm);
}
MPI_Conn_remove (ocean_handle);
MPI_Comm_free (&air_comm);
MPI_Finalize ();

Page 17:

Coupled model

[Diagram: two MPI applications, the Ocean Model and the Air Model, each with its own MPI_COMM_WORLD, joined by a global inter-communicator; the Air Model refers to it as ocean_comm and the Ocean Model as air_comm.]

Page 18:

MPI_Connect Internals, i.e. SNIPE

• SNIPE is a meta-computing system from UTK that was designed to support long-term distributed applications.

• Uses SNIPE as a communications layer.

• Naming services are provided by the RCDS RC_Server system (which is also used by HARNESS for repository information).

• MPI application startup is via SNIPE daemons that interface to standard batch/queuing systems

– These understand LAM, MPICH, MPIF (POE), SGI MPI and qsub variations
• soon Condor and LSF

Page 19:

PVMPI2 Internals, i.e. PVM

• Uses PVM as a communications layer

• Naming services are provided by the PVM Group Server in PVM 3.3.x and by the Mailbox system in PVM 3.4.

• PVM 3.4 is simpler as it has user-controlled message contexts and message handlers.

• MPI application startup is provided by specialist PVM tasker processes.
– These understand LAM, MPICH, IBM MPI (POE), SGI MPI

Page 20:

Internals: linking with MPI

• MPI_Connect and PVMPI are built as an MPI profiling interface; thus they are transparent to user applications.

• At build time they can be configured to call other profiling interfaces, and hence interoperate with other MPI monitoring/tracing/debugging tool sets.

Page 21:

MPI_Connect / PVMPI Layering

[Diagram: call path through the layering]

• The user's code calls an MPI_function, which enters the intercomm library.
• The library looks up the communicators etc.
• If it is a true MPI intra-comm, it uses the profiled MPI call (PMPI_Function).
• Else it translates into SNIPE/PVM addressing and uses the SNIPE/PVM functions of the other library.
• It works out the correct return code, which is returned to the user's code.
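A hedged sketch (not the actual MPI_Connect source; conn_entry_t, conn_lookup() and snipe_send() are hypothetical stand-ins for library internals) of how a profiling-interface wrapper implements this flow for MPI_Send:

/* Hypothetical library internals, declared here only to keep the
   sketch self-contained.                                            */
typedef struct { char remote_name[64]; } conn_entry_t;
conn_entry_t *conn_lookup (MPI_Comm comm);
int snipe_send (const char *name, int inst, int tag,
                void *buf, int count, MPI_Datatype type);

/* Wrapper exported as MPI_Send; the real implementation is reached
   through the profiling entry point PMPI_Send.                      */
int MPI_Send (void *buf, int count, MPI_Datatype type,
              int dest, int tag, MPI_Comm comm)
{
    conn_entry_t *e = conn_lookup (comm);   /* registered inter-comm? */

    if (e == NULL)                           /* true MPI intra-comm   */
        return PMPI_Send (buf, count, type, dest, tag, comm);

    /* Otherwise translate into {name, rank} addressing, hand the data
       to the SNIPE/PVM transport, and map its result onto an MPI
       return code.                                                    */
    int rc = snipe_send (e->remote_name, dest, tag, buf, count, type);
    return (rc == 0) ? MPI_SUCCESS : MPI_ERR_COMM;
}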

Page 22:

Process Management

• PVM and/or SNIPE can handle the startup of MPI jobs
– General Resource Manager / specialist PVM Tasker control / SNIPE daemons

• Jobs can also be started by MPIRUN
– useful when testing on interactive nodes

• Once enrolled, MPI processes are under SNIPE or PVM control
– signals (such as TERM/HUP)
– notification messages (fault tolerance)

Page 23:

Process Management

[Diagram: a user request goes to the GRM (General Resource Manager) and PVM daemons, which start MPI jobs through platform-specific taskers (MPICH tasker, POE tasker, PBS/qsub tasker) on an MPICH cluster, an IBM SP2 and an SGI O2K.]

Page 24:

Conclusions

• MPI_Connect and PVMPI allow different MPI implementations to inter-operate.

• Only 3 additional calls required.

• Layering requires a full profiled MPI library (complex linking).

• Intra-communication performance may be slightly affected.

• Inter-communication is as fast as the intercomm library used (either PVM or SNIPE).

Page 25:

MPI_Connect interface is much like names, addresses and ports in MPI-2

• Well-known address host:port
– MPI_PORT_OPEN makes a port
– MPI_ACCEPT lets a client connect
– MPI_CONNECT client-side connection

• Service naming
– MPI_NAME_PUBLISH (port, info, service)
– MPI_NAME_GET client side to get the port

Page 26:

Server-Client model

[Diagram: the server calls MPI_Accept() on a well-known {host:port} address; the client calls MPI_Connect to that {host:port}, producing an inter-comm between them.]

Page 27:

Server-Client model

[Diagram: the server calls MPI_Port_open() to obtain a {host:port}, then publishes it under a NAME with MPI_Name_publish; the client retrieves the {host:port} with MPI_Name_get and connects.]
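For comparison, a sketch using the names these calls eventually took in the published MPI-2 standard (MPI_Open_port, MPI_Publish_name, MPI_Comm_accept, MPI_Lookup_name, MPI_Comm_connect); the service name "ocean" is illustrative:

/* Server side: open a port, publish it under a service name, accept. */
char port[MPI_MAX_PORT_NAME];
MPI_Comm client;

MPI_Open_port (MPI_INFO_NULL, port);
MPI_Publish_name ("ocean", MPI_INFO_NULL, port);
MPI_Comm_accept (port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &client);

/* Client side: look up the port by service name and connect. */
char found[MPI_MAX_PORT_NAME];
MPI_Comm server;

MPI_Lookup_name ("ocean", MPI_INFO_NULL, found);
MPI_Comm_connect (found, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &server);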

Page 28:

SNIPE

Page 29:

SNIPE

• Single global name space
– Built using RCDS; supports URLs, URNs and LIFNs
• testing against LDAP

• Scalable / Secure

• Multiple Resource Managers

• Safe execution environment
– Java etc.

• Parallelism is the basic unit of execution

Page 30:

Additional Information

• PVM : http://www.epm.ornl.gov/pvm
• PVMPI : http://icl.cs.utk.edu/projects/pvmpi/
• MPI_Connect : http://icl.cs.utk.edu/projects/mpi_connect/
• SNIPE : http://www.nhse.org/snipe/ and http://icl.cs.utk.edu/projects/snipe/
• RCDS : http://www.netlib.org/utk/projects/rcds/
• ICL : http://www.netlib.org/icl/
• CRPC : http://www.crpc.rice.edu/CRPC
• DOD HPC MSRCs
– CEWES : http://www.wes.hpc.mil
– ARL : http://www.arl.hpc.mil
– ASC : http://www.asc.hpc.mil