FT-MPICH: Providing Fault Tolerance for MPI Parallel Applications

Prof. Heon Y. Yeom
Distributed Computing Systems Lab.
Seoul National University
Heon Y. Yeom, Seoul National University, Condor Week 2006, [email protected]
Motivation
Condor supports a Checkpoint/Restart (C/R) mechanism only in the Standard Universe, and only for single-process jobs.
C/R for parallel jobs is not provided in any of the current Condor universes.
We would like to make C/R available for MPI programs.
Introduction
Why the Message Passing Interface (MPI)?
Designing a generic FT framework is extremely hard due to the diversity of hardware and software systems.
We have chosen the MPICH series.
MPI is the most popular programming model in cluster computing.
Providing fault tolerance to MPI is more cost-effective than providing it to the OS or hardware.
Architecture (Concept)

[Diagram: FT-MPICH combines monitoring, failure detection, and a C/R protocol.]
Architecture (Overall System)

[Diagram: the Management System communicates over Ethernet with each MPI process; every MPI process has a communication layer that talks to its message queue via IPC.]
Management System
The Management System makes MPI more reliable by providing:
Failure detection
Checkpoint coordination
Recovery
Initialization coordination
Output management
Checkpoint transfer
Manager System
[Diagram: a Leader Manager coordinates a Local Manager attached to each MPI process and writes checkpoints to stable storage.]

Leader Manager and Local Managers: initialization, checkpoint command, checkpoint transfer, failure notification and recovery.
MPI processes: communicate directly with each other to exchange data.
Fault-tolerant MPICH_P4
[Diagram: FT-MPICH software stack over Ethernet.]

FT Module: atomic message transfer and connection re-establishment.
Recovery Module: checkpoint toolkit.
Collective and point-to-point (P2P) operations run on top of the ADI (Abstract Device Interface) with the ch_p4 (Ethernet) device.
Startup in Condor
Preconditions:
The Leader Manager already knows, from user input, the machines on which the MPI processes run and the number of MPI processes.
The binaries of the Local Manager and the MPI process are located at the same path on each machine.
Startup in Condor
Job submission description file: the Vanilla Universe is used. A shell script is used in the submission description file (executable points to the shell script), and the shell script only executes the Leader Manager.

Ex) exe.sh (shell script):

#!/bin/sh
Leader_manager …

Example.cmd:

universe = Vanilla
executable = exe.sh
output = exe.out
error = exe.err
log = exe.log
queue
Startup in Condor
Normal job startup: the user submits the job using condor_submit.

[Diagram: the Central Manager (Negotiator, Collector), the submit machine (Schedd, Shadow), and the execute machine (Startd, Starter) running the job, i.e. the Leader Manager, inside the Condor pool.]
Startup in Condor
The Leader Manager executes a Local Manager on each execute machine, and each Local Manager executes its MPI process via fork() and exec().

[Diagram: the Leader Manager job on the execute machine starts Local Managers on Execute Machines 1-3; each Local Manager forks and execs an MPI process.]
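As a rough illustration of this launch step, here is a minimal C sketch of how a Local Manager might fork() and exec() its MPI process and then wait on it; the binary name and the launch_mpi_process helper are hypothetical, not FT-MPICH's actual interface.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical launcher: fork a child, exec the MPI process binary in it,
 * and wait so the Local Manager notices when the process exits. */
int launch_mpi_process(const char *binary, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {                 /* child: become the MPI process */
        execv(binary, argv);
        perror("execv");            /* only reached if exec fails */
        _exit(127);
    }
    int status;
    waitpid(pid, &status, 0);       /* parent: monitor the child */
    return status;
}

int main(void)
{
    char *args[] = { "./mpi_app", NULL };       /* hypothetical binary */
    return launch_mpi_process("./mpi_app", args);
}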
Startup in Condor
The MPI processes send their communication information (IP and port) to the Leader Manager, which aggregates this information and broadcasts the aggregated table to all MPI processes.

[Diagram: MPI processes on Execute Machines 1-3 report their communication info, and the Leader Manager broadcasts the aggregated info back to them.]
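As a sketch of this gather-and-broadcast step, the following C fragment shows a leader that accepts one connection per MPI process, collects each process's (rank, IP, port), and sends the full table back on every connection. The struct layout, the NPROC constant, and the direct process-to-leader connections are assumptions for illustration only.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#define NPROC 4                       /* assumed number of MPI processes */

struct endpoint {                     /* hypothetical wire format */
    int  rank;
    char ip[16];
    int  port;
};

/* Leader side: gather each process's endpoint, then broadcast the table. */
int aggregate_and_broadcast(int listen_port)
{
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(listen_port);
    if (bind(lsock, (struct sockaddr *)&addr, sizeof(addr)) < 0) return -1;
    if (listen(lsock, NPROC) < 0) return -1;

    int conns[NPROC];
    struct endpoint table[NPROC];
    for (int i = 0; i < NPROC; i++) {            /* gather phase */
        conns[i] = accept(lsock, NULL, NULL);
        if (read(conns[i], &table[i], sizeof(table[i])) != (ssize_t)sizeof(table[i]))
            return -1;
    }
    for (int i = 0; i < NPROC; i++) {            /* broadcast phase */
        write(conns[i], table, sizeof(table));
        close(conns[i]);
    }
    close(lsock);
    return 0;
}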
Fault Tolerant MPI
To provide MPI fault tolerance, we have adopted:
A coordinated checkpointing scheme (vs. an independent scheme): the Leader Manager is the coordinator.
Application-level checkpointing (vs. kernel-level checkpointing): this method requires no effort on the part of cluster administrators.
A user-transparent checkpointing scheme (vs. user-aware): this method requires no modification of the MPI application source code.
Atomic Message Passing
Coordination between MPI processes. Assumption: the communication channel is FIFO. Lock() and Unlock() are used to create an atomic region around each message transfer. A sketch of the lock/unlock deferral appears below.

[Diagram: Proc 0 and Proc 1 each bracket a message transfer with Lock()/Unlock(). A CKPT SIG that arrives outside the atomic region is handled immediately ("Checkpoint is performed!!"), while a CKPT SIG that arrives inside the atomic region is deferred until Unlock() ("Checkpoint is delayed!!").]
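A minimal C sketch of how the atomic region can defer the checkpoint signal, assuming the CKPT SIG is delivered as SIGUSR1; the Lock()/Unlock()/take_checkpoint() bodies are illustrative, not FT-MPICH's implementation (and a real handler would avoid printf, which is not async-signal-safe).

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t in_atomic_region = 0;
static volatile sig_atomic_t ckpt_pending     = 0;

static void take_checkpoint(void)
{
    printf("checkpoint performed\n");   /* placeholder for the real CKPT */
}

static void ckpt_sig_handler(int signo)
{
    (void)signo;
    if (in_atomic_region)
        ckpt_pending = 1;               /* checkpoint is delayed */
    else
        take_checkpoint();              /* checkpoint is performed */
}

static void Lock(void)
{
    in_atomic_region = 1;               /* enter the atomic region */
}

static void Unlock(void)
{
    in_atomic_region = 0;               /* leave the atomic region */
    if (ckpt_pending) {                 /* run any deferred checkpoint */
        ckpt_pending = 0;
        take_checkpoint();
    }
}

int main(void)
{
    signal(SIGUSR1, ckpt_sig_handler);  /* CKPT SIG, assumed to be SIGUSR1 */
    Lock();
    raise(SIGUSR1);                     /* signal arrives inside the region... */
    Unlock();                           /* ...so the checkpoint runs here */
    return 0;
}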
Atomic Message Passing (Case 1)
When an MPI process receives the CKPT SIG, it sends and receives barrier messages on its communication channels before checkpointing.

[Diagram: Proc 0 and Proc 1 each receive the CKPT SIG, exchange barrier and data messages, and then checkpoint.]
Atomic Message Passing (Case 2)
By sending and receiving the barrier messages, any in-transit message is pushed to its destination.

[Diagram: a data message is still in transit when the CKPT SIG arrives; the barrier exchange pushes it to the receiver, whose checkpoint is delayed until the message has arrived.]
Atomic Message Passing (Case 3)
Once the barriers have been exchanged, the communication channel between the MPI processes is flushed and the dependency between them is removed, so each process can take a consistent checkpoint.

[Diagram: after the channel is drained, both processes take their (possibly delayed) checkpoints.]
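Cases 1-3 describe the same barrier-based flush seen at different timings. Below is a minimal C sketch of the flush step under the FIFO-channel assumption; the message tags, the flush_channels name, and the file-descriptor channels are illustrative, not FT-MPICH's actual protocol.

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSG_DATA    0                   /* hypothetical message tags */
#define MSG_BARRIER 1

/* On CKPT SIG: send a barrier marker to every peer, then receive from each
 * peer until its barrier marker arrives, buffering any data messages that
 * were still in transit so they become part of the checkpointed state. */
int flush_channels(const int *peer_fds, int npeers)
{
    uint8_t tag;

    for (int i = 0; i < npeers; i++) {          /* send barriers */
        tag = MSG_BARRIER;
        if (write(peer_fds[i], &tag, 1) != 1)
            return -1;
    }
    for (int i = 0; i < npeers; i++) {          /* drain each channel */
        for (;;) {
            if (read(peer_fds[i], &tag, 1) != 1)
                return -1;
            if (tag == MSG_BARRIER)
                break;                          /* channel is now empty */
            /* tag == MSG_DATA: the message was in transit; read and buffer
             * its payload here (omitted) so it is not lost at checkpoint. */
            printf("buffering in-transit data message from peer %d\n", i);
        }
    }
    return 0;                                   /* safe to checkpoint now */
}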
Checkpointing
Coordinated checkpointing: the Leader Manager issues a checkpoint command to every rank; each rank saves its process image (text, data, heap, and stack) to stable storage, producing successive checkpoint versions (ver 1, ver 2, ...).

[Diagram: the Leader Manager sends the checkpoint command to rank0-rank3, whose images are written to storage as version 1 and version 2.]
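A leader-side sketch of this coordination in C; the send_ckpt_command and wait_for_ckpt_done hooks and the NRANKS constant are hypothetical stand-ins for manager-to-manager messages, which the slides do not spell out.

#include <stdio.h>

#define NRANKS 4                       /* assumed number of MPI ranks */

static int send_ckpt_command(int rank, int version)
{
    printf("leader -> rank%d: checkpoint (version %d)\n", rank, version);
    return 0;                          /* stand-in for a control message */
}

static int wait_for_ckpt_done(int rank)
{
    printf("leader <- rank%d: checkpoint complete\n", rank);
    return 0;                          /* stand-in for an acknowledgement */
}

/* Tell every rank to checkpoint, then wait until all of them have written
 * their images to stable storage before committing the new version. */
int coordinated_checkpoint(int version)
{
    for (int r = 0; r < NRANKS; r++)
        if (send_ckpt_command(r, version) != 0)
            return -1;
    for (int r = 0; r < NRANKS; r++)
        if (wait_for_ckpt_done(r) != 0)
            return -1;
    printf("checkpoint version %d committed\n", version);
    return 0;
}

int main(void)
{
    coordinated_checkpoint(1);         /* ver 1 */
    coordinated_checkpoint(2);         /* ver 2 */
    return 0;
}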
Failure Recovery
MPI process recovery: a new process is created and its text, data, heap, and stack are restored from the checkpoint image, yielding the restarted process.

[Diagram: a checkpoint image (text, data, heap, stack) is loaded into a new process to produce the restarted process.]
Failure Recovery
Connection re-establishment: each MPI process re-opens its socket and sends its IP and port information to its Local Manager. This is the same exchange performed at initialization time. A sketch appears below.
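A minimal C sketch of the re-open-and-report step: the process binds a new listening socket on an OS-chosen port and reports its IP and port to the Local Manager over an existing control connection. The endpoint_info layout, the reopen_and_report name, and the control-socket argument are assumptions, not FT-MPICH's real interface.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

struct endpoint_info {                 /* hypothetical wire format */
    char ip[16];
    int  port;
};

int reopen_and_report(int manager_fd, const char *my_ip)
{
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = 0;                          /* let the OS pick a port */
    if (bind(lsock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;
    if (listen(lsock, 8) < 0)
        return -1;

    socklen_t len = sizeof(addr);               /* learn the chosen port */
    getsockname(lsock, (struct sockaddr *)&addr, &len);

    struct endpoint_info info;
    memset(&info, 0, sizeof(info));
    strncpy(info.ip, my_ip, sizeof(info.ip) - 1);
    info.port = ntohs(addr.sin_port);

    /* report the new endpoint to the Local Manager, as at initialization */
    if (write(manager_fd, &info, sizeof(info)) != (ssize_t)sizeof(info))
        return -1;
    return lsock;                               /* keep listening for peers */
}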
Fault Tolerant MPI
Recovery from failure: the Leader Manager detects the failure and restarts the affected MPI processes from the latest checkpoint version in stable storage.

[Diagram: failure detection at the Leader Manager, followed by recovery of rank0-rank3 from checkpoint version 1 in storage.]
Fault Tolerant MPI in Condor
The Leader Manager controls the MPI processes by issuing checkpoint commands and by monitoring them; failures are detected and handled inside FT-MPICH, so Condor is not aware of the failure incident.

[Diagram: the Leader Manager, running as the Condor job on the execute machine, manages the Local Managers and MPI processes on Execute Machines 1-3.]
Fault-tolerant MPICH variants (Seoul National University)

[Diagram: the same FT Module (atomic message transfer, connection re-establishment) and Recovery Module (checkpoint toolkit), with collective and P2P operations on top of the ADI (Abstract Device Interface), instantiated over three devices: MPICH-GF on Globus2 (Ethernet), M3 on GM (Myrinet), and SHIELD on MVAPICH (InfiniBand).]
Summary
We can provide fault tolerance for parallel applications using MPICH on Ethernet, Myrinet, and InfiniBand.
Currently, only the P4 (Ethernet) version works with Condor.
We look forward to working with the Condor team.