
Page 1: FT-MPICH: Providing fault tolerance for MPI parallel applications

Prof. Heon Y. Yeom
Distributed Computing Systems Lab.
Seoul National University
Condor Week 2006

Page 2: Motivation

Condor supports a Checkpoint/Restart (C/R) mechanism only in the Standard Universe, i.e., for single-process jobs.

C/R for parallel jobs is not provided in any of the current Condor universes.

We would like to make C/R available for MPI programs.

Page 3: Introduction

Why the Message Passing Interface (MPI)?

Designing a generic FT framework is extremely hard due to the diversity of hardware and software systems.

We have chosen the MPICH series ....

MPI is the most popular programming model in cluster computing.

Providing fault tolerance to MPI is more cost-effective than providing it to the OS or hardware.

Page 4: Architecture: Concept

[Diagram: FT-MPICH concept, built on Monitoring, Failure Detection, and a C/R Protocol.]

Page 5: Architecture: Overall System

[Diagram: Overall system. A Management System and multiple MPI Processes, each with a Communication module, connected via IPC and Ethernet, with a Message Queue in between.]

Page 6: Management System

The Management System makes MPI more reliable. Its responsibilities: Failure Detection, Checkpoint Coordination, Recovery, Initialization Coordination, Output Management, and Checkpoint Transfer.

Page 7: Manager System

[Diagram: A Leader Manager coordinates one Local Manager per MPI process and uses stable storage. The Leader Manager handles initialization, the CKPT command, CKPT transfer, and failure notification & recovery; the MPI processes communicate with each other directly to exchange data.]

Page 8: Fault-tolerant MPICH_P4

[Diagram: Fault-tolerant MPICH_P4 stack over Ethernet. Collective and P2P operations sit on top of the ADI (Abstract Device Interface) and the ch_p4 (Ethernet) device. The FT Module provides atomic message transfer and a checkpoint toolkit; the Recovery Module provides connection re-establishment.]

Page 9: Startup in Condor

Preconditions:

The Leader Manager already knows, from user input, the machines where the MPI processes will be executed and the number of MPI processes.

The binaries of the Local Manager and the MPI process are located at the same path on each machine.

Page 10: Startup in Condor

Job submission description file: the Vanilla Universe is used. A shell script is named in the submit description file (executable points to the shell script), and the shell script only executes the Leader Manager.

Ex)

exe.sh (shell script):

#!/bin/sh
Leader_manager …

Example.cmd (submit description file):

universe = Vanilla
executable = exe.sh
output = exe.out
error = exe.err
log = exe.log
queue

Page 11: Startup in Condor

Normal Job Startup: the user submits a job using condor_submit.

[Diagram: Condor pool. The Central Manager runs the Negotiator and Collector; the Submit Machine runs the Schedd and a Shadow; the Execute Machine runs the Startd and a Starter, which launches the Job (Leader Manager).]

Page 12: Startup in Condor

The Leader Manager executes the Local Managers, and each Local Manager executes an MPI process with fork() and exec(), as sketched below.

[Diagram: The Job (Leader Manager) on the Execute Machine starts a Local Manager on Execute Machines 1, 2, and 3; each Local Manager forks and execs its MPI process.]
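To make the fork()/exec() step concrete, here is a minimal sketch (not the actual FT-MPICH code) of how a Local Manager could spawn and watch its MPI process; the binary name ./mpi_app and the absence of arguments are placeholders.

/* Sketch: a Local Manager starting and watching one MPI process. */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* Child: become the MPI process (path and args are placeholders). */
        execl("./mpi_app", "mpi_app", (char *)NULL);
        perror("execl");        /* Only reached if exec fails. */
        _exit(127);
    }

    /* Parent (Local Manager): wait for the MPI process; an unexpected exit
     * is one way a failure could be noticed. */
    int status;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        printf("MPI process exited with status %d\n", WEXITSTATUS(status));
    return 0;
}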

Page 13: Startup in Condor

The MPI processes send their communication info, and the Leader Manager aggregates this info.

The Leader Manager then broadcasts the aggregated info (a sketch of the kind of data involved follows below).

[Diagram: MPI processes on Execute Machines 1, 2, and 3 report their communication info through their Local Managers to the Job (Leader Manager), which broadcasts the aggregated info back to every MPI process.]
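A rough sketch of the per-rank communication info the Leader Manager could collect and broadcast. The struct layout, field sizes, and placeholder addresses are assumptions for illustration, not FT-MPICH's actual wire format.

/* Sketch: aggregated endpoint table held by the Leader Manager. */
#include <stdint.h>
#include <stdio.h>

#define NPROCS 3                /* number of MPI processes (from user input) */

struct endpoint_info {
    int32_t  rank;              /* MPI rank */
    char     ip[16];            /* dotted-quad IP address of the process */
    uint16_t port;              /* listening port for P2P connections */
};

int main(void)
{
    /* Leader Manager side: one slot per rank, filled as reports arrive. */
    struct endpoint_info table[NPROCS];

    /* In the real system each entry arrives over the Local Manager link;
     * here we just fill in placeholder values. */
    for (int r = 0; r < NPROCS; r++) {
        table[r].rank = r;
        snprintf(table[r].ip, sizeof table[r].ip, "10.0.0.%d", r + 1);
        table[r].port = (uint16_t)(5000 + r);
    }

    /* "Broadcast": the aggregated table goes back to every process so that
     * each rank can connect directly to every other rank. */
    for (int r = 0; r < NPROCS; r++)
        printf("rank %d -> %s:%u\n",
               (int)table[r].rank, table[r].ip, (unsigned)table[r].port);
    return 0;
}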

Page 14: Fault Tolerant MPI

To provide MPI fault tolerance, we have adopted:

A coordinated checkpointing scheme (vs. an independent scheme). The Leader Manager is the coordinator!!

Application-level checkpointing (vs. kernel-level CKPT). This method does not require any effort on the part of cluster administrators.

A user-transparent checkpointing scheme (vs. user-aware). This method requires no modification of MPI source code (see the sketch below).
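As an illustration of how user-transparent, application-level checkpointing can be layered under an unmodified MPI program, the following sketch intercepts MPI_Init through the standard PMPI profiling interface and installs a handler for a checkpoint signal. The choice of SIGUSR1 and the empty handler body are assumptions for illustration, not FT-MPICH's actual implementation.

/* Sketch: library-level interception of MPI_Init via the PMPI interface. */
#include <mpi.h>
#include <signal.h>
#include <string.h>

static void ckpt_signal_handler(int signo)
{
    (void)signo;
    /* Real code would coordinate with the Local Manager here and write the
     * process image (text, data, heap, stack) to stable storage. */
}

int MPI_Init(int *argc, char ***argv)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = ckpt_signal_handler;
    sigaction(SIGUSR1, &sa, NULL);   /* SIGUSR1 as the CKPT SIG: an assumption */

    /* Delegate to the real implementation; the application is unchanged. */
    return PMPI_Init(argc, argv);
}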

Page 15: Atomic Message Passing

Coordination between MPI processes. Assumption: the communication channel is FIFO.

Lock() and Unlock() are used to create an atomic region (see the sketch below).

[Diagram: Timelines for Proc 0 and Proc 1. A CKPT SIG that arrives outside the Lock()/Unlock() atomic region is handled at once ("Checkpoint is performed!!"), while a CKPT SIG that arrives inside the atomic region is held back ("Checkpoint is delayed!!").]
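The following is a minimal sketch of one way the Lock()/Unlock() atomic region can be realized: the checkpoint signal is blocked while a message operation is in flight, so a CKPT SIG that arrives inside the region is delivered only at Unlock(). Using SIGUSR1 as the checkpoint signal is an assumption for illustration, not FT-MPICH's actual mechanism.

/* Sketch: delaying the checkpoint signal inside an atomic message region. */
#include <signal.h>

static sigset_t ckpt_set;

static void atomic_region_init(void)
{
    sigemptyset(&ckpt_set);
    sigaddset(&ckpt_set, SIGUSR1);
}

/* Enter the atomic region: delivery of the checkpoint signal is deferred. */
static void Lock(void)
{
    sigprocmask(SIG_BLOCK, &ckpt_set, NULL);
}

/* Leave the atomic region: a pending checkpoint signal is delivered now. */
static void Unlock(void)
{
    sigprocmask(SIG_UNBLOCK, &ckpt_set, NULL);
}

int main(void)
{
    atomic_region_init();

    Lock();
    /* ... send or receive one message atomically here ... */
    Unlock();                   /* a delayed checkpoint, if any, happens here */
    return 0;
}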

Page 16: Atomic Message Passing (Case 1)

When an MPI process receives the CKPT SIG, it sends and receives a barrier message.

[Diagram: Proc 0 and Proc 1 both receive the CKPT SIG outside their atomic regions, exchange barrier messages along with any pending data, and then take the checkpoint (CKPT).]

Page 17: Atomic Message Passing (Case 2)

Through sending and receiving the barrier message, in-transit messages are pushed to the destination.

[Diagram: Proc 0 sends data followed by a barrier; the receiver's checkpoint is delayed (Delayed CKPT) until the in-transit data and the barrier have arrived.]

Page 18: Atomic Message Passing (Case 3)

The communication channel between the MPI processes is flushed, and the dependency between the MPI processes is removed (a sketch of this channel-flush idea follows below).

[Diagram: Once the barrier and all in-transit data have been received, the delayed checkpoint (Delayed CKPT) is taken with the channel empty.]
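To make Cases 1-3 concrete, here is a small self-contained sketch of the channel-flush idea: the sender ends its atomic region with a barrier marker, and the receiver drains the FIFO channel until it sees that marker, so every in-transit message reaches its destination before the checkpoint. The in-memory FIFO is only a stand-in for the real socket channel, and the marker value is an assumption for illustration.

/* Sketch: flushing a FIFO channel with a barrier marker before checkpointing. */
#include <stdio.h>

#define BARRIER   (-1)          /* control marker, distinct from app data */
#define CAPACITY  16

/* A toy FIFO channel from Proc 0 to Proc 1. */
static int fifo[CAPACITY];
static int head, tail;

static void channel_send(int msg) { fifo[tail++ % CAPACITY] = msg; }
static int  channel_recv(void)    { return fifo[head++ % CAPACITY]; }

int main(void)
{
    /* Proc 0 has sent some data that is still "in transit". */
    channel_send(42);
    channel_send(43);

    /* CKPT SIG arrives: Proc 0 closes its atomic region with a barrier. */
    channel_send(BARRIER);

    /* Proc 1, on its CKPT SIG, drains the channel until the barrier,
     * pushing every in-transit message to its destination first. */
    int msg;
    while ((msg = channel_recv()) != BARRIER)
        printf("flushed in-transit message: %d\n", msg);

    /* Channel is now empty: the dependency is removed, checkpoint can proceed. */
    printf("checkpoint can be taken\n");
    return 0;
}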

Page 19: Checkpointing

Coordinated Checkpointing (see the sketch below).

[Diagram: The Leader Manager issues the checkpoint command to rank0-rank3; each rank writes its process image (text, data, heap, stack) to storage, producing checkpoint versions ver 1 and ver 2.]
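A minimal sketch of the coordinator's side of coordinated checkpointing as drawn on this slide: the Leader Manager issues a checkpoint command carrying a new version number to every rank and counts acknowledgments before treating that version as complete. The function shape and the printf stand-ins are assumptions; the real commands and acks travel over the manager channels.

/* Sketch: one round of coordinated checkpointing driven by the Leader Manager. */
#include <stdio.h>

#define NRANKS 4

/* Stand-in for sending the checkpoint command to one rank and receiving its
 * ack once the image (text, data, heap, stack) is in stable storage. */
static int checkpoint_rank(int rank, int version)
{
    printf("rank %d: wrote checkpoint image ver %d to storage\n", rank, version);
    return 1;   /* ack */
}

int main(void)
{
    static int version = 0;

    version++;                              /* start a new checkpoint version */
    int acks = 0;
    for (int r = 0; r < NRANKS; r++)
        acks += checkpoint_rank(r, version);

    if (acks == NRANKS)                     /* all ranks acknowledged */
        printf("ver %d complete: all %d ranks checkpointed\n", version, NRANKS);
    return 0;
}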

Page 20: Failure Recovery

MPI process recovery.

[Diagram: A new process is created, the CKPT image (text, data, heap, stack) is restored into it, and it becomes the restarted process.]

Page 21: Failure Recovery

Connection re-establishment: each MPI process re-opens a socket and sends its IP and port info to the Local Manager. This is the same as what we did before at initialization time (see the sketch below).
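A minimal sketch of the re-opening step: the process creates a new listening socket, lets the kernel pick a port, reads it back with getsockname(), and would then report the IP and port to its Local Manager just as at initialization time. The printf stands in for that report; it is an illustration, not FT-MPICH's actual code.

/* Sketch: re-open a listening socket and discover the assigned port. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = 0;                        /* let the kernel pick a port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 8) < 0) {
        perror("bind/listen");
        return 1;
    }

    /* Find out which port was assigned so it can be reported. */
    socklen_t len = sizeof addr;
    getsockname(fd, (struct sockaddr *)&addr, &len);

    printf("report to Local Manager: port %u\n", ntohs(addr.sin_port));
    close(fd);
    return 0;
}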

Page 22: Fault Tolerant MPI

Recovery from failure.

[Diagram: After failure detection, the Leader Manager directs rank0-rank3 to recover using checkpoint version ver 1 from storage.]

Page 23: Fault Tolerant MPI in Condor

The Leader Manager controls the MPI processes by issuing the checkpoint command and by monitoring them.

[Diagram: Condor pool. The Job (Leader Manager) on the Execute Machine oversees the Local Managers and MPI processes on Execute Machines 1, 2, and 3.]

Condor is not aware of the failure incident.

Page 24: Fault-tolerant MPICH variants (Seoul National University)

[Diagram: The fault-tolerant MPICH variants share the same FT Module (atomic message transfer, checkpoint toolkit) and Recovery Module (connection re-establishment) on top of the ADI (Abstract Device Interface), supporting collective and P2P operations: MPICH-GF over Globus2 (Ethernet), M3 over GM (Myrinet), and SHIELD over MVAPICH (InfiniBand).]

Page 25: Summary

We can provide fault tolerance for parallel applications using MPICH on Ethernet, Myrinet, and InfiniBand.

Currently, only the P4 (Ethernet) version works with Condor.

We look forward to working with the Condor team.
