Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)


Transcript of Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Page 1: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Heon Y. Yeom
Distributed Computing Systems Lab.
Seoul National University

Page 2: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Contents

1. Motivation
2. Introduction
3. Architecture
4. Conclusion

Page 3: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Motivation

Hardware performance limitations are continually overcome, in line with Moore's Law.

These cutting-edge technologies make "Tera-scale" clusters feasible.

However, what about system reliability? Distributed systems are still fragile due to unexpected failures.

Page 4: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Motivation

High-performance network trend:

MPICH-G2 (Ethernet): good speed (1 Gbps), common, the MPICH standard; demands fault-resilience.
MPICH-GM (Myrinet): high speed (10 Gbps), popular, MPICH compatible; demands fault-resilience.
MVAPICH (InfiniBand): high speed (up to 30 Gbps), expected to become popular, MPICH compatible; demands fault-resilience.

All three call for a multiple fault-tolerant framework.

Page 5: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Introduction

Unreliability of distributed systems: even a single local failure can be fatal to parallel processes, since it can render useless all computation executed up to the point of failure.

Our goal is to construct a practical multiple fault-tolerant framework for the various MPICH variants running on high-performance clusters and Grids.

Page 6: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Introduction

Why the Message Passing Interface (MPI)? Designing a generic fault-tolerance framework is extremely hard due to the diversity of hardware and software systems, so we chose the MPICH series:

MPI is the most popular programming model in cluster computing.

Providing fault-tolerance at the MPI level is more cost-effective than providing it in the OS or hardware.

Page 7: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Architecture - Concept -

The multiple fault-tolerant framework combines monitoring, failure detection, a checkpoint/restart (C/R) protocol, and a consensus & election protocol.

Page 8: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Architecture - Overall System -

Diagram: the Management System and the MPI processes run on nodes connected by both Gigabit Ethernet and a high-speed network (Myrinet, InfiniBand). Each MPI process has its own communication module; management traffic travels over Ethernet, while MPI messages use the high-speed network.

Page 9: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Architecture-Development History-

Fault-tolerantFault-tolerantMPICH-G2MPICH-G2-Ethernet--Ethernet-

Fault-tolerantFault-tolerantMPICH-GMMPICH-GM-Myrinet--Myrinet-

Fault-tolerantFault-tolerantMVAPICHMVAPICH

-InfiniBand--InfiniBand-

MPICH-GF FT-MPICH-GM FT-MVAPICH

2004 2005 Current2003

Page 10: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Management System

The management system makes MPI more reliable. It provides failure detection, checkpoint coordination, checkpoint transfer, initialization coordination, output management, and recovery.

Page 11: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Management System

Diagram: a Leader Job Manager coordinates one Local Job Manager per MPI process and interacts with a third-party scheduler (e.g. PBS, LSF), a user CLI, and stable storage. Management traffic is carried over Ethernet, while the MPI processes communicate over the high-speed network (e.g. Myrinet, InfiniBand).

Page 12: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Job Management System 1/2

The job management system manages and monitors the MPI processes and their execution environments. It is designed to be lightweight, it helps the system take consistent checkpoints and recover from failures, and it includes a fault-detection mechanism.

Its two main components are the Central Manager and the Local Job Managers.

Page 13: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Job Management System 2/2

Central Manager: manages all system functions and states; detects node failures and Job Manager failures through periodic heartbeats.

Job Manager: relays messages between the Central Manager and the MPI processes; detects unexpected MPI process failures.
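The heartbeat-based detection described above can be pictured with a minimal sketch, assuming the Central Manager simply records when each node's heartbeat last arrived and declares a node failed after a timeout. The names (`on_heartbeat`, `check_failures`, `HEARTBEAT_TIMEOUT`) and the timeout value are illustrative, not taken from the FT-MPICH sources.

```c
/* Minimal sketch of heartbeat-based failure detection (illustrative names,
 * not from the FT-MPICH sources). */
#include <stdio.h>
#include <time.h>

#define NUM_NODES         4
#define HEARTBEAT_TIMEOUT 3   /* seconds without a heartbeat => declare failed */

static time_t last_heartbeat[NUM_NODES];

/* Called whenever a heartbeat message arrives from `node`. */
static void on_heartbeat(int node)
{
    last_heartbeat[node] = time(NULL);
}

/* Called periodically by the Central Manager; returns how many nodes
 * were declared failed so that recovery can be triggered. */
static int check_failures(void)
{
    time_t now = time(NULL);
    int failed = 0;

    for (int node = 0; node < NUM_NODES; node++) {
        if (now - last_heartbeat[node] > HEARTBEAT_TIMEOUT) {
            printf("node %d missed its heartbeat: trigger recovery\n", node);
            failed++;
        }
    }
    return failed;
}

int main(void)
{
    for (int node = 0; node < NUM_NODES; node++)
        on_heartbeat(node);           /* all nodes alive at start-up */

    /* In the real system this check runs inside the manager's event loop. */
    check_failures();
    return 0;
}
```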

Page 14: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Fault-Tolerant MPI 1/3

To provide MPI fault-tolerance, we adopt:

Coordinated checkpointing (vs. independent checkpointing): the Central Manager acts as the coordinator.

Application-level checkpointing (vs. kernel-level checkpointing): this approach requires no effort on the part of cluster administrators.

User-transparent checkpointing (vs. user-aware checkpointing): this approach requires no modification of the MPI application's source code.
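A minimal sketch of how user-transparent, application-level checkpointing could be triggered, assuming the job manager delivers a checkpoint request as a signal. In FT-MPICH the handler and the state-saving code live inside the MPI library, so the user program needs no changes, and the whole process image plus channel state is saved; here only a toy state is written, and the signal, file name, and function names are illustrative assumptions.

```c
/* Minimal sketch of signal-driven, application-level checkpointing.
 * The signal, file name and toy state are illustrative; the real library
 * checkpoints the full process image without touching user code. */
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t ckpt_requested = 0;

/* The signal only sets a flag; the actual checkpoint is deferred to a
 * point where the process state is consistent. */
static void ckpt_handler(int sig)
{
    (void)sig;
    ckpt_requested = 1;
}

/* Write the (toy) application state to stable storage. */
static void take_checkpoint(long iteration, double value)
{
    FILE *fp = fopen("ckpt.img", "w");
    if (fp != NULL) {
        fprintf(fp, "%ld %f\n", iteration, value);
        fclose(fp);
    }
}

int main(void)
{
    signal(SIGUSR1, ckpt_handler);   /* checkpoint command arrives as a signal */

    double value = 0.0;
    for (long i = 0; i < 1000000; i++) {
        value += 1.0 / (double)(i + 1);   /* stand-in for real computation */

        if (ckpt_requested) {             /* safe point: take the checkpoint */
            take_checkpoint(i, value);
            ckpt_requested = 0;
        }
    }
    printf("result = %f\n", value);
    return 0;
}
```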

Page 15: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Fault-Tolerant MPI 2/3

Coordinated checkpointing (figure): the Central Manager issues a checkpoint command to rank 0 through rank 3, and each rank writes its checkpoint to stable storage. The figure shows two checkpoint versions, ver 1 and ver 2, with the new version written while the previous one is retained.
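A minimal model of the coordinator's side of this protocol, under the assumption that the new checkpoint version is committed only after every rank has acknowledged writing it, so a crash mid-protocol leaves the previous version intact. Function names and the call-based "messaging" are illustrative stand-ins for the real control messages.

```c
/* Minimal model of the coordinator's checkpoint protocol.  Message passing
 * is replaced by direct calls so the sketch is self-contained; in the real
 * system these are control messages between the Central Manager and ranks. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_RANKS 4

/* Stand-in for "rank writes checkpoint version `version` to stable storage
 * and acknowledges"; returns false if the rank fails in the meantime. */
static bool rank_take_checkpoint(int rank, int version)
{
    printf("rank %d: wrote ckpt.%d\n", rank, version);
    return true;
}

/* The committed version advances only after every rank has written the new
 * checkpoint, so a crash mid-protocol leaves the previous consistent
 * version ("ver 1") available for recovery. */
static int committed_version = 1;

static bool coordinate_checkpoint(void)
{
    int next = committed_version + 1;

    for (int r = 0; r < NUM_RANKS; r++)
        if (!rank_take_checkpoint(r, next))
            return false;                 /* keep the old recovery line */

    committed_version = next;             /* commit "ver 2" atomically */
    printf("coordinator: committed version %d\n", committed_version);
    return true;
}

int main(void)
{
    coordinate_checkpoint();
    return 0;
}
```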

Page 16: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Fault-Tolerant MPI 3/3

Recovery from failures (figure): when a failure is detected, the Central Manager restores rank 0 through rank 3 from the last consistent checkpoint (ver 1) on stable storage, and the computation resumes from that point.
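A minimal sketch of the corresponding recovery step, assuming the last committed checkpoint version is used as the global recovery line: the failed rank is restarted from its checkpoint and the surviving ranks roll back to the same version. The two helper functions are placeholders for the real respawn and state-restore machinery.

```c
/* Minimal sketch of the recovery path driven by the Central Manager,
 * using the last committed checkpoint version as the recovery line. */
#include <stdio.h>

#define NUM_RANKS 4

static void restart_from_checkpoint(int rank, int version)
{
    printf("rank %d: restarted from ckpt.%d\n", rank, version);
}

static void rollback(int rank, int version)
{
    printf("rank %d: rolled back to ckpt.%d\n", rank, version);
}

/* Reaction to "rank `failed_rank` crashed": restore a globally
 * consistent state from checkpoint version `committed_version`. */
static void recover(int failed_rank, int committed_version)
{
    for (int r = 0; r < NUM_RANKS; r++) {
        if (r == failed_rank)
            restart_from_checkpoint(r, committed_version);
        else
            rollback(r, committed_version);
    }
}

int main(void)
{
    recover(2, 1);   /* rank 2 failed; last consistent version is 1 */
    return 0;
}
```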

Page 17: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Management System

MPICH-GF is based on the Globus Toolkit 2 and uses a hierarchical management system, which makes it suitable for multiple clusters. It supports recovery from process, manager, and node failures.

Limitations: it does not support recovery from multiple failures, and the Central Manager is a single point of failure.

Page 18: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Management System

FT-MPICH-GM is the newer version. It does not rely on the Globus Toolkit and drops the hierarchical structure, since Myrinet/InfiniBand clusters no longer require it. It supports recovery from multiple failures.

FT-MVAPICH is more robust: it removes the single point of failure through leader election for the job manager.
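A minimal sketch of the leader-election idea, assuming each job manager learns from the failure detector which peers are alive and the lowest-numbered live manager becomes the new leader. This rule is illustrative; the slides do not spell out the actual FT-MVAPICH election protocol.

```c
/* Minimal sketch of electing a new leader job manager: the lowest-numbered
 * manager still reported alive by the failure detector takes over. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_MANAGERS 4

/* alive[i] is kept up to date by the failure detector. */
static bool alive[NUM_MANAGERS] = { false, true, true, true }; /* manager 0 crashed */

static int elect_leader(void)
{
    for (int id = 0; id < NUM_MANAGERS; id++)
        if (alive[id])
            return id;   /* smallest live id wins */
    return -1;           /* no manager left */
}

int main(void)
{
    printf("new leader job manager: %d\n", elect_leader());
    return 0;
}
```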

Page 19: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Fault-tolerant MPICH-variants

Diagram: the fault-tolerance (FT) module, consisting of a recovery module, connection re-establishment, a checkpoint toolkit, and atomic message transfer, is integrated at the ADI (Abstract Device Interface) layer beneath the collective and point-to-point operations. The same design is applied to three devices: Globus2 over Ethernet (MPICH-GF), GM over Myrinet (FT-MPICH-GM), and MVAPICH over InfiniBand (FT-MVAPICH).
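A minimal sketch of the atomic-message-transfer idea, assuming a hypothetical low-level `channel_send()` that may stop partway when a peer fails. The wrapper either hands the complete message to the channel or reports failure so the recovery module can re-establish the connection and resend; the names are illustrative, not the MPICH ADI API.

```c
/* Minimal sketch of atomic message transfer above a hypothetical
 * channel_send(): either the whole message is transferred, or failure is
 * reported so the recovery module can reconnect and resend.  No partial
 * message ever reaches the upper layer. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical low-level send: returns bytes accepted, or -1 on failure. */
static long channel_send(int dest, const char *buf, size_t len)
{
    (void)dest;
    (void)buf;
    return (long)len;      /* always succeeds in this self-contained model */
}

/* Returns true only if the complete message was transferred. */
static bool atomic_send(int dest, const char *buf, size_t len)
{
    size_t sent = 0;

    while (sent < len) {
        long n = channel_send(dest, buf + sent, len - sent);
        if (n < 0)
            return false;  /* peer failed: hand over to the recovery module */
        sent += (size_t)n;
    }
    return true;
}

int main(void)
{
    const char msg[] = "rank 0 -> rank 1 payload";
    printf("atomic_send: %s\n",
           atomic_send(1, msg, strlen(msg)) ? "delivered" : "failed");
    return 0;
}
```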

Page 20: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

Future Works

We are working to incorporate our fault-tolerant management protocol into the GT-4 framework; MPICH-GF is currently GT-2 compliant.

Making MPICH work with different clusters: Gig-E, Myrinet, and InfiniBand, as well as with other stacks such as Open-MPI and VMI.

Supporting non-Intel CPUs: AMD (Opteron).

Page 21: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)

GRID Issues

Who should be responsible for monitoring whether nodes are up or down, resubmitting failed processes, and allocating new nodes?

Grid job management covers resource management, scheduling, and health monitoring.

Page 22: Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)