Fault Tolerant MPI: Protocols and Implementations

Brian J. Argauer, Stephen R. Byers
May 23, 2006
Multiple Processor Systems (EECC 756), Dr. Muhammad Shaaban
Outline

- Motivation for Fault Tolerance
- Techniques
  - Checkpoint
  - Message Logging
- Implementations
  - MPICH-V1
  - MPICH-V2
  - MPICH-VCL
- CoCheck Framework
- Conclusion
Fault Tolerance Motivation

- Current trend toward larger clusters, distributed and GRID computing
- Sources of failure:
  - Nodes
  - Network
  - Human factors
- Thousands of nodes reduce MTBF to hours or minutes
Techniques: Checkpoint

- Capture entire state of task
  - Application, stack, allocated memory, etc.
- On program failure:
  - Kill survivors
  - Restart from the last consistent and complete set of checkpoints
- Coordinated
  - Expensive: all tasks stop message passing, write to disks simultaneously, then continue message passing
- Uncoordinated
  - Nodes checkpoint at different times
  - In-flight messages retained via explicit logging
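The checkpoint/restart kernel can be sketched in a few lines. This is an illustrative simulation, not MPICH code: the function names and the dictionary standing in for process state are hypothetical, and real systems capture the full process image (stack, heap, registers) rather than a Python object.

```python
import os
import pickle
import tempfile

def checkpoint(state, path):
    # Serialize the task's full state (the slide's "application, stack,
    # allocated memory") to stable storage as one complete image.
    with open(path, "wb") as f:
        pickle.dump(state, f)

def restore(path):
    # Reload the most recent complete checkpoint image after a failure.
    with open(path, "rb") as f:
        return pickle.load(f)

# A task checkpoints periodically; after a simulated crash it restarts
# from the last image instead of from the beginning of the computation.
ckpt = os.path.join(tempfile.mkdtemp(), "task0.ckpt")
state = {"iteration": 42, "partial_sum": 3.14}
checkpoint(state, ckpt)
recovered = restore(ckpt)
assert recovered == state
```

The coordinated/uncoordinated distinction above is about *when* each task calls such a routine: all together behind a barrier, or independently with message logging to patch up the gaps.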
Techniques: Message Logging

- Pessimistic log
  - Transaction logging
  - No incoherent states can be reached
  - Can handle an unbounded number of faults
- Optimistic log
  - Log messages
  - Assume part of the log is lost when faults occur
  - Either roll back the entire application if too many faults occur, or assume only one fault at a time can occur in the system
- Causal log
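The pessimistic variant can be sketched as follows (a simplified simulation with hypothetical names, not the library's actual mechanism): the log write completes *before* delivery, which is exactly why no incoherent state is reachable and any number of faults can be replayed through.

```python
class PessimisticLog:
    """Every message is written to a reliable log BEFORE it is delivered,
    so a failed receiver can always be replayed into a coherent state."""

    def __init__(self):
        self.log = []          # stands in for stable (e.g. remote) storage

    def deliver(self, inbox, msg):
        self.log.append(msg)   # synchronous log write happens first ...
        inbox.append(msg)      # ... only then is the message delivered

    def replay(self):
        # Rebuild a lost inbox by re-delivering messages in logged order.
        return list(self.log)

log = PessimisticLog()
inbox = []
for m in ["m1", "m2", "m3"]:
    log.deliver(inbox, m)

# Receiver crashes and loses its inbox; the log reconstructs it exactly.
inbox_after_crash = log.replay()
assert inbox_after_crash == ["m1", "m2", "m3"]
```

An optimistic logger would buffer log writes and flush them lazily, trading this guarantee for lower overhead, which is why it must either roll everything back or bound the number of simultaneous faults.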
Technique Decision

- Checkpointing is efficient at low fault frequency!
- Message logging is efficient at higher fault frequency!
MPICH-V Introduction

- Traditional MPI
  - Static resources
  - Limited error handling
  - Node failure stalls or slows down other nodes
- MPICH-V
  - Research effort to provide an MPI implementation based on MPICH
  - Automatic fault tolerant MPI library
  - Implementations: MPICH-V1, MPICH-V2, MPICH-VCausal, MPICH-VCL
Fault Tolerant Overview

[Taxonomy figure contrasting automatic and non-automatic fault tolerance approaches; entries include MPICH-V1/V2, MPICH-VCL, MPICH-V/Causal, Co-Check, and FT-MPI]
MPICH-V1 Goals

- Volatility tolerance
  - Redundancy
  - Task migration
- Highly distributed
  - Scalable
  - Asynchronous checkpointing
  - No global synchronization
- Inter-administration domain communications
- Security tools for GRID deployment
  - Use a non-protected relay between client and server nodes if client and server are both firewalled
MPICH-V1 Overview

- Designed from a standard MPI implementation
- Runs existing MPI applications without modification
- Suitable for very large scale computing using heterogeneous networks
- Uncoordinated checkpoint
- Remote pessimistic message logging
MPICH-V1 Architecture

- Checkpoint Server (CS)
  - Stores and provides task images
  - Images are sent to the CS as they are generated by nodes
  - An image is a clone of the running process on a given node
MPICH-V1 Architecture

- Channel Memory
  - Storage of in-transit messages
  - Repository services
- Dispatcher
  - Resource scheduling
  - Task management
MPICH-V1 Performance
MPICH-V2

- Pessimistic logging for large clusters
- Uncoordinated checkpoint
- Nodes store the messages they send locally
- Event Loggers store the sequence of received messages for each node
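The MPICH-V2 split described above, with senders keeping payloads locally while a lightweight Event Logger records only reception order, can be sketched as a simulation (class and function names here are hypothetical, not the MPICH-V2 API):

```python
class Sender:
    def __init__(self, name):
        self.name, self.seq, self.payloads = name, 0, {}

    def send(self, receiver, data, event_logger):
        self.seq += 1
        self.payloads[self.seq] = data  # payload kept locally by the sender
        event_logger.record(receiver.name, (self.name, self.seq))
        receiver.inbox.append(data)

class EventLogger:
    """Stores only the ORDER of receptions per node, not the payloads."""
    def __init__(self):
        self.order = {}

    def record(self, receiver, event):
        self.order.setdefault(receiver, []).append(event)

class Receiver:
    def __init__(self, name):
        self.name, self.inbox = name, []

def recover(receiver_name, event_logger, senders):
    # Replay: fetch payloads back from the senders in the logged order.
    return [senders[s].payloads[seq]
            for s, seq in event_logger.order[receiver_name]]

log = EventLogger()
a, b = Sender("A"), Sender("B")
r = Receiver("R")
a.send(r, "a1", log)
b.send(r, "b1", log)
a.send(r, "a2", log)
assert recover("R", log, {"A": a, "B": b}) == r.inbox == ["a1", "b1", "a2"]
```

The design point is that the central logger only carries small (sender, sequence) events, which is what makes the scheme cheap enough for large clusters.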
MPICH-VCL

- Newest MPICH-V implementation
- Designed for applications that depend on very low latency
- Coordinated checkpoint
MPICH-VCL Performance

- NAS benchmark BT, Class B, 25 nodes, Fast Ethernet
- Performance crossover point between checkpointing and message logging: one fault every 3 minutes
CoCheck Framework

- Abstraction framework
  - Sits above the message passing layer
  - Easily adaptable and portable to different MPI implementations through the use of wrapper functions
  - Provides consistency
  - Considers checkpointing & process migration
- tuMPI
State Consistency Problem

- Processes A, B, C
- Circles = events, arrows = message sending
- S, S', S'' = checkpoint snapshots
- Notice S'' is inconsistent
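The inconsistency in the slide's figure has a crisp definition that can be checked mechanically: a snapshot (cut) is inconsistent when it records a message as received whose send lies outside the cut (an "orphan" message). The set-based formulation below is our simplification of the figure, with the message names chosen to mirror the S, S', S'' labels; it is not taken from CoCheck itself.

```python
def snapshot_is_consistent(sent, received):
    """A cut is consistent iff every received message was also sent inside
    the cut (no orphans). Messages sent but not yet received are merely
    in flight, and do not break consistency."""
    return received <= sent  # set inclusion

# S: m1 sent and received inside the cut -> consistent
assert snapshot_is_consistent(sent={"m1"}, received={"m1"})
# S': m2 sent inside the cut but still in flight -> consistent
assert snapshot_is_consistent(sent={"m1", "m2"}, received={"m1"})
# S'': m3 recorded as received, but its send lies after the cut -> inconsistent
assert not snapshot_is_consistent(sent={"m1"}, received={"m1", "m3"})
```

The in-flight (sent-but-not-received) messages are exactly the ones CoCheck must capture before checkpointing, which motivates the channel-clearing protocol on the next slide.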
Clearing Communication Lines

- Uses a coordinator process
- Coordinator sends a "Ready-Message" (RM) when a checkpoint or migration is needed
- When a process receives an RM on a channel, it assumes no more communication will arrive on that channel
- Once all RMs are received, the process can checkpoint or migrate
- On restart, check for messages in the buffer
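The RM drain can be sketched as a single-threaded simulation (hypothetical names; real CoCheck does this concurrently over network channels): each channel gets an RM after its last real message, and a process drains a channel into its local buffer until the RM arrives, at which point that channel is known to be empty and checkpointing is safe.

```python
from collections import deque

def clear_channels(processes, channels):
    """channels maps (src, dst) pairs to FIFO queues of in-flight messages.
    Returns, per process, the messages it had to buffer before the RM."""
    # Coordinator's request: an RM follows the last message on every channel.
    for ch in channels.values():
        ch.append("RM")
    buffers = {p: [] for p in processes}
    for (src, dst), ch in channels.items():
        while True:
            msg = ch.popleft()
            if msg == "RM":
                break  # no more messages will follow on src -> dst
            buffers[dst].append(msg)  # in-flight msg saved with the checkpoint
    return buffers

procs = ["A", "B"]
chans = {("A", "B"): deque(["x", "y"]), ("B", "A"): deque()}
bufs = clear_channels(procs, chans)
assert bufs == {"A": [], "B": ["x", "y"]}
```

On restart, each process consults its buffer before posting real receives, which is the "check for messages in buffer" step on the slide.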
Performance & Future Research

- Single processor migration results vs. number of processors & size of checkpoint image
- 8 machines: mix of Sun SparcStation 2 and Sparc 10
- Dominating factor: image size
- Future considerations: automatic load balancing
Conclusion

- Current trend: increasing cluster size, lower MTBF, fault tolerance increasingly important
- Fault tolerant implementations in MPI offer an assortment of solutions
- New research is yielding improvements & ideas to enhance the efficiency and robustness of fault tolerant systems
Questions?

References:
- G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, A. Selikhov, "MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes," LRI, Université de Paris Sud, Orsay, France (2002), IEEE.
- G. Stellner, "CoCheck: Checkpointing and Process Migration for MPI," Institut für Informatik der Technischen Universität München, München, Germany.
- G. Fagg, "Fault Tolerant MPI," Linux Magazine (November 2004). [Online]. Available: http://www.linux-mag.com/index2.php?option=com_content&task=view&id=1781&Itemid=2070&pop=1&page=0
- A. Bouteiller, P. Lemarinier, G. Krawezik, F. Cappello, "Coordinated Checkpoint versus Message Log for Fault Tolerant MPI," LRI, Université de Paris Sud, Orsay, France.
- W. Gropp, E. Lusk, "Fault Tolerance in MPI Programs," Argonne National Laboratory, Argonne, IL.
- A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, F. Cappello, "MPICH-V Project: A Multiprotocol Automatic Fault Tolerant MPI," INRIA/LRI, Université de Paris Sud, Orsay, France.
- "MPICH-V Introduction." [Online]. Available: http://mpich-v.lri.fr/index.php?section=intro&subsection=intro