Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC...

109
Checkpointing HPC Applications Thomas Ropars [email protected] Universit´ e Grenoble Alpes 2016 1

Transcript of Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC...

Page 1: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Checkpointing HPC Applications

Thomas Ropars

[email protected]

Universite Grenoble Alpes

2016 1

Page 2: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Failures in supercomputers

Fault tolerance is a serious problem

• Systems with millions of components

• Failures cannot be ignoredI Due to hardwareI Due to software

Analysis of the failures in Blue Waters1

• All failure categories: MTBF of 4.2 hours

• System-wide outage: MTBF of 6.6 days

• Node failure: MTBF of 6.7 hours

1C. Di Martino et al. “Lessons Learned from the Analysis of System Failuresat Petascale: The Case of Blue Waters”. DSN’14.

2016 2

Page 3: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

A bit of vocabulary

A failure occurs when an error reaches the service interface andalters the service.

Characterization of faults/errors: persistence

• Transient (soft) errorsI Occurs once and disappearsI Eg, bit-flip due to high-energy particles

• Permanent (hard) faults/errorsI Occurs and does not go awayI Eg, a dead power supply

2016 3

Page 4: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Failure model

Correctness of a fault tolerance techniques has to be validatedagainst a failure model.

The failure model

• Crash (fail/stop) failures of nodes

• No recovery

We seek for solutions that ensures the correct termination ofparallel applications despite crash failures.

2016 4

Page 5: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Failure model

Correctness of a fault tolerance techniques has to be validatedagainst a failure model.

The failure model

• Crash (fail/stop) failures of nodes

• No recovery

We seek for solutions that ensures the correct termination ofparallel applications despite crash failures.

2016 4

Page 6: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Agenda

The basic problem

Checkpoint-based protocols

Log-based protocols

Recent contributions

Alternatives to rollback-recovery

2016 5

Page 7: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

The basic problem

Page 8: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Failures in distributed applications

P0

P1

P2

P3

P4

P5 P6

P7

P1

P0

P2

P3

P4

P5 P6

P7

Tightly coupled applications

• One process failure prevents all processes from progressing

2016 7

Page 9: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Failures in distributed applications

P0

P1

P2

P3

P4

P5 P6

P7

P1

P0

P2

P3

P4

P5 P6

P7

Tightly coupled applications

• One process failure prevents all processes from progressing

2016 7

Page 10: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Failures in distributed applications

P0

P1

P2

P3

P4

P5 P6

P7

P1

P0

P2

P3

P4

P5 P6

P7

Tightly coupled applications

• One process failure prevents all processes from progressing

2016 7

Page 11: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Problem definitionA message-passing application

• A fix set of N processes

• Communication by exchanging messagesI MPI application

• Cooperate to execute a distributed algorithm

An asynchronous distributed system

• Finite set of communication channels connecting any orderedpair of processes

I ReliableI FIFOI Ex: TCP, MPI

• AsynchronousI Unknown bound on message transmission delaysI No order between messages on different channels

2016 8

Page 12: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Problem definitionA message-passing application

• A fix set of N processes

• Communication by exchanging messagesI MPI application

• Cooperate to execute a distributed algorithm

An asynchronous distributed system

• Finite set of communication channels connecting any orderedpair of processes

I ReliableI FIFOI Ex: TCP, MPI

• AsynchronousI Unknown bound on message transmission delaysI No order between messages on different channels

2016 8

Page 13: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Problem definition

Crash failures

• When a process fails, it stops executing and communicating.

• All data stored locally is lost

Fault tolerance

• How to ensure the correct execution of the application in thepresence of faults?

I The execution should terminateI It should provide the correct result

2016 9

Page 14: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Problem definition

Crash failures

• When a process fails, it stops executing and communicating.

• All data stored locally is lost

Fault tolerance

• How to ensure the correct execution of the application in thepresence of faults?

I The execution should terminateI It should provide the correct result

2016 9

Page 15: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Backward error recovery

Rollback-recovery (other name)

• Restores the application to a previous error-free state when afailure is detected

• Information about the state of the application saved duringfailure free execution

• Assumes the error will be gone when resuming executionI Transient (soft) errorI Use spare resources to replace faulty ones in case of hard error

BER techniques

• Checkpointing: saving the system state

• Logging: saving changes made to the system

2016 10

Page 16: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Checkpointing

• Periodically save the state of the application

• Restart from last checkpoint in the event of a failure

App

ckpt 1 ckpt 2 ckpt 3 ckpt 4

Checkpoint data is saved to reliable storage:

• Reliable storage survives expected failures

• For single node failure, the memory of a neighbor node is areliable storage

• The parallel file system is a reliable storage

2016 11

Page 17: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Checkpointing

• Periodically save the state of the application

• Restart from last checkpoint in the event of a failure

App

ckpt 1 ckpt 2 ckpt 3 ckpt 4

Checkpoint data is saved to reliable storage:

• Reliable storage survives expected failures

• For single node failure, the memory of a neighbor node is areliable storage

• The parallel file system is a reliable storage

2016 11

Page 18: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Checkpointing

• Periodically save the state of the application

• Restart from last checkpoint in the event of a failure

App

ckpt 1 ckpt 2 ckpt 3 ckpt 4

Checkpoint data is saved to reliable storage:

• Reliable storage survives expected failures

• For single node failure, the memory of a neighbor node is areliable storage

• The parallel file system is a reliable storage

2016 11

Page 19: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Checkpointing

• Periodically save the state of the application

• Restart from last checkpoint in the event of a failure

App

ckpt 1 ckpt 2 ckpt 3 ckpt 4

Checkpoint data is saved to reliable storage:

• Reliable storage survives expected failures

• For single node failure, the memory of a neighbor node is areliable storage

• The parallel file system is a reliable storage

2016 11

Page 20: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Checkpointing

• Periodically save the state of the application

• Restart from last checkpoint in the event of a failure

App

ckpt 1 ckpt 2 ckpt 3 ckpt 4

Checkpoint data is saved to reliable storage:

• Reliable storage survives expected failures

• For single node failure, the memory of a neighbor node is areliable storage

• The parallel file system is a reliable storage

2016 11

Page 21: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Checkpointing a message-passing application

p0

p1

p2

p3

m0

m1

m2 m3

m4

m5 m6m5

• There is no guaranty that m5 will still exists (with the samecontent)

• Processes p0, p1 and p2 might follow a different executionpath

• The state of the application would become inconsistentI Ensuring a consistent state after the failure is the role of the

rollback-recovery protocol

2016 12

Page 22: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Checkpointing a message-passing application

p0

p1

p2

p3

m0

m1

m2 m3

m4

m5 m6m5

• There is no guaranty that m5 will still exists (with the samecontent)

• Processes p0, p1 and p2 might follow a different executionpath

• The state of the application would become inconsistentI Ensuring a consistent state after the failure is the role of the

rollback-recovery protocol

2016 12

Page 23: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Checkpointing a message-passing application

p0

p1

p2

p3

m0

m1

m2 m3

m4

m5 m6m5

• There is no guaranty that m5 will still exists (with the samecontent)

• Processes p0, p1 and p2 might follow a different executionpath

• The state of the application would become inconsistentI Ensuring a consistent state after the failure is the role of the

rollback-recovery protocol

2016 12

Page 24: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Checkpointing a message-passing application

p0

p1

p2

p3

m0

m1

m2 m3

m4

m5 m6m5

• There is no guaranty that m5 will still exists (with the samecontent)

• Processes p0, p1 and p2 might follow a different executionpath

• The state of the application would become inconsistentI Ensuring a consistent state after the failure is the role of the

rollback-recovery protocol

2016 12

Page 25: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Events and partial order

• The execution of a process can be modeled as a sequence ofevents.

• The history of process p, noted H(p), includes send(), recv()and internal events.

Lamport’s Happened-before relation1

• noted →• Events on one process are totally ordered

I If e, e’ ∈ H(p), then e → e′ or e′ → e

• send(m) → recv(m)

• TransitivityI if e → e′ and e′ → e′′, then e → e′′

1L. Lamport. “Time, Clocks, and the Ordering of Events in a DistributedSystem”. Communications of the ACM (1978).

2016 13

Page 26: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Events and partial order

• The execution of a process can be modeled as a sequence ofevents.

• The history of process p, noted H(p), includes send(), recv()and internal events.

Lamport’s Happened-before relation1

• noted →• Events on one process are totally ordered

I If e, e’ ∈ H(p), then e → e′ or e′ → e

• send(m) → recv(m)

• TransitivityI if e → e′ and e′ → e′′, then e → e′′

1L. Lamport. “Time, Clocks, and the Ordering of Events in a DistributedSystem”. Communications of the ACM (1978).

2016 13

Page 27: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Happened-before relation

p0

p1

p2

p3

m0

m1

m2 m3

m4

m5 m6

Happened-before relations:

• recv(m2) → send(m5)

• send(m3) ‖ send(m4)

2016 14

Page 28: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Consistent global stateA rollback-recovery protocol should restore the application in aconsistent global state after a failure.

• A consistent state is one that could have been seen duringfailure-free execution

• A consistent state is a state defined by a consistent cut.

DefinitionA cut C is consistent iff for all events e and e ′:

e ′ ∈ C and e → e ′ =⇒ e ∈ C

• If the state of a process reflects a message reception, then thestate of the corresponding sender should reflect the sending ofthat message

2016 15

Page 29: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Consistent global stateA rollback-recovery protocol should restore the application in aconsistent global state after a failure.

• A consistent state is one that could have been seen duringfailure-free execution

• A consistent state is a state defined by a consistent cut.

DefinitionA cut C is consistent iff for all events e and e ′:

e ′ ∈ C and e → e ′ =⇒ e ∈ C

• If the state of a process reflects a message reception, then thestate of the corresponding sender should reflect the sending ofthat message

2016 15

Page 30: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Consistent global state

p0

p1

p2

p3

m0

m1

m2 m3

m4

m5 m6

Inconsistent recovery line

• Message m5 is an orphan message

• P3 is an orphan process

2016 16

Page 31: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Consistent global state

p0

p1

p2

p3

m0

m1

m2 m3

m4

m5 m6

Inconsistent recovery line

• Message m5 is an orphan message

• P3 is an orphan process

2016 16

Page 32: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Consistent global state

p0

p1

p2

p3

m0

m1

m2 m3

m4

m5 m6

Inconsistent recovery line

• Message m5 is an orphan message

• P3 is an orphan process

2016 16

Page 33: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Before discussing protocols design

• What data to save?

• How to save the state of a process?

• Where to store the data? (reliable storage)

• How frequently to checkpoint?

2016 17

Page 34: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

What data to save?

• The non-temporary application data

• The application data that have been modified since the lastcheckpoint

Incremental checkpointing

• Monitor data modifications between checkpoints to save onlythe changes

I Save storage spaceI Reduce checkpoint time

• Makes garbage collection more complexI Garbage collection = deleting checkpoints that are no longer

useful

2016 18

Page 35: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

What data to save?

• The non-temporary application data

• The application data that have been modified since the lastcheckpoint

Incremental checkpointing

• Monitor data modifications between checkpoints to save onlythe changes

I Save storage spaceI Reduce checkpoint time

• Makes garbage collection more complexI Garbage collection = deleting checkpoints that are no longer

useful

2016 18

Page 36: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

How to save the state of a process?

Application-level checkpointing

The programmer provides the code to save the process state

© Only useful data are stored

© Checkpoint saved when the state is small

§ Difficult to control the checkpoint frequency

§ The programmer has to do the work

System-level checkpointing

The process state is saved by an external tool (ex: BLCR)

§ The whole process state is saved

© Full control on the checkpoint frequency

© Transparent for the programmer

2016 19

Page 37: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

How frequently to checkpoint?

• Checkpointing too often prevents the application from makingprogress

• Checkpointing too infrequently leads to large roll backs in theevent of a failure

Optimal checkpoint frequency depends on:

• The time to checkpoint

• The time to restart/recover

• The failure distribution

2016 20

Page 38: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Checkpoint-basedprotocols

Page 39: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Checkpointing protocols

Three categories of techniques

• Uncoordinated checkpointing

• Coordinated checkpointing

• Communication-induced checkpointing (not efficient withHPC workloads1)

1L. Alvisi et al. “An analysis of communication-induced checkpointing”.FTCS. 1999.

2016 22

Page 40: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Uncoordinated checkpointingIdeaSave checkpoints of each process independently.

p0

p1

p2

m0

m1 m2

m3 m4

m5

m6

Problem

• Is there any guaranty that we can find a consistent state aftera failure?

• Domino effectI Cascading rollbacks on all processes (unbounded)I If process p1 fails, the only consistent state we can find is the

initial state

2016 23

Page 41: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Uncoordinated checkpointingIdeaSave checkpoints of each process independently.

p0

p1

p2

m0

m1 m2

m3 m4

m5

m6

Problem

• Is there any guaranty that we can find a consistent state aftera failure?

• Domino effectI Cascading rollbacks on all processes (unbounded)I If process p1 fails, the only consistent state we can find is the

initial state

2016 23

Page 42: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Uncoordinated checkpointingIdeaSave checkpoints of each process independently.

p0

p1

p2

m0

m1 m2

m3 m4

m5

m6

Problem

• Is there any guaranty that we can find a consistent state aftera failure?

• Domino effectI Cascading rollbacks on all processes (unbounded)I If process p1 fails, the only consistent state we can find is the

initial state

2016 23

Page 43: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Uncoordinated checkpointing

Implementation

• Direct dependencies between the checkpoint intervals arerecorded

I Data piggybacked on messages and saved in the checkpoints

• Used after a failure to construct a dependency graph andcompute the recovery line

I [Bhargava and Lian, 1988]I [Wang, 1993]

Other comments

• Garbage collection is very inefficientI Hard to decide when a checkpoint is not useful anymoreI Many checkpoints may have to be stored

2016 24

Page 44: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Coordinated checkpointing

IdeaCoordinate the processes at checkpoint time to ensure that theglobal state that is saved is consistent.

• No domino effect

p0

p1

p2

m0

m1 m2

m3 m4

m5

m6

2016 25

Page 45: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Coordinated checkpointing

IdeaCoordinate the processes at checkpoint time to ensure that theglobal state that is saved is consistent.

• No domino effect

p0

p1

p2

m0

m1 m2

m3 m4

m5

m6

2016 25

Page 46: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Coordinated checkpointing

Recovery after a failure

• All processes restart from the last coordinated checkpointI Even the non-failed processes have to rollback

• Idea: Restart only the processes that depend on the failedprocess1

I In HPC apps: transitive dependencies between all processes

1R. Koo et al. “Checkpointing and Rollback-Recovery for DistributedSystems”. ACM Fall joint computer conference. 1986.

2016 26

Page 47: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Coordinated checkpointing

Other comments

• Simple and efficient garbage collectionI Only the last checkpoint should be kept

• Performance issues?I What happens when one wants to save the state of all

processes at the same time?

How to coordinate?

2016 27

Page 48: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Coordinated checkpointing

Other comments

• Simple and efficient garbage collectionI Only the last checkpoint should be kept

• Performance issues?I What happens when one wants to save the state of all

processes at the same time?

How to coordinate?

2016 27

Page 49: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

At the application level

Idea: Take advantage of the structure of the code

• The application code might already include globalsynchronization

I MPI collective operations

• In iterative codes, checkpoint every N iterations

2016 28

Page 50: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Time-based checkpointing1

Idea

• Each process takes a checkpoint at the same time

• A solution is needed to synchronize clocks

1N. Neves et al. “Coordinated checkpointing without direct coordination”.IPDS’98.

2016 29

Page 51: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Time-based checkpointing

To ensure consistency

• After checkpointing, a process should not send a message thatcould be received before the destination saved its checkpoint

I The process waits for a delay corresponding to the effectivedeviation

I The effective deviation is computed based on the clock driftand the message transmission delay

p0

p1

m

ED

t(drift)

ED = t(clock drift) −minimum transmission delay

2016 30

Page 52: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Blocking coordinated checkpointing1

1. The initiator broadcasts a checkpoint request to all processes

2. Upon reception of the request, each process stops executingthe application and saves a checkpoint, and sends ack to theinitiator

3. When the initiator has received all acks, it broadcasts ok

4. Upon reception of the ok message, each process deletes its oldcheckpoint and resumes execution of the application

p0

p1

p2

checkpoint

request

ack

ack

ok . . .

. . .

. . .

1Y. Tamir et al. “Error Recovery in Multicomputers Using GlobalCheckpoints”. ICPP. 1984.

2016 31

Page 53: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Blocking coordinated checkpointing1

1. The initiator broadcasts a checkpoint request to all processes

2. Upon reception of the request, each process stops executingthe application and saves a checkpoint, and sends ack to theinitiator

3. When the initiator has received all acks, it broadcasts ok

4. Upon reception of the ok message, each process deletes its oldcheckpoint and resumes execution of the application

p0

p1

p2

checkpoint

request

ack

ack

ok . . .

. . .

. . .

1Y. Tamir et al. “Error Recovery in Multicomputers Using GlobalCheckpoints”. ICPP. 1984.

2016 31

Page 54: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Blocking coordinated checkpointing1

1. The initiator broadcasts a checkpoint request to all processes

2. Upon reception of the request, each process stops executingthe application and saves a checkpoint, and sends ack to theinitiator

3. When the initiator has received all acks, it broadcasts ok

4. Upon reception of the ok message, each process deletes its oldcheckpoint and resumes execution of the application

p0

p1

p2

checkpoint

request

ack

ack

ok . . .

. . .

. . .

1Y. Tamir et al. “Error Recovery in Multicomputers Using GlobalCheckpoints”. ICPP. 1984.

2016 31

Page 55: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Blocking coordinated checkpointing1

1. The initiator broadcasts a checkpoint request to all processes

2. Upon reception of the request, each process stops executingthe application and saves a checkpoint, and sends ack to theinitiator

3. When the initiator has received all acks, it broadcasts ok

4. Upon reception of the ok message, each process deletes its oldcheckpoint and resumes execution of the application

p0

p1

p2

checkpoint

request

ack

ack

ok . . .

. . .

. . .

1Y. Tamir et al. “Error Recovery in Multicomputers Using GlobalCheckpoints”. ICPP. 1984.

2016 31

Page 56: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Blocking coordinated checkpointingCorrectnessDoes the global checkpoint corresponds to a consistent state, i.e.,a state with no orphan messages?

Proof sketch (by contradiction)

• We assume the state is not consistent, and there is an orphanmessage m such that:

send(m) 6∈ C and recv(m) ∈ C

• It means that m was sent after receiving ok by pi

• It also means that m was received before receiving checkpointby pj

• It implies that:

recv(m)→ recvj(ckpt)→ recvi (ok)→ send(m)

2016 32

Page 57: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Blocking coordinated checkpointingCorrectnessDoes the global checkpoint corresponds to a consistent state, i.e.,a state with no orphan messages?

Proof sketch (by contradiction)

• We assume the state is not consistent, and there is an orphanmessage m such that:

send(m) 6∈ C and recv(m) ∈ C

• It means that m was sent after receiving ok by pi

• It also means that m was received before receiving checkpointby pj

• It implies that:

recv(m)→ recvj(ckpt)→ recvi (ok)→ send(m)

2016 32

Page 58: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Non-blocking coordinated checkpointing1

• Goal: Avoid the cost of synchronization

• How to ensure consistency?

p0

p1

p2

initiator

m

• Inconsistent global state

• Message m is orphan

p0

p1

p2

initiator

• Consistent global stateI Send a marker to force p2

to save a checkpointbefore delivering m

1K. Chandy et al. “Distributed Snapshots: Determining Global States ofDistributed Systems”. ACM Transactions on Computer Systems (1985).

2016 33

Page 59: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Non-blocking coordinated checkpointing1

• Goal: Avoid the cost of synchronization

• How to ensure consistency?

p0

p1

p2

initiator

m

• Inconsistent global state

• Message m is orphan

p0

p1

p2

initiator

• Consistent global stateI Send a marker to force p2

to save a checkpointbefore delivering m

1K. Chandy et al. “Distributed Snapshots: Determining Global States ofDistributed Systems”. ACM Transactions on Computer Systems (1985).

2016 33

Page 60: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Non-blocking coordinated checkpointing1

• Goal: Avoid the cost of synchronization

• How to ensure consistency?

p0

p1

p2

initiator

m

• Inconsistent global state

• Message m is orphan

p0

p1

p2

initiator

• Consistent global stateI Send a marker to force p2

to save a checkpointbefore delivering m

1K. Chandy et al. “Distributed Snapshots: Determining Global States ofDistributed Systems”. ACM Transactions on Computer Systems (1985).

2016 33

Page 61: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Non-blocking coordinated checkpointing1

• Goal: Avoid the cost of synchronization

• How to ensure consistency?

p0

p1

p2

initiator

m

• Inconsistent global state

• Message m is orphan

p0

p1

p2

initiator

• Consistent global stateI Send a marker to force p2

to save a checkpointbefore delivering m

1K. Chandy et al. “Distributed Snapshots: Determining Global States ofDistributed Systems”. ACM Transactions on Computer Systems (1985).

2016 33

Page 62: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Non-blocking coordinated checkpointing

Assuming FIFO channels:

1. The initiator takes a checkpoint and broadcasts a checkpointrequest to all processes

2. Upon reception of the request, each process (i) takes acheckpoint, and (ii) broadcast checkpoint-request to all.No event can occur between (i) and (ii).

3. Upon reception of checkpoint-request message from all, aprocess deletes its old checkpoint

2016 34

Page 63: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Non-blocking coordinated checkpointing

Assuming FIFO channels:

1. The initiator takes a checkpoint and broadcasts a checkpointrequest to all processes

2. Upon reception of the request, each process (i) takes acheckpoint, and (ii) broadcast checkpoint-request to all.No event can occur between (i) and (ii).

3. Upon reception of checkpoint-request message from all, aprocess deletes its old checkpoint

2016 34

Page 64: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Non-blocking coordinated checkpointing

Assuming FIFO channels:

1. The initiator takes a checkpoint and broadcasts a checkpointrequest to all processes

2. Upon reception of the request, each process (i) takes acheckpoint, and (ii) broadcast checkpoint-request to all.No event can occur between (i) and (ii).

3. Upon reception of checkpoint-request message from all, aprocess deletes its old checkpoint

2016 34

Page 65: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Log-based protocols

Page 66: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Message-logging protocols

Idea: Logging the messages exchanged during failure free executionto be able to replay them in the same order after a failure

3 families of protocols

• Pessimistic

• Optimistic

• Causal

2016 36

Page 67: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Piecewise determinism

The execution of a process is a set of deterministic state intervals,each started by a non-deterministic event.

• Most of the time, the only non-deterministic events aremessage receptions

p

i

state interval

i-1 i+1

state interval

i+2

From a given initial state, playing the same sequence of messageswill always lead to the same final state.

2016 37

Page 68: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Message logging

Basic idea

• Log all non-deterministic events during failure-free execution

• After a failure, the process re-executes based on the events inthe log

Consistent state

• If all non-deterministic events have been logged, the processfollows the same execution path after the failure

I Other processes do not roll back. They wait for the failedprocess to catch up

2016 38

Page 69: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Message logging

What is logged?

• The content of the messages (payload)

• The delivery order of each message (determinant)I Sender idI Sender sequence numberI Receiver idI Receiver sequence number

2016 39

Page 70: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Where to store the data?

Sender-based message logging1

• The payload can be saved in the memory of the sender

• If the sender fails, it will generate the messages again duringrecovery

Event logging

• Determinants have to be saved on a reliable storage

• They should be available to the recovering processes

1D. B. Johnson et al. “Sender-Based Message Logging”. The 17th AnnualInternational Symposium on Fault-Tolerant Computing. 1987.

2016 40

Page 71: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Event logging

Important

• Determinants are saved by message receivers

• Event logging has an impact on performance as it involves aremote synchronization

The 3 protocol families correspond to different ways of managingdeterminants.

2016 41

Page 72: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

The always no-orphan condition1

An orphan message is a message that is seen has received, butwhose sending state interval cannot be recovered.

p0

p1

p2

p3

m0

m1 m2

If the determinants of messages m0 and m1 have not been saved,then message m2 is orphan.

1L. Alvisi et al. “Message Logging: Pessimistic, Optimistic, Causal, andOptimal”. IEEE Transactions on Software Engineering (1998).

2016 42

Page 73: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

The always no-orphan condition

• e: a non-deterministic event

• Depend(e): the set of processes whose state causally dependson e

• Log(e): the set of processes that have a copy of thedeterminant of e in their memory

• Stable(e): a predicate that is true if the determinant of e islogged on a reliable storage

To avoid orphans:

∀e : ¬Stable(e)⇒ Depend(e) ⊆ Log(e)

2016 43

Page 74: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Pessimistic message logging

Failure-free protocol

• Determinants are loggedsynchronously on reliable storage

∀e : ¬Stable(e)⇒ |Depend(e)| = 1

p

EL

detack

sending delay

Recovery

• Only the failed process has to restart

2016 44

Page 75: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Optimistic message logging

Failure-free protocol

• Determinants are loggedasynchronously (periodically) onreliable storage

p

EL

detack

risk of orphan

Recovery

• All processes whose state depends on a lost event have torollback

• Causal dependency tracking has to be implemented duringfailure-free execution

2016 45

Page 76: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Causal message logging

Failure-free protocol

• Implements the”always-no-orphan” condition

• Determinants are piggybacked onapplication messages until they aresaved on reliable storage

p[det]

Recovery

• Only the failed process has to rollback

2016 46

Page 77: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Comparison of the 3 families

Failure-free performance

• Optimistic ML is the most efficient

• Synchronizing with a remote storage is costly

• Piggybacking potentially large amount of data on messages iscostly

Recovery performance

• Pessimistic ML is the most efficient

• Recovery protocols of optimistic and causal ML can becomplex

2016 47

Page 78: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Message logging + checkpointing

Message logging is combined with checkpointing

• To reduce the extends of rollbacks in time

• To reduce the size of the logs

Which checkpointing protocol?

• Uncoordinated checkpointing can be usedI No risk of domino effect

• Nothing prevents from using coordinated checkpointing

2016 48

Page 79: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Recent contributions

Page 80: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Limits of legacy solutions at scale

Coordinated checkpointing

• Contention on the parallel file system if all processescheckpoint/restart at the same time

I More than 50% of wasted time?1

I Solution: see multi-level checkpointing

• Restarting millions of processes because of a single processfailure is a big waste of resources

1R. A. Oldfield et al. “Modeling the Impact of Checkpoints onNext-Generation Systems”. MSST 2007.

2016 50

Page 81: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Limits of legacy solutions at scale

Message logging

• Logging all messages payload consumes a lot of memoryI Running a climate simulation (CM1) on 512 processes

generates > 1GB/s of logs1

• Managing determinants is costly in terms of performanceI Frequent synchronization with a reliable storage has a high

overheadI Piggybacking information on messages penalizes

communication performance

1T. Ropars et al. “SPBC: Leveraging the Characteristics of MPI HPCApplications for Scalable Checkpointing”. SuperComputing 2013.

2016 51

Page 82: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Coordinated checkpointing + Optimistic ML1

Optimistic ML and coordinated checkpointing are combined

• Dedicated event-logger nodes are used for efficiency

Optimistic message logging

• Negligible performance overhead in failure-free execution

• If no determinant is lost in a failure, only the failed processesrestart

Coordinated checkpointing

• If determinants are lost in a failure, simply restart from thelast checkpoint

I Case of the failure of an event loggerI No complex recovery protocol

• It simplifies garbage collection of messages

1R. Riesen et al. “Alleviating scalability issues of checkpointing protocols”.SuperComputing 2012.

2016 52

Page 83: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Revisiting communication events1

Idea

• Piecewise determinism assumes all message receptions arenon-deterministic events

• In MPI most reception events are deterministicI Discriminating deterministic communication events will

improve event logging efficiency

Impact

• The cost of (pessimistic) event logging becomes negligible

1A. Bouteiller et al. “Redesigning the Message Logging Model for HighPerformance”. Concurrency and Computation : Practice and Experience(2010).

2016 53

Page 84: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Revisiting communication eventsMPI_Isend(m,req1)

MPI_Irecv(req2)

Packet 1 Packet 2 ... Packet n

post(req2) match(req2,m)

MPI_Wait(req1)

complete(req2)

MPI_Wait(req2)

P1

MPI

Library

MPI

Library

P2

send(m)

deliver(m)

New execution model2 events associated with each message reception:

• Matching between message and reception requestI Not deterministic only if ANY SOURCE is used

• Completion when the whole message content has been placedin the user buffer

I Not deterministic only for wait any/some and test functions

2016 54

Page 85: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Hierarchical protocols1

The application processes are grouped in logical clusters

Failure-free execution

• Take coordinatedcheckpoints inside clustersperiodically

• Log inter-cluster messages

Recovery

• Restart the failed clusterfrom the last checkpoint

• Replay missing inter-clustermessages from the logs

P P

P

P P

P P

P PP P

P P

PP

1A. Bouteiller et al. “Correlated Set Coordination in Fault Tolerant MessageLogging Protocols”. Euro-Par’11.

2016 55

Page 86: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Hierarchical protocols1

The application processes are grouped in logical clusters

Failure-free execution

• Take coordinatedcheckpoints inside clustersperiodically

• Log inter-cluster messages

Recovery

• Restart the failed clusterfrom the last checkpoint

• Replay missing inter-clustermessages from the logs

P P

P

P P

P P

P PP P

P P

PP

1A. Bouteiller et al. “Correlated Set Coordination in Fault Tolerant MessageLogging Protocols”. Euro-Par’11.

2016 55

Page 87: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Hierarchical protocols1

The application processes are grouped in logical clusters

Failure-free execution

• Take coordinatedcheckpoints inside clustersperiodically

• Log inter-cluster messages

Recovery

• Restart the failed clusterfrom the last checkpoint

• Replay missing inter-clustermessages from the logs

P P

P

P P

P P

P PP P

P P

PP

1A. Bouteiller et al. “Correlated Set Coordination in Fault Tolerant MessageLogging Protocols”. Euro-Par’11.

2016 55

Page 88: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Hierarchical protocols1

The application processes are grouped in logical clusters

Failure-free execution

• Take coordinatedcheckpoints inside clustersperiodically

• Log inter-cluster messages

Recovery

• Restart the failed clusterfrom the last checkpoint

• Replay missing inter-clustermessages from the logs

P P

P

P P

P P

P PP P

P P

PP

1A. Bouteiller et al. “Correlated Set Coordination in Fault Tolerant MessageLogging Protocols”. Euro-Par’11.

2016 55

Page 89: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Hierarchical protocols1

The application processes are grouped in logical clusters

Failure-free execution

• Take coordinatedcheckpoints inside clustersperiodically

• Log inter-cluster messages

Recovery

• Restart the failed clusterfrom the last checkpoint

• Replay missing inter-clustermessages from the logs

P P

P

P P

P P

P PP P

P P

PP

1A. Bouteiller et al. “Correlated Set Coordination in Fault Tolerant MessageLogging Protocols”. Euro-Par’11.

2016 55

Page 90: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Hierarchical protocols1

The application processes are grouped in logical clusters

Failure-free execution

• Take coordinatedcheckpoints inside clustersperiodically

• Log inter-cluster messages

Recovery

• Restart the failed clusterfrom the last checkpoint

• Replay missing inter-clustermessages from the logs

P P

P

P P

P P

P PP P

P P

PP

1A. Bouteiller et al. “Correlated Set Coordination in Fault Tolerant MessageLogging Protocols”. Euro-Par’11.

2016 55

Page 91: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Hierarchical protocols

Advantages

• Reduced number of logged messagesI But the determinant of all messages should be logged1

• Only a subset of the processes restart after a failureI Failure containment2

1A. Bouteiller et al. “Correlated Set Coordination in Fault Tolerant MessageLogging Protocols”. Euro-Par’11.

2J. Chung et al. “Containment Domains: A Scalable, Efficient, and FlexibleResilience Scheme for Exascale Systems”. SuperComputing 2012.

2016 56

Page 92: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Hierarchical protocols

0

10

20

30

40

50

60

0 10 20 30 40 50 60

Re

ce

ive

r R

an

k

Sender Rank

MiniFE - 64 processes - Pb size: 200x200x200

0

500000

1e+06

1.5e+06

2e+06

2.5e+06

3e+06

3.5e+06

4e+06

4.5e+06

Am

ou

nt o

f D

ata

in

Byte

s

Good applicability to most HPC workloads1

• < 15% of logged messages

• < 15% of processes to restart after a failure1T. Ropars et al. “On the Use of Cluster-Based Partial Message Logging to

Improve Fault Tolerance for MPI HPC Applications”. Euro-Par’11.2016 57

Page 93: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Revisiting execution models1

Non-deterministic algorithm

• An algorithm A is non-deterministic is its execution path isinfluenced by non-deterministic events

• Assumption we have considered until now

Send-deterministic algorithm

• An algorithm A is send-deterministic, if for an initial state Σ,and for any process p, the sequence of send events on p is thesame in any valid execution of A.

• Most HPC applications are send-deterministic

1F. Cappello et al. “On Communication Determinism in Parallel HPCApplications”. ICCCN 2010.

2016 58

Page 94: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Revisiting execution models1

Non-deterministic algorithm

• An algorithm A is non-deterministic is its execution path isinfluenced by non-deterministic events

• Assumption we have considered until now

Send-deterministic algorithm

• An algorithm A is send-deterministic, if for an initial state Σ,and for any process p, the sequence of send events on p is thesame in any valid execution of A.

• Most HPC applications are send-deterministic

1F. Cappello et al. “On Communication Determinism in Parallel HPCApplications”. ICCCN 2010.

2016 58

Page 95: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Revisiting execution models1

Non-deterministic algorithm

• An algorithm A is non-deterministic is its execution path isinfluenced by non-deterministic events

• Assumption we have considered until now

Send-deterministic algorithm

• An algorithm A is send-deterministic, if for an initial state Σ,and for any process p, the sequence of send events on p is thesame in any valid execution of A.

• Most HPC applications are send-deterministic

1F. Cappello et al. “On Communication Determinism in Parallel HPCApplications”. ICCCN 2010.

2016 58

Page 96: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Impact of send-determinism

The relative order of the messages received by a process has noimpact on its execution.

p0

p1

p2

m1

m2

m3

It is possible to design an uncoordinated checkpointing protocolthat has no risk of domino effect1.

1A. Guermouche et al. “Uncoordinated Checkpointing Without DominoEffect for Send-Deterministic Message Passing Applications”. IPDPS2011.

2016 59

Page 97: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Impact of send-determinism

The relative order of the messages received by a process has noimpact on its execution.

p0

p1

p2

m1

m2

m3

It is possible to design an uncoordinated checkpointing protocolthat has no risk of domino effect1.

1A. Guermouche et al. “Uncoordinated Checkpointing Without DominoEffect for Send-Deterministic Message Passing Applications”. IPDPS2011.

2016 59

Page 98: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Revisiting message logging protocols1

For send-deterministic MPI applications that do not includeANY SOURCE receptions:

• Message logging does not need event logging

• Only logging the payload is required

• This result applies also to hierarchical protocols

For applications including ANY SOURCE receptions:

• Minor modifications of the code are required

1T. Ropars et al. “SPBC: Leveraging the Characteristics of MPI HPCApplications for Scalable Checkpointing”. SuperComputing 2013.

2016 60

Page 99: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Revisiting message logging protocols1

For send-deterministic MPI applications that do not includeANY SOURCE receptions:

• Message logging does not need event logging

• Only logging the payload is required

• This result applies also to hierarchical protocols

For applications including ANY SOURCE receptions:

• Minor modifications of the code are required

1T. Ropars et al. “SPBC: Leveraging the Characteristics of MPI HPCApplications for Scalable Checkpointing”. SuperComputing 2013.

2016 60

Page 100: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Alternatives torollback-recovery

Page 101: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Failure prediction1

Idea

• Online analysis of supercomputers system logs to predictfailures

I Coverage of 50%I Precision of 90%

• Take advantage of this information to take preventiveactions

I Save a checkpoint before the failure occurs

1M. S. Bouguerra et al. “Improving the Computing Efficiency of HPCSystems Using a Combination of Proactive and Preventive Checkpointing”.IPDPS’13.

2016 62

Page 102: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Active replication1

rank 0 rank 1

rank 2rank 3

P00 P1

0 P01 P1

1

P02 P1

2P03 P1

3

synch synch

synchsynch

P01

1K. Ferreira et al. “Evaluating the Viability of Process Replication Reliabilityfor Exascale Systems”. SuperComputing 2011.

2016 63

Page 103: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Active replication1

rank 0 rank 1

rank 2rank 3

P00 P1

0 P01 P1

1

P02 P1

2P03 P1

3

synch synch

synchsynch

P01

1K. Ferreira et al. “Evaluating the Viability of Process Replication Reliabilityfor Exascale Systems”. SuperComputing 2011.

2016 63

Page 104: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Active replication1

rank 0 rank 1

rank 2rank 3

P00 P1

0 P01 P1

1

P02 P1

2P03 P1

3

synch synch

synchsynch

P01

1K. Ferreira et al. “Evaluating the Viability of Process Replication Reliabilityfor Exascale Systems”. SuperComputing 2011.

2016 63

Page 105: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Active replication1

rank 0 rank 1

rank 2rank 3

P00 P1

0 P01 P1

1

P02 P1

2P03 P1

3

synch synch

synchsynch

P01

1K. Ferreira et al. “Evaluating the Viability of Process Replication Reliabilityfor Exascale Systems”. SuperComputing 2011.

2016 63

Page 106: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Active replication

In the crash failure model

• Minimum overhead: 50% (2 replicas of each process)I It is actually possible to do better!

• Failure management is transparent

• Synchronization: less than 5% for send-deterministicapplications

It could be of interest to deal with silent errors

2016 64

Page 107: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Algorithmic-based fault tolerance (ABFT)

Idea

• Introduce information redundancy in the dataI Maintain the redundancy during the computation

• In the event of a failure, reconstruct the lost data thanks tothe redundant information

• Complex but very efficient solutionI Minimal amount of replicated dataI No rollback

2016 65

Page 108: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

User-Level Failure Mitigation (ULFM)

Context: Evolution of the MPI standard (fault tolerance workinggroup)

Idea

• Make the middleware fault tolerantI The application continues to run after a crash

• Expose a set of functions to allow taking actions at the userlevel after a failure:

I Failure notificationsI Checking the status of componentsI Reconfiguring the application

2016 66

Page 109: Checkpointing HPC Applications - JLESC › downloads › 5th-workshop › jlesc...Checkpointing HPC Applications Thomas Ropars thomas.ropars@imag.fr Universit e Grenoble Alpes 2016

Conclusion

• Many solutions with different trade-offsI Reference: Survey by Elnozahy et al1

• A still active research topic

• Specific solutions are requiredI Adapted to extreme scale supercomputers and applications

1E. N. Elnozahy et al. “A Survey of Rollback-Recovery Protocols inMessage-Passing Systems”. ACM Computing Surveys 34.3 (2002),pp. 375–408.

2016 67