Draft
1/25
Introduction | Algorithm-Based Fault Tolerance | Model | Experiments | Conclusions
Combining Algorithm-Based Fault Tolerance
and Checkpointing for Iterative Solvers
Massimiliano Fasi
Advisors: Yves Robert and Bora Uçar
25 June 2014
Massimiliano Fasi Combining ABFT and Checkpointing for Iterative Solvers
1 Introduction
    Linear solvers
    Silent errors
2 Algorithm-Based Fault Tolerance
3 Model
4 Experiments
5 Conclusions
Selective reliability
High energy mode: reliable, energy wasting
Low energy mode: unreliable, energy efficient
[Figure: energy (low or high) against computational steps 1 to 9; the steps are marked as computation or validation]
The Conjugate Gradient Method
Ax = b
A ∈ R^{n×n}, x, b ∈ R^n
Remarks on line 5
  only matrix operation
  A is never modified
Require: A ∈ R^{n×n}, b, x_0 ∈ R^n, ε ∈ R
Ensure: x ∈ R^n : ‖Ax − b‖ ≤ ε
 1: r_0 ← b − A x_0;
 2: p_0 ← r_0;
 3: i ← 0;
 4: while ‖r_i‖ > ε (‖A‖ · ‖r_0‖ + ‖b‖) do
 5:   q_i ← A p_i;
 6:   α_i ← ‖r_i‖² / (p_iᵀ q_i);
 7:   x_{i+1} ← x_i + α_i p_i;
 8:   r_{i+1} ← r_i − α_i q_i;
 9:   β_i ← ‖r_{i+1}‖² / ‖r_i‖²;
10:   p_{i+1} ← r_{i+1} + β_i p_i;
11:   i ← i + 1;
12: end while
13: return x_i;
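The pseudocode above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used for the experiments: the dense list-of-lists storage, the infinity norm for ‖A‖, the 2-norm for vectors, the max_iter safeguard and all function names are assumptions introduced here.

```python
import math


def matvec(A, v):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, x0, eps=1e-12, max_iter=1000):
    """Conjugate gradient for a symmetric positive definite A.

    Returns the approximate solution and the number of iterations.
    """
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]  # line 1
    p = list(r)                                       # line 2
    rr = dot(r, r)                                    # squared norm of r_i
    # Assumption: infinity norm for A, 2-norm for the vectors.
    norm_A = max(sum(abs(a) for a in row) for row in A)
    threshold = eps * (norm_A * math.sqrt(rr) + math.sqrt(dot(b, b)))
    iterations = 0
    while math.sqrt(rr) > threshold and iterations < max_iter:  # line 4
        q = matvec(A, p)          # line 5: the only matrix operation
        alpha = rr / dot(p, q)    # line 6
        x = [xi + alpha * pi for xi, pi in zip(x, p)]  # line 7
        r = [ri - alpha * qi for ri, qi in zip(r, q)]  # line 8
        rr_new = dot(r, r)
        beta = rr_new / rr        # line 9
        p = [ri + beta * pi for ri, pi in zip(r, p)]   # line 10
        rr = rr_new
        iterations += 1           # line 11
    return x, iterations


# Small symmetric positive definite system chosen for illustration.
solution, steps = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]],
                                     [1.0, 2.0], [0.0, 0.0])
print(solution, steps)
```

Because A is only read in line 5 and never written, that product is the natural place to attach a checksum test.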
Fail-stop errors
Easy to detect
Easy to localize and characterize
Expensive to correct
Checkpointing
Well suited for fail-stop errors
Cheaper than restarting from scratch
Trade-off: the best checkpointing interval
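To make the trade-off concrete, here is a small sketch based on a classical first-order approximation (Young's formula) for fail-stop errors. It is not taken from the slides: the formula, the waste model and the names checkpoint_cost and mtbf are assumptions used only for illustration.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """First-order optimal work between checkpoints: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(interval, checkpoint_cost, mtbf):
    """Fraction of time lost: checkpoint overhead plus expected rework."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


# Hypothetical values: checkpoints cost 2 time units, one failure
# every 100 time units on average.
best = young_interval(2.0, 100.0)
print(best, first_order_waste(best, 2.0, 100.0))
```

Checkpointing more often than this pays too much overhead; checkpointing less often loses too much work per failure.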
Silent errors
Hard to detect
Hard to localize and characterize
Easy to correct (sometimes)
Checkpointing for silent errors
Is not always necessary
  the computation can continue
  small perturbations do not impact the solution
  iterative methods can compensate for some errors
Requires verification
  a validation mechanism has to be devised
  some overhead cannot be avoided
  finding a checkpointing interval becomes even more difficult
Silent error sources
A × x = y
Arithmetic operations
  bit flip of the result
Memory read
  in A
    bit flip in one entry
    horizontal shift
    vertical shift (1 row)
  in x
    bit flip in one entry
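A bit flip in one entry can be simulated as below, assuming the entries are IEEE 754 double-precision numbers. The helper name flip_bit is hypothetical and the sketch only illustrates how much the effect depends on which bit is hit.

```python
import struct


def flip_bit(value, bit):
    """Return value with one bit of its 64-bit IEEE 754 encoding flipped.

    Bit 0 is the least significant mantissa bit, bit 63 the sign bit.
    """
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


print(flip_bit(1.0, 0))   # last mantissa bit: a tiny perturbation
print(flip_bit(1.0, 52))  # lowest exponent bit: the value is halved
print(flip_bit(1.0, 63))  # sign bit: the value changes sign
```

A flip in a low mantissa bit is the kind of small perturbation an iterative method may absorb, whereas a flip in the exponent or sign changes the entry drastically.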
No error
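[Figure: worked checksum example. An 8×8 sparse matrix A carries an extra checksum row cᵀA = (-1 0 1 7 0 6 0 11), with c = (1 1 ... 1). With x = (1 1 1 1 1 1 1 1)ᵀ the product is y = Ax = (1 2 2 1 5 3 4 6)ᵀ. Both checksums equal 24: (cᵀA) x = 24 and cᵀ y = 24.]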
Error in the computation
[Figure: same example with an error in the computation of y. The sixth entry of y comes out as 5 instead of 3, so y = (1 2 2 1 5 5 4 6)ᵀ. The checksums disagree: (cᵀA) x = 24 but cᵀ y = 26, so the error is detected.]
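The detection test of this example can be replayed with the numbers shown on the slide. The sketch below uses only the checksum row, x and the two versions of y; the function name checksum_test is hypothetical, and an exact comparison is adequate only because all values here are small integers.

```python
def dot(u, v):
    total = 0
    for a, b in zip(u, v):
        total += a * b
    return total


def checksum_test(checksum_row, x, y):
    """Compare (c^T A) x with c^T y for c = (1, ..., 1)."""
    expected = dot(checksum_row, x)  # (c^T A) x
    observed = sum(y)                # c^T y
    return expected, observed, expected == observed


checksum_row = [-1, 0, 1, 7, 0, 6, 0, 11]  # c^T A from the slide
x = [1] * 8
y_clean = [1, 2, 2, 1, 5, 3, 4, 6]
y_faulty = [1, 2, 2, 1, 5, 5, 4, 6]        # sixth entry corrupted

print(checksum_test(checksum_row, x, y_clean))
print(checksum_test(checksum_row, x, y_faulty))
```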
Error in x
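[Figure: same example with an error in x. The first entry of x is read as 2 instead of 1, which gives y = (2 2 2 1 5 3 4 4)ᵀ. The checksums evaluate to 23 instead of the error-free value 24: the first column of A sums to -1, so the error changes the checksum and is detected.]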
Error in x
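[Figure: same example with a different error in x. The second entry of x is read as 2 instead of 1, which gives y = (1 4 2 0 5 2 4 6)ᵀ. All checksums still equal 24: the second column of A sums to 0, so the error leaves the checksum unchanged and goes undetected.]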
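The two "Error in x" cases differ only in the column checksum of the corrupted entry: perturbing x_j by δ changes cᵀ(Ax) by (cᵀA)_j δ. The sketch below checks this with the vectors shown on the slides; the variable names are assumptions.

```python
checksum_row = [-1, 0, 1, 7, 0, 6, 0, 11]  # c^T A, with c = (1, ..., 1)
y_clean = [1, 2, 2, 1, 5, 3, 4, 6]         # y for x = (1, ..., 1)

# Results shown on the slides when one entry of x is read as 2 instead of 1.
faulty_results = {
    0: [2, 2, 2, 1, 5, 3, 4, 4],  # error in the first entry of x
    1: [1, 4, 2, 0, 5, 2, 4, 6],  # error in the second entry of x
}
delta = 1  # size of the perturbation in x_j


def checksum_change(y_faulty, y_reference):
    """Change in c^T y caused by the error."""
    return sum(y_faulty) - sum(y_reference)


changes = {}
for j, y_faulty in faulty_results.items():
    changes[j] = checksum_change(y_faulty, y_clean)
    detected = changes[j] != 0
    print(j, changes[j], checksum_row[j] * delta, detected)
```

A zero column checksum therefore hides any error in the matching entry of x, which is the issue the next slide addresses.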
How to overcome that issue
Random weight vector
Checksum shifting
Matrix splitting
Hierarchical partitioning
(cᵀA) x = cᵀ (Ax)
c = (1 1 1 ... 1)
In general: c = (c1 c2 c3 ... cn)
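A random weight vector makes a zero column checksum unlikely. The sketch below uses a hypothetical 3×3 matrix (not the one from the slides) whose first two columns sum to zero, and weights drawn uniformly from [1, 2); both the matrix and the sampling range are assumptions.

```python
import random


def weighted_column_checksums(A, c):
    """Return c^T A for a dense matrix stored as a list of rows."""
    n_rows, n_cols = len(A), len(A[0])
    return [sum(c[i] * A[i][j] for i in range(n_rows)) for j in range(n_cols)]


# Hypothetical matrix: columns 0 and 1 sum to zero.
A = [[1.0, 2.0, 0.0],
     [-1.0, 3.0, 1.0],
     [0.0, -5.0, 2.0]]

ones = [1.0, 1.0, 1.0]
rng = random.Random(0)  # seeded so the run is reproducible
c_random = [1.0 + rng.random() for _ in A]

plain = weighted_column_checksums(A, ones)
weighted = weighted_column_checksums(A, c_random)
print(plain)
print(weighted)
```

With c = (1 1 1) an error in the first or second entry of x would be invisible; with the random weights every column checksum is nonzero, so any single-entry error in x shifts the checksum.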
Checksum shifting
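[Figure: the undetected case revisited with checksum shifting. The shift row (2 2 2 2 2 2 2 2) is added to the checksum row (-1 0 1 7 0 6 0 11), giving the shifted checksum row (1 2 3 9 2 8 2 13), which has no zero entry. With the second entry of x read as 2 instead of 1, y = (1 4 2 0 5 2 4 6)ᵀ. The shifted checksum evaluates to 42 instead of the error-free value 40, so the error is detected. The figure also shows the value 18, which matches the shift row applied to the corrupted x.]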
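The effect of the shift can be verified with the rows shown on the slide. The sketch takes the shift value 2 from the figure; how that value is chosen in general is not stated here, so the code does not try to derive it.

```python
def dot(u, v):
    total = 0
    for a, b in zip(u, v):
        total += a * b
    return total


checksum_row = [-1, 0, 1, 7, 0, 6, 0, 11]  # c^T A, second entry is zero
shift = 2                                  # value shown on the slide
shifted_row = [w + shift for w in checksum_row]

x_clean = [1] * 8
x_faulty = [1, 2, 1, 1, 1, 1, 1, 1]        # second entry corrupted

plain_clean = dot(checksum_row, x_clean)
plain_faulty = dot(checksum_row, x_faulty)
shifted_clean = dot(shifted_row, x_clean)
shifted_faulty = dot(shifted_row, x_faulty)

print(shifted_row)
print(plain_clean, plain_faulty)      # equal: the plain checksum misses the error
print(shifted_clean, shifted_faulty)  # different: the shifted checksum catches it
```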
Summary of ABFT results
                          checksum computation   SpMxV overhead
single error detection    ∼ nnz                  ∼ 4n
k errors detection        ∼ k nnz                ∼ 4kn
single error correction   ∼ 2 nnz                ∼ 8n
k errors correction       ?                      ?
Table: ABFT techniques
Not all that seems so is an error
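Theorem
Let A ∈ R^{n×n}, x ∈ R^n, c ∈ R^n. Then, if all of the sums involved in the matrix operations are performed using some flavour of recursive summation, it holds that
\[ \left| \mathrm{fl}\bigl((c^{T} A)\, x\bigr) - \mathrm{fl}\bigl(c^{T} (A x)\bigr) \right| \le 2\, \gamma_{2n}\, |c^{T}|\, |A|\, |x| . \]
In terms of norms, the bound reads
\[ \left| \mathrm{fl}\bigl((c^{T} A)\, x\bigr) - \mathrm{fl}\bigl(c^{T} (A x)\bigr) \right| \le 2\, \gamma_{2n}\, n\, \|c^{T}\|_{\infty}\, \|A\|_{1}\, \|x\|_{\infty} . \]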
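The bound can be used as a tolerance when comparing the two checksums, so that rounding differences are not reported as errors. The following is a minimal sketch, not the implementation used in the talk. It assumes dense lists of Python floats (IEEE double precision), the standard definition γ_m = m·u / (1 − m·u) with unit roundoff u = 2^-53, and a hypothetical helper name `checksum_gap_and_bound`.

```python
import sys

# Unit roundoff for IEEE double precision (assumption: Python floats are doubles).
U = sys.float_info.epsilon / 2


def gamma(m, u=U):
    """Standard rounding-error constant gamma_m = m*u / (1 - m*u)."""
    return m * u / (1 - m * u)


def dot(a, b):
    """Dot product with strict left-to-right (recursive) summation.

    An explicit loop is used instead of sum(), so the additions are
    performed one after the other, as the theorem assumes.
    """
    acc = 0.0
    for ai, bi in zip(a, b):
        acc += ai * bi
    return acc


def checksum_gap_and_bound(A, x, c):
    """Return (|fl((c^T A) x) - fl(c^T (A x))|, 2 * gamma_2n * |c^T| |A| |x|)."""
    n = len(x)
    columns = [[A[i][j] for i in range(n)] for j in range(n)]
    # (c^T A) x: column checksums first, then a dot product with x.
    cA = [dot(c, col) for col in columns]
    left = dot(cA, x)
    # c^T (A x): matrix-vector product first, then its checksum.
    Ax = [dot(row, x) for row in A]
    right = dot(c, Ax)
    # Componentwise quantity |c^T| |A| |x| appearing in the bound.
    abs_c = [abs(v) for v in c]
    abs_x = [abs(v) for v in x]
    abs_Ax = [dot([abs(v) for v in row], abs_x) for row in A]
    bound = 2 * gamma(2 * n) * dot(abs_c, abs_Ax)
    return abs(left - right), bound
```

A checksum mismatch would then be treated as a silent error only when the gap exceeds the bound; a gap below it is compatible with ordinary rounding.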
Preliminaries
Why combine
checkpointing (CP) needs a verification mechanism
ABFT's worst case could require restarting from scratch
Why a trade-off
CP interval depends on the probability of uncorrectable errors
per-iteration overhead depends on the kind of ABFT protection
Goal: minimize the expected global execution time
Idea: minimize the expected overhead (ABFT and CP) of a frame
Expected execution time
p = correctable error probability
s = checkpoint interval
k = number of correctable errors
\[ E(T_s) = p\, s\, T_{\mathrm{iter}} + (1 - p) \bigl( E(T_{\mathrm{lost}}) + T_{\mathrm{recovery}} + E(T_s) \bigr) \]
With E(T_lost) replaced by ((s + 1)/2) T_iter:
\[ E(T_s) = p\, s\, T_{\mathrm{iter}} + (1 - p) \Bigl( \frac{s + 1}{2}\, T_{\mathrm{iter}} + T_{\mathrm{recovery}} + E(T_s) \Bigr) \]
Making the dependence on k explicit:
\[ E\bigl(T_s^{(k)}\bigr) = p_k\, s\, T_{\mathrm{iter}}^{(k)} + (1 - p_k) \Bigl( \frac{s + 1}{2}\, T_{\mathrm{iter}}^{(k)} + T_{\mathrm{recovery}} + E\bigl(T_s^{(k)}\bigr) \Bigr) \]
\[ p_k = \sum_{i=0}^{k} q_i^{(k)}\bigl(s\, T_{\mathrm{iter}}^{(k)}\bigr), \qquad q_\ell^{(k)}(T) = \binom{M}{\ell} \bigl(1 - e^{-\lambda T}\bigr)^{\ell} e^{-\lambda T (M - \ell)} \]
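Since E(T_s^(k)) appears on both sides of the recursion, it can be solved for directly: for p_k > 0, E(T_s^(k)) = s T_iter^(k) + ((1 − p_k)/p_k) · (((s + 1)/2) T_iter^(k) + T_recovery). The sketch below evaluates this closed form. The function names are hypothetical, and `M` and `lam` (λ) are simply passed in as the parameters of the slide formula, since their values are not given in this excerpt.

```python
from math import comb, exp


def q(l, T, M, lam):
    """Term q_l(T) = C(M, l) * (1 - e^{-lam*T})^l * e^{-lam*T*(M - l)}.

    Note: for very large M the binomial coefficient may be too large to
    convert to a float; this sketch is meant for moderate values.
    """
    hit = 1.0 - exp(-lam * T)
    return comb(M, l) * hit**l * exp(-lam * T * (M - l))


def p_k(k, s, T_iter, M, lam):
    """p_k = sum_{i=0}^{k} q_i(s * T_iter)."""
    return sum(q(i, s * T_iter, M, lam) for i in range(k + 1))


def expected_frame_time(k, s, T_iter, T_recovery, M, lam):
    """Closed form of the recursion for E(T_s^(k)); requires p_k > 0."""
    p = p_k(k, s, T_iter, M, lam)
    lost = (s + 1) / 2 * T_iter  # expected time lost, as on the slide
    return s * T_iter + (1 - p) / p * (lost + T_recovery)
```

For example, with `lam = 0` no error ever occurs, p_k equals 1, and the expected time reduces to `s * T_iter`.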
A probabilistic model
Model
The checkpoint interval that minimizes the expected wasted time is
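\[ s = \operatorname*{argmin}_{s \in \mathbb{N}} \frac{E\bigl(T_s^{(k)}\bigr) - s\, T_{\mathrm{iter}}^{(k)} + T_{\mathrm{checkpoint}}}{s\, T_{\mathrm{iter}}^{(k)}} . \]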
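One simple way to evaluate this argmin is an exhaustive search over a finite range of intervals. The sketch below is only an illustration under stated assumptions: it uses the closed form E(T_s^(k)) = s T_iter^(k) + ((1 − p_k)/p_k)(((s + 1)/2) T_iter^(k) + T_recovery) obtained by solving the recursion for the expected time, it truncates the search at a hypothetical `s_max`, and all function names and parameter values are illustrative rather than taken from the talk.

```python
from math import comb, exp, inf


def expected_frame_time(k, s, T_iter, T_recovery, M, lam):
    """Closed-form expected time of a frame of s iterations."""
    T = s * T_iter
    hit = 1.0 - exp(-lam * T)
    # p_k: sum of the binomial-type terms q_i(T) for i = 0..k.
    p = sum(comb(M, i) * hit**i * exp(-lam * T * (M - i)) for i in range(k + 1))
    if p == 0.0:
        return inf  # the frame essentially never completes
    return T + (1 - p) / p * ((s + 1) / 2 * T_iter + T_recovery)


def waste(k, s, T_iter, T_recovery, T_checkpoint, M, lam):
    """Objective of the argmin: expected overhead per unit of useful work."""
    useful = s * T_iter
    expected = expected_frame_time(k, s, T_iter, T_recovery, M, lam)
    return (expected - useful + T_checkpoint) / useful


def best_interval(k, T_iter, T_recovery, T_checkpoint, M, lam, s_max=1000):
    """Brute-force search for the minimizing interval in 1..s_max (assumed cap)."""
    return min(
        range(1, s_max + 1),
        key=lambda s: waste(k, s, T_iter, T_recovery, T_checkpoint, M, lam),
    )
```

Comparing the minimal value of `waste` for different k, each with its own per-iteration cost `T_iter`, is one way to explore the trade-off between ABFT protection and checkpoint frequency.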
Test problems
            n       nnz(A)   κ(A)          Convergence
BCSSTK09    1083    18437    3.10173e+04   linear
P3D         27000   183600   6.45723e+02   quadratic
THERMAL1    82654   574458   4.96250e+05   sublinear
[From similar studies]
Empirical validation
Figure: Execution time vs checkpoint interval (horizontal axis from 0 to 100). The expected execution time (continuous line) is compared with the experimentally obtained one (circles), for both CP + ABFT detection (top) and CP + ABFT correction (bottom).
Experimental comparison
Figure: Execution time vs reciprocal of the normalized fault rate (horizontal axis from 10^1 to 10^5) for both plain checkpointing (CG-1D) and mixed strategy (CG-2D1C).
            min        max
BCSSTK09    -2.23 %    12.78 %
P3D         -0.60 %    26.76 %
THERMAL1    -0.08 %    40.44 %
Table: Relative gain of CG-2D1C with respect to CG-1D.
Summary
silent errors are treacherous
checkpointing needs a verification mechanism
detecting ABFT is cheap and reliable
correcting ABFT can improve checkpointing's performance
a trade-off can be established
the same analysis holds for other iterative linear solvers
Future work
General ABFT improvements
extend error correction capabilities for matrix representations
extension to other matrix operations
develop accurate estimates for floating-point errors
Other applications of the ABFT/checkpointing solution
Preconditioned Conjugate Gradient
ABFT for dense iterative methods