Salishan Workshop, April 28, 2015, Gleneden Beach, Oregon
On Numerical Resiliency in Numerical Linear Algebra Solvers
Luc Giraud
Joint work with E. Agullo (Inria), P. Salas (Sherbrooke Univ.), F. Yetkin (Inria) and M. Zounon (Inria)
HiePACS - Inria Project
Inria Bordeaux Sud-Ouest
Outline
1. Hard fault: Interpolation-Restart strategies
   - In Krylov subspace linear solvers
   - In revisited eigensolvers
2. Soft errors in Conjugate Gradient (preliminary)
L. Giraud - On Numerical Resilience in Linear Algebra 2 / 31
Hard fault: Interpolation-Restart strategies
Hard fault: Interpolation-Restart strategies In Krylov subspace linear solvers
Ax = b
Objective: design resilient algorithms for solving sparse linear systems
Two classes of iterative methods:
- Stationary methods (Jacobi, Gauss-Seidel, ...)
- Krylov subspace methods (CG, GMRES, Bi-CGStab, ...)
Krylov methods are widely used, but effort is required to take them to extreme scale
[Figure: the linear system Ax = b distributed across four processes P1 to P4. Legend: static data, dynamic data, lost data, interpolated data.]
We distinguish two categories of data:
- Static data
- Dynamic data
What will happen when P1 crashes?
- The failed process is replaced
- Static data are restored
- Reset: set x1 to its initial value and restart
- IR (Interpolation-Restart): interpolate x1 and restart
General assumptions
1. Large dimensional vectors and matrices are distributed
2. Scalars and low dimensional vectors and matrices are replicated
3. There is a system mechanism to report faults
4. Faulty process is replaced
5. Static data are restored
Interpolation methods
Fault in the linear system (the entries of x_1 are lost: how to interpolate x_1?)
\[
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
\]
Linear Interpolation (LI) [J. Langou et al., 2007]
Solve \( A_{11} x_1 = b_1 - A_{12} x_2 \); \( A_{11} \) must be nonsingular
Least Squares Interpolation (LSI)
\[
\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} x_1
+ \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} x_2
= \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
\]
\[
x_1 = \arg\min_{x}
\left\|
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
- \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} x_2
- \begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} x
\right\|_2
\]
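The two interpolation formulas can be sketched in a few lines of NumPy. This is a minimal dense, single-process illustration, not the authors' distributed implementation: the function names `li_interpolate` and `lsi_interpolate`, the index array `lost` (the entries owned by the failed process) and the small SPD test matrix are all assumptions made for the example.

```python
import numpy as np

def li_interpolate(A, b, x, lost):
    """Linear Interpolation: solve A11 x1 = b1 - A12 x2 for the lost block x1."""
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    A11 = A[np.ix_(lost, lost)]
    A12 = A[np.ix_(lost, kept)]
    x_new = x.copy()
    x_new[lost] = np.linalg.solve(A11, b[lost] - A12 @ x[kept])
    return x_new

def lsi_interpolate(A, b, x, lost):
    """Least Squares Interpolation: minimise ||b - A[:, kept] x2 - A[:, lost] x||_2."""
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    rhs = b - A[:, kept] @ x[kept]
    x1, *_ = np.linalg.lstsq(A[:, lost], rhs, rcond=None)
    x_new = x.copy()
    x_new[lost] = x1
    return x_new

# Hypothetical test problem: SPD matrix, so that A11 is nonsingular.
rng = np.random.default_rng(0)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
x_true = rng.standard_normal(n)
b = A @ x_true

lost = np.arange(2)            # entries held by the failed process "P1"
x_faulty = x_true.copy()
x_faulty[lost] = 0.0           # lost entries are overwritten

x_li = li_interpolate(A, b, x_faulty, lost)
x_lsi = lsi_interpolate(A, b, x_faulty, lost)
```

When the surviving block x2 is exact, as in this toy setup, both strategies recover x1 exactly; with a non-converged x2 they only return the best x1 in their respective sense.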
Main properties of Interpolation-Restart strategies
LI property: the LI strategy preserves the monotonic decrease of the A-norm of the forward error of CG or PCG.
LSI property: the LSI strategy preserves the monotonic decrease of the residual norm of minimal residual Krylov subspace methods such as GMRES and MinRES.
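A short numerical check makes both properties plausible. For symmetric positive definite A, the LI solve is the stationarity condition of the A-norm of the error with respect to x1 (x2 held fixed), and LSI minimises the residual norm over x1 by construction, so neither interpolated iterate can be worse than the iterate that was lost. The sketch below assumes a small dense SPD matrix and a perturbed vector `x_before` standing in for the pre-fault iterate; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 12
lost = np.arange(3)                 # entries lost with the failed process
kept = np.arange(3, n)

M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)         # SPD test matrix (assumption)
x_star = rng.standard_normal(n)     # exact solution
b = A @ x_star

def a_norm_error(x):
    e = x_star - x
    return float(np.sqrt(e @ A @ e))

def residual_norm(x):
    return float(np.linalg.norm(b - A @ x))

# Non-converged iterate standing in for the state just before the fault.
x_before = x_star + 0.1 * rng.standard_normal(n)

# LI: A11 x1 = b1 - A12 x2
x_li = x_before.copy()
x_li[lost] = np.linalg.solve(A[np.ix_(lost, lost)],
                             b[lost] - A[np.ix_(lost, kept)] @ x_before[kept])

# LSI: least squares over the lost block only
x_lsi = x_before.copy()
x1, *_ = np.linalg.lstsq(A[:, lost], b - A[:, kept] @ x_before[kept], rcond=None)
x_lsi[lost] = x1

print("A-norm error  before / after LI :", a_norm_error(x_before), a_norm_error(x_li))
print("residual norm before / after LSI:", residual_norm(x_before), residual_norm(x_lsi))
```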
Experimental results
Impact of lost data volume
[Figure: relative residual ||b - Ax|| / ||b|| (log scale, 1e-11 to 1) versus iteration (0 to 980) for GMRES(100) on Averous/epb3. First panel: NF curve. Following panels: 10 faults with 0.001%, 0.2%, 0.8% and 3% data loss, comparing Reset, LI, LSI and ER.]
Impact of fault rate
[Figure: relative residual ||b - Ax|| / ||b|| (log scale, 1e-11 to 1) versus iteration (0 to 1400) for GMRES(100) on Kim with 0.2% data loss. Panels: 4, 8, 17 and 40 faults, each comparing Reset, LI, LSI and ER.]
Penalty of restart strategy on PCG
[Figure: A-norm of the error (log scale, 1e-11 to 1) versus iteration (0 to 110) for PCG on Cunningham/qa8fm with 9 faults. Curves: Reset, LI, LSI, ER and NF.]
Penalty of restart strategy on GMRES
[Figure: relative residual ||b - Ax|| / ||b|| (log scale, 1e-11 to 1) versus iteration (0 to 1400) for preconditioned GMRES(100) on Kim1 with 8 faults. Curves: Reset, LI, LSI, ER and NF.]
Summary on resilient Krylov linear solvers
Key features of Interpolation-Restart strategies:
- They make sense numerically: key properties of CG and GMRES are preserved
- The restarting effect remains reasonable within the restarted GMRES context
- Robust even with a high fault rate and a high volume of lost data
- No fault implies no overhead
- Extended to multiple faults
Hard fault: Interpolation-Restart strategies In revisited eigensolvers
\( Au = \lambda u \) with \( u \neq 0 \), where \( A \in \mathbb{C}^{n \times n} \), \( u \in \mathbb{C}^{n} \), and \( \lambda \in \mathbb{C} \)
λ: eigenvalue
u: eigenvector
(λ, u): eigenpair
Two classes of methods:
- Fixed-point methods (Power Method, Subspace iteration)
- Subspace methods (Jacobi-Davidson, Arnoldi, IRA/Krylov-Schur)
Interpolation strategies
Fault in the eigenproblem (the entries of u_1 are lost; λ is replicated)
\[
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \end{pmatrix}
= \lambda
\begin{pmatrix} u_1 \\ u_2 \end{pmatrix}
\]
Linear Interpolation (LI)
\[
(A_{11} - \lambda I_1)\, u_1 = -A_{12} u_2
\]
Least Squares Interpolation (LSI)
\[
\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} u_1
+ \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} u_2
= \lambda \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}
\]
\[
u_1 = \arg\min_{u}
\left\|
\begin{pmatrix} A_{11} - \lambda I_1 \\ A_{21} \end{pmatrix} u
+ \begin{pmatrix} A_{12} \\ A_{22} - \lambda I_2 \end{pmatrix} u_2
\right\|_2
\]
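The eigenproblem variants differ from the linear-system ones only by the shift λ, which survives the fault because it is replicated. A minimal dense sketch follows, assuming a real symmetric test matrix and its largest eigenpair (chosen so that the shifted block A11 - λ I1 is generically nonsingular); the function names `li_eigen` and `lsi_eigen` are hypothetical.

```python
import numpy as np

def li_eigen(A, lam, u, lost):
    """LI for eigenvectors: solve (A11 - lam*I1) u1 = -A12 u2."""
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    A11 = A[np.ix_(lost, lost)]
    A12 = A[np.ix_(lost, kept)]
    u_new = u.copy()
    u_new[lost] = np.linalg.solve(A11 - lam * np.eye(len(lost)), -A12 @ u[kept])
    return u_new

def lsi_eigen(A, lam, u, lost):
    """LSI for eigenvectors: minimise ||(A - lam*I)[:, lost] u + (A - lam*I)[:, kept] u2||_2."""
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    shifted = A - lam * np.eye(A.shape[0])
    u1, *_ = np.linalg.lstsq(shifted[:, lost], -shifted[:, kept] @ u[kept], rcond=None)
    u_new = u.copy()
    u_new[lost] = u1
    return u_new

# Hypothetical test problem: symmetric matrix and its largest eigenpair.
rng = np.random.default_rng(2)
n = 10
S = rng.standard_normal((n, n))
A = (S + S.T) / 2
w, V = np.linalg.eigh(A)
lam, u = w[-1], V[:, -1]

lost = np.arange(2)
u_faulty = u.copy()
u_faulty[lost] = 0.0           # entries of u1 are lost; lam is still available

u_li = li_eigen(A, lam, u_faulty, lost)
u_lsi = lsi_eigen(A, lam, u_faulty, lost)
```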
Eigensolvers revisited
Basic subspace iteration method
Subspace iteration with polynomial acceleration
Arnoldi method
Implicitly Restarted Arnoldi method
Jacobi-Davidson method (JDQR)
Jacobi-Davidson at a glance: the JDQR algorithm
Compute a few (nev) eigenpairs associated with the eigenvalues close to a target τ
Perform a partial Schur decomposition \( A Z_{nconv} = Z_{nconv} T_{nconv} \)
The next Schur vectors are searched for in the search space \( V_k \)
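To make the notion of a partial Schur decomposition concrete, the sketch below builds one for the nev eigenvalues closest to a target τ. It is not JDQR: it simply takes eigenvectors from a dense eigensolver and orthonormalises them with QR, which yields an orthonormal Z and an upper triangular T with A Z = Z T. The helper name `partial_schur_near_target` and the test matrix are assumptions for illustration.

```python
import numpy as np

def partial_schur_near_target(A, nev, tau):
    """Return (Z, T) with A Z = Z T, Z orthonormal, T upper triangular,
    spanning the invariant subspace of the nev eigenvalues closest to tau."""
    lam, X = np.linalg.eig(A)
    order = np.argsort(np.abs(lam - tau))[:nev]
    # QR of the selected eigenvectors: nested invariant subspaces make T triangular.
    Z, _ = np.linalg.qr(X[:, order])
    T = Z.conj().T @ A @ Z
    return Z, T

# Hypothetical nonsymmetric test matrix with well separated eigenvalues.
rng = np.random.default_rng(3)
n = 10
A = np.diag(np.arange(1.0, n + 1)) + 0.02 * rng.standard_normal((n, n))

Z, T = partial_schur_near_target(A, nev=3, tau=4.2)
residual = np.linalg.norm(A @ Z - Z @ T)
print("||A Z - Z T|| =", residual)
print("diag(T) (eigenvalues closest to tau):", np.diag(T))
```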
Resilient Jacobi-Davidson algorithm
Converged Schur vectors: \( A Z_{nconv} = Z_{nconv} T_{nconv} \). Interpolate the nconv converged vectors.
Search space: \( C_k = V_k^H A V_k \). Interpolate a few (s) best candidates.
[Figure: ||Ax - λx|| / ||λ|| (log scale, 1e-07 to 1e+01) versus iteration (0 to 240). Curve: NF.]
JD is restarted with a search space of nconv + s vectors.
Flexibility to select the dimension of the search space for restarting.
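The restart step can be sketched as a Rayleigh-Ritz extraction followed by an orthonormalisation. The code below is a simplified illustration under stated assumptions, not the authors' implementation: "best candidates" is taken to mean the Ritz vectors of C_k whose Ritz values are closest to τ, the converged vectors Z and the search basis V are random stand-ins, and the interpolation of lost entries (for example by LSI) is left out.

```python
import numpy as np

def best_ritz_candidates(A, V, s, tau):
    """Rayleigh-Ritz on span(V): return the s Ritz pairs closest to tau."""
    C = V.conj().T @ A @ V                    # projected matrix C_k = V_k^H A V_k
    theta, Y = np.linalg.eig(C)
    order = np.argsort(np.abs(theta - tau))[:s]
    return theta[order], V @ Y[:, order]

def restart_basis(Z, candidates):
    """Orthonormal basis of the nconv converged vectors plus the s candidates."""
    Q, _ = np.linalg.qr(np.hstack([Z, candidates]))
    return Q

rng = np.random.default_rng(4)
n, k, nconv, s, tau = 20, 6, 2, 3, 0.0
S = rng.standard_normal((n, n))
A = (S + S.T) / 2                                        # symmetric test matrix (assumption)
Z, _ = np.linalg.qr(rng.standard_normal((n, nconv)))     # stand-in for converged Schur vectors
V, _ = np.linalg.qr(rng.standard_normal((n, k)))         # stand-in for the search space basis

theta, candidates = best_ritz_candidates(A, V, s, tau)
Q = restart_basis(Z, candidates)
print("restart dimension:", Q.shape[1])
```

The parameter s is the flexibility mentioned on the slide: a larger s keeps more of the search space across the fault at the price of more vectors to interpolate.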
Experimental framework
Thermoacoustic instabilities in combustion chambers [Images courtesy of CERFACS]
[Image: damaged rocket chamber from combustion instabilities]
[Image: simulation of combustion instabilities in annular chambers]
Compute eigenpairs associated with the smallest-magnitude eigenvalues lying on the periphery of the spectrum
Experimental results
[Figure: ||Ax - λx|| / ||λ|| (log scale, 1e-07 to 1e+01) versus iteration (0 to 240). Curves: NF and LSI. Panels:
- NF: convergence of 5 eigenpairs associated with the smallest-magnitude eigenvalues
- LSI (6 faults, plot annotations 0 0 0 2 2 4): restart with nconv vectors from converged Schur vectors combined with the 5 best candidates from the search space
- LSI (6 faults, plot annotations 0 0 0 2 3 4): impact of keeping the best Schur vector candidate in the search space after a fault]
Summary on resilient eigensolvers
Specific Interpolation-Restart strategies for each eigensolver:
- Basic subspace iteration method
- Subspace iteration with polynomial acceleration
- Arnoldi method
- IRAM
- Jacobi-Davidson
More flexibility in eigensolvers than in Krylov linear solvers
Soft errors in Conjugate Gradient (preliminary)
Aim of this study
Context: soft errors in Conjugate Gradient (CG)
Question 1: impact of soft errors on the convergence of CG
Question 2: reliability of numerical detection mechanisms?
Question 3: robust numerical recovery schemes?
Method to address question 1: experimental study with basic statistics
Fault injection
Transient fault injection
Parameters for Fault Injection
Parameter   Explanation
Type        Type of the Conjugate Gradient method [Classical, Chrono, Pipelined]
ID          Matrix ID [1 : 30]
Time        Fault injection iteration as a percentage [10%, ..., 90%]
k           Magnitude of fault injection [-16:2:16]
Location    Location of fault injection, 50 random
[A. T. Chronopoulos, C. W. Gear - JCAM, 1989]
[P. Ghysels, W. Vanroose - ParCo, 2014]
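A fault-injection experiment over this parameter grid could look like the sketch below, for classical CG only. The slides do not state the fault model, so several choices here are assumptions made purely for illustration: the fault is an additive perturbation of size 10**k on one entry of the matrix-vector product, the injection time is a percentage of the fault-free iteration count, and a run counts as "acceptable" when its true relative residual is below 1e-8. The function name `cg` and its parameters are hypothetical.

```python
import numpy as np

def cg(A, b, tol=1e-10, max_iter=500, fault_iter=None, index=0, k=0):
    """Classical CG. If fault_iter is set, add 10**k to entry `index` of A @ p
    at that iteration (hypothetical transient soft-error model)."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    b_norm = np.linalg.norm(b)
    for it in range(1, max_iter + 1):
        q = A @ p
        if it == fault_iter:
            q[index] += 10.0 ** k          # injected soft error (assumption)
        alpha = rs / (p @ q)
        x = x + alpha * p
        r = r - alpha * q
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * b_norm:
            return x, it
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter

def true_relative_residual(A, b, x):
    return float(np.linalg.norm(b - A @ x) / np.linalg.norm(b))

rng = np.random.default_rng(5)
n = 30
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                # SPD test matrix (assumption)
b = rng.standard_normal(n)

magnitudes = list(range(-16, 17, 2))       # k in [-16:2:16]
times = list(range(10, 100, 10))           # injection time in percent

x_nf, iters_nf = cg(A, b)                  # fault-free reference run
fault_iter = max(1, (50 * iters_nf) // 100)    # inject at 50% of the fault-free run

acceptable = {}
for k in magnitudes:
    location = int(rng.integers(n))        # one random location per magnitude
    x_f, _ = cg(A, b, fault_iter=fault_iter, index=location, k=k)
    acceptable[k] = bool(true_relative_residual(A, b, x_f) <= 1e-8)
print(acceptable)
```

Checking the true residual rather than the recurrence residual matters here: after a fault the two can drift apart, so the solver may report convergence on a wrong iterate.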
Protocol: correct convergence
Characterize effect of magnitude (k)
[Figure: acceptable cases (axis from 0 to 1) versus fault magnitude k (-16 to 16) for three variants of CG: Classical, Chrono/Gear, Pipeline.]
Characterize effect of fault injection time
[Figure: acceptable cases (axis ticks 0.25 to 0.75) versus fault injection time (10 to 90) for three variants of CG: Classical, Chrono/Gear, Pipeline.]
Summary
Qualitative observations: Classical CG and Chronopoulos/Gear CG are less sensitive to soft errors than Pipelined CG.
Future work:
- Extend the study to GMRES
- Detection based on checksum mechanisms to protect the sparse matrix-vector product
- Detection mechanisms based on round-off error analysis (additional benefit: insight on mixed precision calculation)
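As an illustration of the checksum idea listed under future work, the sketch below uses the classical column-sum checksum from algorithm-based fault tolerance: since 1^T (A v) = (1^T A) v, a precomputed row of column sums lets one scalar comparison validate a matrix-vector product. This is a generic technique, not a scheme taken from the slides; the dense matrix, the helper names and the tolerance are assumptions.

```python
import numpy as np

def make_checksums(A):
    """Precompute the column sums 1^T A and 1^T |A| (the latter scales the tolerance)."""
    return A.sum(axis=0), np.abs(A).sum(axis=0)

def matvec_is_consistent(y, v, col_sums, abs_col_sums, rtol=1e-10):
    """Check sum(y) against (1^T A) v up to a round-off sized tolerance."""
    scale = abs_col_sums @ np.abs(v)
    return bool(abs(y.sum() - col_sums @ v) <= rtol * scale)

rng = np.random.default_rng(6)
n = 50
A = rng.standard_normal((n, n))            # dense stand-in for a sparse matrix
v = rng.standard_normal(n)
col_sums, abs_col_sums = make_checksums(A)

y = A @ v
clean_ok = matvec_is_consistent(y, v, col_sums, abs_col_sums)

y_faulty = y.copy()
y_faulty[7] += 1e-3                        # simulated soft error in one output entry
faulty_ok = matvec_is_consistent(y_faulty, v, col_sums, abs_col_sums)

print("clean product accepted :", clean_ok)
print("faulty product accepted:", faulty_ok)
```

The tolerance sets the detection limit: perturbations below roughly rtol times the scale pass unnoticed, which ties this check to the round-off analysis mentioned in the last bullet.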
Acknowledgement for financial support:
- French ANR: RESCUE project
- European FP7: Exa2CT project
- G8: ECS project
http://hiepacs.bordeaux.inria.fr/
Thank you for your attention
Questions?