Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations...
Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM
Dirk Pflüger
Simulation of Large Systems, IPVS/SimTech, Universität Stuttgart
(joint work with M. Heene, A. Hinojosa)
February 2, 2017
Dirk Pflüger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 1
PDE: Turbulence simulations of hot fusion plasmas
[source: ITER project]
Idea: new, CO2-free source of energy for the generations to come
EXAHD with H.-J. Bungartz (TUM), M. Griebel (Bonn), T. Dannert (RZG), F. Jenko (UCLA)
Practically Unlimited Resources
Contents:
Deuterium in a bathtub full of water and lithium in a used laptop battery suffice for a family for over 50 years
Behind the Scenes
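Dilute/hot plasmas are (almost) collisionless
Not magneto-hydrodynamic, but kinetic description (Vlasov):

$$\left[ \frac{\partial}{\partial t} + \vec{v} \cdot \frac{\partial}{\partial \vec{x}} + \frac{q}{m} \left( \mathbf{E} + \frac{\vec{v}}{c} \times \mathbf{B} \right) \cdot \frac{\partial}{\partial \vec{v}} \right] f(\vec{x}, \vec{v}, t) = 0$$

Distribution function $f(\vec{x}, \vec{v}, t)$
6D in state space
Coupled to Maxwell equations
Gyrokinetics: remove fast gyromotion (smallest scale)

$$\left[ \frac{\partial}{\partial t} + \vec{v} \cdot \frac{\partial}{\partial \vec{x}} + F \frac{\partial}{\partial v_\parallel} \right] f(\vec{x}, v_\parallel, \mu, t) = \Delta(f)$$

5D
$\vec{v}$ and $F$ are complex expressions, contain evaluation of $\mathbf{E}$ and $\mathbf{B}$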
Numerical Simulations for Actual Tokamaks with GENE
Aim: global simulations of ITER
http://www.genecode.org
Goal: global simulation with physical realism
Scenario for simulation of “numerical ITER”
Global, non-linear runs
At least $10^{11}$ grid points, $10^{6}$ time steps
>1 TB just to store a single result in memory (complex)
Possible at all?
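The memory figure above can be checked with a quick estimate. The sketch below assumes double-precision complex values (16 bytes each) and decimal terabytes; neither is stated on the slide, and `result_size_terabytes` is a name chosen here for illustration.

```python
# Back-of-the-envelope size of one stored result on the full grid.
# Assumption (not from the slide): each value is a double-precision
# complex number, i.e. 16 bytes, and 1 TB = 10**12 bytes.
GRID_POINTS = 10**11
BYTES_PER_COMPLEX = 16


def result_size_terabytes(points: int, bytes_per_value: int = BYTES_PER_COMPLEX) -> float:
    """Return the size of one full-grid result in terabytes."""
    return points * bytes_per_value / 10**12


print(result_size_terabytes(GRID_POINTS))  # 1.6
```

Under these assumptions a single snapshot already takes about 1.6 TB, before any of the $10^{6}$ time steps are taken into account.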
Sparse Grids – Hierarchical Approach
High-dimensional problems suffer from the “curse of dimensionality”
Effort $O((2^n)^d)$ ⇒ too Big Data
Therefore: hierarchical discretization
Sparse grids: $O(2^n \cdot n^{d-1})$ [Zenger 91]
Makes high-dimensional discretizations possible

                 full grid      sparse grid    sg combination technique
5d, level 10     $> 10^{15}$    25,416,705     1,876 × 82,000

[Figure: combination scheme over levels l1, l2 = 1..4, with component grids added (+) and subtracted (–)]

Combination technique (multivariate extrapolation-style scheme)
Multiple, but smaller grids: $O(d \cdot n^{d-1})$ problems of size $O(2^n)$
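The grid counts quoted for the 5d, level 10 case can be reproduced with a few lines. This is a sketch under the usual convention for the standard combination technique (level vectors with $l_i \ge 1$ and $|l|_1 = n + (d-1) - q$ for $q = 0, \dots, d-1$), which the slide does not spell out; the function names are illustrative.

```python
from math import comb


def num_component_grids(d: int, n: int) -> int:
    """Count the component grids of the standard combination technique.

    Assumed convention: level vectors l with l_i >= 1 and
    |l|_1 = n + (d - 1) - q for q = 0, ..., d - 1. The number of such
    vectors with |l|_1 = m is comb(m - 1, d - 1).
    """
    return sum(comb(n + d - 2 - q, d - 1) for q in range(d))


def full_grid_points(d: int, n: int) -> int:
    """Points of an isotropic full grid with 2**n points per dimension."""
    return (2 ** n) ** d


print(num_component_grids(5, 10))         # 1876
print(full_grid_points(5, 10) > 10**15)   # True
```

With this convention the count matches the 1,876 component grids in the table, and the full grid has $2^{50}$ points, which is indeed above $10^{15}$. The sparse grid point count depends on the boundary treatment and is not recomputed here.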
Sparse Grid vs. Combination Technique
[Figure: two level diagrams over l1 and l2 (levels 1 to 4), one for the sparse grid and one for the combination technique, where component grids are added (+) or subtracted (–)]
Overview
1 Motivation and Numerics
2 Scalability
3 Algorithm-Based Fault Tolerance
  Hard Faults
  Silent/Soft Faults
4 Summary
Scalability
Problem of standard solver: global communication within each time-step
Use hierarchical ansatz
Two-level approach
Numerics: decoupling into locally coupled problems
Algorithms: second level of parallelism
First level: no need to scale to exascale
Time-Dependent PDEs
[Figure: time loop alternating "compute" phases on the individual component grids with "combine" steps into the sparse grid]
Gather-scatter steps every time-interval
Remaining reduced global communication
Global Communication
Optimal communication schemes
[Figure: communication scheme. In each process group, each component grid (distributed full grid) is hierarchized (distributed hierarchized full grid) and added to the distributed sparse grid; a global reduce follows (global communication); each component grid is then extracted from the distributed sparse grid and dehierarchized back to a distributed full grid]
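The hierarchize and dehierarchize steps named in the scheme can be illustrated in one dimension. The sketch below is a textbook-style version for piecewise-linear hat functions with zero boundary values, not the distributed implementation from the talk; the function names and the list-based layout are assumptions made here.

```python
def hierarchize_1d(nodal, level):
    """Turn nodal values at x = k / 2**level, k = 1 .. 2**level - 1,
    into hierarchical surpluses (zero boundary values assumed)."""
    n_pts = 2 ** level
    assert len(nodal) == n_pts - 1
    u = [0.0] + list(nodal) + [0.0]      # pad with the zero boundary
    for lev in range(level, 0, -1):      # finest level first
        stride = 2 ** (level - lev)
        for k in range(stride, n_pts, 2 * stride):
            # surplus = value minus the mean of the two coarser neighbours
            u[k] -= 0.5 * (u[k - stride] + u[k + stride])
    return u[1:-1]


def dehierarchize_1d(surplus, level):
    """Inverse of hierarchize_1d: recover nodal values from surpluses."""
    n_pts = 2 ** level
    assert len(surplus) == n_pts - 1
    u = [0.0] + list(surplus) + [0.0]
    for lev in range(1, level + 1):      # coarsest level first
        stride = 2 ** (level - lev)
        for k in range(stride, n_pts, 2 * stride):
            u[k] += 0.5 * (u[k - stride] + u[k + stride])
    return u[1:-1]


# Level-1 hat function 1 - |2x - 1| sampled on a level-3 grid.
hat = [0.25, 0.5, 0.75, 1.0, 0.75, 0.5, 0.25]
print(hierarchize_1d(hat, 3))  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

In the hierarchical basis the sampled hat function collapses to a single non-zero surplus. Component grids of different resolution share such coefficients, which is what lets them be added into one sparse grid.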
Minimize number of communications (Range Query Trees):
$O(\log(d\,n^{d-1})) \times O(2^n n^{d-1})$
Minimize package size:
$O(2^n \cdot n^{d-1}) \times O(2^{n-1})$
Derivation in BSP/PEM model
[Figure: time [s] (log scale, 1e-05 to 100) vs. sparse grid level n (3 to 25) on Hermit, d = 3, no boundary, min. level fixed; curves: SG Red., Subsp. Red., Non-block. Subsp. Red., Parallel Subsp. Red., Non-block. Parallel Subsp. Red.]
[joint work with R. Jacob (ITU, Algorithm Engineering)]
Runtimes on Hazel Hen
[Figure: three log-log panels of runtime [s] vs. total #processes for hierarchization, local reduction and global reduction, each with curves for nprocs 1024, 2048, 4096 and 8192; the global reduction panel spans 4096 to 180224 processes]
Total time
[Figure: runtime [s] vs. total #processes (4096 to 180224) for hierarchization + loc. reduction + glob. reduction, with curves for nprocs 1024, 2048, 4096 and 8192, shown against GENE 1 time step]
Resilience for the Exa-Age
Ever decreasing mean time between failures
Massive replication of hardware
Smaller scales (higher integration)
Hardware possibly with fewer checks
...
Two categories:
1 Hard faults
2 Soft/silent faults
Hard Faults
Errors that trigger signals to the user
  Node, OS, network or process failure
  Software crashes
⇒ Default MPI response: abort application
Solutions
Recompute (checkpoint-restart)
  Checkpoint on HD / RAM
  Lossless
  Expensive storage/communication operations
  Restart even more expensive
Continue w/o recomputation
  Requires adapted numerical schemes
  No/minor extra computational effort
  Lossy
⇒ algorithm-based fault-tolerance (ABFT)
Communication Scheme
Master-worker model
Software Stack
Fault simulation layer
Implements interface of ULFM
plus kill_me() functionality
Selective Reliability
Focus on critical parts
Algorithm: The Combination Technique in Parallel
for all combination grids Ω_i do in parallel
    u_i ← u(x, t = 0)                  // Set initial conditions
while not converged do
    for all combination grids Ω_i do in parallel
        u_i ← solver(u_i, N_t)         // Solve the PDE on grid Ω_i (N_t time steps)
    mitigateFaults()                   // Mitigate faults
    u_n^(c) ← reduce(c_i u_i)          // Combine solutions
    for all i ∈ I_{n,q,τ} do
        u_i ← scatter(u_n^(c))         // Sample each u_i from new u_n^(c)
2D Example
[Figure: 2D combination scheme over two level axes, levels 1 to 6]
ABFT: Fault-Tolerant Combination Technique
Find alternative combination, exclude missing solutions
Starting point: standard CT coefficients

$$u^{c}_{\vec{n}}(\vec{x}) = \sum_{q=0}^{d-1} (-1)^q \binom{d-1}{q} \sum_{\vec{l} \in I_{\vec{n},q}} u_{\vec{l}}(\vec{x})$$

In case of failure: use inclusion-exclusion principle to determine adapted combination
1 Solve generalized coefficient problem (GCP):

$$\max_{w} Q'(w), \qquad Q'(w) := \sum_{l \in I\downarrow} 4^{-\|l\|_1} w_l, \qquad \text{s.t. } w_l \in \{0, 1\} \ \ \forall l \in I\downarrow$$

2 Obtain new combination coefficients:

$$c_l = (M^{-1} w)_l$$

Extra computations only on lower scales required
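A small 2D example shows how the inclusion-exclusion principle yields adapted coefficients. The sketch assumes that $M$ relates coefficients and indicators through $w_l = \sum_{k \ge l} c_k$, so that $M^{-1}$ reduces to the sum $c_l = \sum_{z \in \{0,1\}^d} (-1)^{|z|} w_{l+z}$; this reading and the helper names are assumptions, not taken from the slides. The optimization over $w$ is skipped: the lost grid is a maximal element, so dropping it already leaves a downward-closed index set.

```python
from itertools import product


def downset(n, d=2):
    """All level vectors l with l_i >= 1 and |l|_1 <= n + d - 1."""
    return {l for l in product(range(1, n + 1), repeat=d) if sum(l) <= n + d - 1}


def combination_coefficients(index_set, d=2):
    """Inclusion-exclusion coefficients for a downward-closed index set.

    c_l = sum over z in {0,1}^d of (-1)^|z| * [l + z in index_set];
    grids with coefficient 0 are left out of the result.
    """
    coeffs = {}
    for l in index_set:
        c = 0
        for z in product((0, 1), repeat=d):
            shifted = tuple(li + zi for li, zi in zip(l, z))
            if shifted in index_set:
                c += (-1) ** sum(z)
        if c != 0:
            coeffs[l] = c
    return coeffs


full = downset(4)
standard = combination_coefficients(full)
# Hypothetical fault: the solution on grid (2, 3) is lost.
recovered = combination_coefficients(full - {(2, 3)})

print(sorted(standard.items()))
print(sorted(recovered.items()))
```

Without faults this reproduces the standard scheme (+1 on the four grids with $|l|_1 = 5$, −1 on the three with $|l|_1 = 4$). After losing grid (2, 3), the adapted scheme uses (1, 4), (3, 2), (4, 1) with +1 and (3, 1), (1, 2) with −1, so the only additional solution needed lives on the coarser grid (1, 2).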
GCP: 2D Example
[Figure: 2D index-set diagrams (axes i1 and i2, levels 1 to 6) illustrating the GCP]
GCP: Example 3D
No faults
2 faults
For arbitrary faults: GCP prohibitively expensive
Fast solution possible if enough extra grid layers available
Only fraction of computational effort ⇒ faults in lower layers unlikely
Precompute some extra layers in advance
[Figure: bar chart of the number of grids (#grids) per layer q = 0, ..., 4, with bar values 21, 15, 10, 6, 3; legend: CT, Extra]
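The layer sizes in the chart can be reproduced by enumerating level vectors. The following is a minimal sketch under stated assumptions, not the setup used in the talk: it assumes d = 3, a minimum level of 1 in every dimension, layers defined by |k|_1 = 5 - q for the offset k ≥ 0 from the minimum level, and 2^(l_i) points per dimension (as on a periodic grid). The names `layer_offsets` and `points` are hypothetical.

```python
from itertools import product

D = 3        # dimension (assumed)
TOP = 5      # |k|_1 of the finest layer q = 0 (assumed; reproduces 21 grids)
L_MIN = 1    # minimum level per dimension (assumed)

def layer_offsets(q):
    """All offsets k >= 0 from the minimum level with |k|_1 = TOP - q."""
    s = TOP - q
    return [k for k in product(range(s + 1), repeat=D) if sum(k) == s]

def points(k):
    """Number of grid points of the component grid with levels L_MIN + k_i."""
    n = 1
    for ki in k:
        n *= 2 ** (L_MIN + ki)
    return n

# Number of grids and total number of grid points per layer q = 0..4.
counts = [len(layer_offsets(q)) for q in range(5)]
work = [sum(points(k) for k in layer_offsets(q)) for q in range(5)]

# Share of all grid points that sits in the two lowest layers q = 3, 4.
extra_fraction = sum(work[3:]) / sum(work)

print(counts)
print(round(extra_fraction, 4))
```

In d = 3 the classical combination technique uses the layers q = 0, 1, 2. Under the assumptions above, the layers q = 3 and q = 4 hold only about 3% of all grid points, which is one way to read the "only a fraction of the computational effort" remark.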
![Page 44: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/44.jpg)
Results Using GENE
Example:
Good reconstruction (visual inspection)
![Page 45: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/45.jpg)
Results Using GENE (2)
Small (reduced) problem
4D: x, z, µ, v∥
$\vec{l}_{\min} = [2, 3, 2, 4]$, $\vec{l} = [6, 7, 6, 8]$ ⇒ 69 combination grids
[Figure: L2 error norm (log scale, 10^-6 to 10^-2) versus fraction of faults (5% to 25%)]
Excellent recovery properties!
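The count of 69 combination grids can be checked by enumeration. The sketch below assumes the standard truncated combination technique, in which the grids are l = l_min + k with k ≥ 0 and |k|_1 = m - q for q = 0, ..., d - 1, where m = 4 is the common difference between l and l_min, and the coefficient of layer q is (-1)^q C(d-1, q). The function name `combination_grids` is hypothetical; the talk does not show its own enumeration.

```python
from itertools import product
from math import comb

l_min = [2, 3, 2, 4]
l_max = [6, 7, 6, 8]
d = len(l_min)
m = l_max[0] - l_min[0]   # level difference, the same (4) in every dimension
assert all(hi - lo == m for lo, hi in zip(l_min, l_max))

def combination_grids():
    """Map each level vector to its combination coefficient."""
    grids = {}
    for q in range(d):
        coeff = (-1) ** q * comb(d - 1, q)
        for k in product(range(m + 1), repeat=d):
            if sum(k) == m - q:
                level = tuple(lo + ki for lo, ki in zip(l_min, k))
                grids[level] = coeff
    return grids

grids = combination_grids()
print(len(grids))            # number of combination grids
print(sum(grids.values()))   # the coefficients sum to 1
```

The four layers contain 35, 20, 10 and 4 grids, which adds up to the 69 grids stated above.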
![Page 46: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/46.jpg)
Computational Effort
Accumulated time to compute partial grids
[Figure: bar chart of time (s, axis scaled by 10^1) per layer q = 0, ..., 4; legend: CT, Extra]
Gain by ABFT
[Figure: time (s) versus number of faults (1 to 5); series: Restarting from checkpoint, With recombination]
Significant savings in runtime
![Page 47: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/47.jpg)
Silent/Soft Faults
No signal to user
Faults unnoticed unless searched for
Most common type: Silent Data Corruption (SDC)
Errors in arithmetic operations, memory corruption, bit flips
1 0 1 0 1 1 1 0 1
1 0 1 0 0 1 1 0 1
Common solutions: checksums, replication (process/data)
⇒ Significant overhead (effort, resources)
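A bit flip like the one pictured above can be emulated in a few lines. The sketch is only an illustration of why the effect of a single flipped bit depends heavily on its position; the helper name `flip_bit` is hypothetical, and the IEEE 754 double layout is the one Python floats use on common platforms.

```python
import math
import struct

def flip_bit(x, bit):
    """Return the double x with one bit (0 = least significant) flipped."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y

# The bit pattern from the slide: the fifth bit from the left is flipped.
before = 0b101011101
after = before ^ (1 << 4)
print(format(after, "09b"))

# The same kind of fault in a floating-point value.
print(flip_bit(1.0, 0))    # lowest mantissa bit: change in the last digit
print(flip_bit(1.0, 62))   # highest exponent bit: value becomes infinite
print(flip_bit(1.0, 63))   # sign bit
```

A flip in a low mantissa bit is numerically harmless, while a flip in the exponent can change the value by many orders of magnitude without raising any signal.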
![Page 48: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/48.jpg)
![Page 49: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/49.jpg)
Selective Reliability
Focus on critical parts
Algorithm: The Combination Technique in Parallel
for all combination grids Ω_i do in parallel
    u_i ← u(x, t = 0);  // Set initial conditions
while not converged do
    for all combination grids Ω_i do in parallel
        u_i ← solver(u_i, N_t);  // Solve the PDE on grid Ω_i (N_t timesteps)
    mitigateFaults();  // Mitigate faults
    u_n^(c) ← reduce(c_i u_i);  // Combine solutions
    for all i ∈ I_{n,q,τ} do
        u_i ← scatter(u_n^(c));  // Sample each u_i from new u_n^(c)
![Page 50: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/50.jpg)
Selective Reliability
Focus on critical parts
Algorithm: The Combination Technique in Parallel
for all combination grids Ω_i do in parallel
    u_i ← u(x, t = 0);  // Set initial conditions
while not converged do
    for all combination grids Ω_i do in parallel
        u_i ← solver(u_i, N_t);  // Solve the PDE on grid Ω_i (N_t timesteps)
    checkForSDC();  // Cheap sanity check
    mitigateFaults();  // Mitigate faults
    u_n^(c) ← reduce(c_i u_i);  // Combine solutions
    for all i ∈ I_{n,q,τ} do
        u_i ← scatter(u_n^(c));  // Sample each u_i from new u_n^(c)
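The control flow of the algorithm can be sketched as a small serial toy in 2D. Everything specific here is an assumption made for illustration: the solver is a stand-in that only applies a pointwise decay, `check_for_sdc` and `mitigate_faults` are empty placeholders, the combination is the classical 2D one (coefficient +1 on |l|_1 = N + 1, -1 on |l|_1 = N), and the reduce step is done on a full grid for simplicity rather than in the sparse grid's hierarchical space.

```python
import math

N = 5                     # target level of the toy problem (assumed)
DECAY = math.exp(-0.1)    # effect of one solver call (assumed stand-in)
ROUNDS = 3                # number of combination steps (assumed)

# Classical 2D combination: +1 on |l|_1 = N + 1, -1 on |l|_1 = N, l_i >= 1.
COEFFS = {(a, s - a): c for s, c in ((N + 1, 1), (N, -1)) for a in range(1, s)}

def f0(x, y):
    return math.sin(math.pi * x) * math.sin(math.pi * y)

def sample(f, l):
    """Nodal values of f on the grid with 2**l[k] + 1 points per dimension."""
    nx, ny = 2 ** l[0], 2 ** l[1]
    return [[f(i / nx, j / ny) for j in range(ny + 1)] for i in range(nx + 1)]

def interp(u, l, x, y):
    """Bilinear interpolation of the grid function u (level l) at (x, y)."""
    nx, ny = 2 ** l[0], 2 ** l[1]
    i, j = min(int(x * nx), nx - 1), min(int(y * ny), ny - 1)
    tx, ty = x * nx - i, y * ny - j
    return ((1 - tx) * (1 - ty) * u[i][j] + tx * (1 - ty) * u[i + 1][j]
            + (1 - tx) * ty * u[i][j + 1] + tx * ty * u[i + 1][j + 1])

def solver(u):
    """Stand-in for N_t PDE timesteps: pointwise decay of all values."""
    return [[DECAY * v for v in row] for row in u]

def check_for_sdc(grids):
    """Placeholder: this sketch performs no detection."""
    return []

def mitigate_faults(grids, suspects):
    """Placeholder: nothing to repair in this sketch."""

def combine(grids):
    """reduce: evaluate sum_i c_i u_i on the full grid of level (N, N)."""
    n = 2 ** N
    return [[sum(c * interp(grids[l], l, i / n, j / n) for l, c in COEFFS.items())
             for j in range(n + 1)] for i in range(n + 1)]

def scatter(uc, l):
    """Sample the combined solution at the points of component grid l."""
    sx, sy = 2 ** (N - l[0]), 2 ** (N - l[1])
    return [row[::sy] for row in uc[::sx]]

grids = {l: sample(f0, l) for l in COEFFS}            # initial conditions
for _ in range(ROUNDS):
    grids = {l: solver(u) for l, u in grids.items()}  # independent solves
    mitigate_faults(grids, check_for_sdc(grids))
    combined = combine(grids)
    grids = {l: scatter(combined, l) for l in COEFFS}

exact = sample(lambda x, y: DECAY ** ROUNDS * f0(x, y), (N, N))
max_err = max(abs(a - b) for ra, rb in zip(combined, exact) for a, b in zip(ra, rb))
print(max_err)
```

The solves in the loop are independent of each other, which is where the second level of parallelism comes from; only the combine and scatter steps need communication.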
![Page 51: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/51.jpg)
Silent/Soft Faults
Exploit hierarchical approach
Similar discretizations lead to similar results
Exploit redundancy and hierarchical representation to check for faults
Detection of outliers possible
Direct integration into communication schemes possible (SubspaceReduce)
[Figure: component grids combined with coefficients + and − into the sparse grid; hierarchical increment spaces of the sparse grid]
![Page 52: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/52.jpg)
SDC Check: Compare Pairs of Solutions
Similar discretizations should lead to similar results
[Figure: 2D level diagram with levels 1 to 4 on both axes]

| Pair | β(s,t) |
| --- | --- |
| (2, 4), (2, 3) | 3.98e-01 |
| (3, 2), (2, 4) | 1.11e+00 |
| (4, 2), (2, 3) | 1.11e+00 |
| (3, 2), (2, 3) | 6.32e-01 |
| (3, 3), (4, 2) | 9.85e+05 |
| (3, 3), (2, 3) | 1.07e+06 |
| (3, 2), (4, 2) | 3.98e-01 |
| (2, 4), (4, 2) | 1.27e+00 |
| (3, 2), (3, 3) | 1.07e+06 |
| (2, 4), (3, 3) | 9.85e+05 |

$$\beta^{(s,t)} := \max_{l \le s \wedge t} \; \max_{j \in I_l} \; \frac{\left|\alpha^{(t)}_{l,j} - \alpha^{(s)}_{l,j}\right|}{\min\left\{\left|\alpha^{(t)}_{l,j}\right|, \left|\alpha^{(s)}_{l,j}\right|\right\}}$$

Source: A. Parra Hinojosa, SDC-Resilient Algorithms Using the Sparse Grid Combination Technique, SPPEXA 2016 (Technische Universität München, Department of Informatics, Chair of Scientific Computing)
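The pairwise check can be tried out on a 1D toy. The sketch below is a simplified illustration, not the implementation behind the table: it works in 1D instead of 2D, uses sin(πx) as the "solution", and imitates discretization differences by small artificial relative errors. The surplus formula is the usual one for the piecewise-linear hierarchical basis (value minus the mean of the two hierarchical neighbours); the function names are hypothetical.

```python
import math

def hierarchize_1d(u):
    """Hierarchical surpluses alpha[(l, j)] of nodal values on 2**L + 1 points."""
    n = len(u) - 1               # number of intervals, n = 2**L
    top = n.bit_length() - 1     # L
    alpha = {}
    for l in range(1, top + 1):
        step = n >> l            # mesh width of level l in index units
        for j in range(1, 2 ** l, 2):
            idx = j * step
            alpha[(l, j)] = u[idx] - 0.5 * (u[idx - step] + u[idx + step])
    return alpha

def beta(alpha_s, alpha_t):
    """Largest relative difference of the surpluses both grids share."""
    common = alpha_s.keys() & alpha_t.keys()
    return max(abs(alpha_t[k] - alpha_s[k]) / min(abs(alpha_t[k]), abs(alpha_s[k]))
               for k in common)

def toy_solution(level, rel_error):
    """sin(pi x) on 2**level + 1 points with an artificial relative error."""
    n = 2 ** level
    return [(1.0 + rel_error) * math.sin(math.pi * i / n) for i in range(n + 1)]

u_s = toy_solution(3, 1e-3)
u_t = toy_solution(4, 2e-3)
beta_clean = beta(hierarchize_1d(u_s), hierarchize_1d(u_t))

u_t[len(u_t) // 2] *= 1e5        # corrupt one value on the finer grid
beta_bad = beta(hierarchize_1d(u_s), hierarchize_1d(u_t))

print(beta_clean, beta_bad)
```

Without corruption β stays near the size of the artificial discretization difference, while a single corrupted value pushes it up by many orders of magnitude. This mirrors the pattern in the table, where every pair that contains grid (3, 3) has β around 10^6.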
![Page 53: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/53.jpg)
SDC Check: Outlier detection
Source: A. Parra Hinojosa, SDC-Resilient Algorithms Using the Sparse Grid Combination Technique, SPPEXA 2016
u(0, 0) = [1.002, 5.356, 0.998, 1.002, 1.001, 1.001, 0.999]
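The talk does not state which outlier test is applied to such a set of values. As one possible stand-in, the sketch below uses the median absolute deviation (MAD) with the common modified z-score threshold of 3.5; the function name and the threshold are assumptions, not taken from the source.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate case: more than half of the values coincide.
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

u = [1.002, 5.356, 0.998, 1.002, 1.001, 1.001, 0.999]
print(mad_outliers(u))
```

A median-based test is a natural choice here because the corrupted value itself barely shifts the median, whereas it would distort a mean and standard deviation computed from only seven samples.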
![Page 54: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/54.jpg)
2D Example
Advection equation
$$\frac{\partial u}{\partial t} + c_x \frac{\partial u}{\partial x} + c_y \frac{\partial u}{\partial y} = 0, \qquad \Omega = [0, 1]^2$$
Periodic boundary conditions
Constant advection velocities $c_x$, $c_y$
Initial condition $u(x, y, t = 0) = \sin(2\pi x)\,\sin(2\pi y)$
Lax-Wendroff scheme (2nd order in space and time)
Error/solution at t = 0.5 compared to analytical solution
$$u(x, y, t) = \sin(2\pi(x - c_x t))\,\sin(2\pi(y - c_y t))$$
Corruption of one single data point in initial condition
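The test problem is small enough to reproduce on a single grid. The sketch below uses one common unsplit 2D Lax-Wendroff variant (including the mixed-derivative term); whether the talk uses this variant or a dimensionally split one is not stated. The resolution, the velocities c_x = 1 and c_y = 0.5, and the time step are assumed values chosen so that the scheme is stable.

```python
import math

N = 32                # grid points per dimension (assumed)
CX, CY = 1.0, 0.5     # advection velocities (assumed values)
T_END = 0.5
STEPS = 64            # gives CFL numbers 0.25 and 0.125
dx = 1.0 / N
dt = T_END / STEPS
nu_x, nu_y = CX * dt / dx, CY * dt / dx

def exact(x, y, t):
    return math.sin(2 * math.pi * (x - CX * t)) * math.sin(2 * math.pi * (y - CY * t))

def step(u):
    """One unsplit Lax-Wendroff step with periodic boundary conditions."""
    new = [[0.0] * N for _ in range(N)]
    for i in range(N):
        ip, im = (i + 1) % N, (i - 1) % N
        for j in range(N):
            jp, jm = (j + 1) % N, (j - 1) % N
            c = u[i][j]
            new[i][j] = (
                c
                - 0.5 * nu_x * (u[ip][j] - u[im][j])
                - 0.5 * nu_y * (u[i][jp] - u[i][jm])
                + 0.5 * nu_x ** 2 * (u[ip][j] - 2 * c + u[im][j])
                + 0.5 * nu_y ** 2 * (u[i][jp] - 2 * c + u[i][jm])
                + 0.25 * nu_x * nu_y
                * (u[ip][jp] - u[ip][jm] - u[im][jp] + u[im][jm])
            )
    return new

u = [[exact(i * dx, j * dx, 0.0) for j in range(N)] for i in range(N)]
for _ in range(STEPS):
    u = step(u)

max_err = max(abs(u[i][j] - exact(i * dx, j * dx, T_END))
              for i in range(N) for j in range(N))
print(max_err)
```

In the fault experiment, the same solver would run on every combination grid, with one entry of one grid's initial data overwritten before the first step.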
![Page 55: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/55.jpg)
2D Example
[Figure: three color plots on axes from 0 to 30: Exact solution, Full Grid, Combined grid; color scale −0.8 to 0.8 in each panel]
![Page 56: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/56.jpg)
2D Example
[Figure: three color plots on axes from 0 to 30: Exact solution and Full Grid with color scale −0.8 to 0.8, Combined grid with color scale −2.4 to 1.8]
![Page 57: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/57.jpg)
2D Example
[Figure: four color plots on axes from 0 to 30: Exact solution and Full Grid with color scale −0.8 to 0.8, Combined grid with color scale −2.4 to 1.8, Recovered with color scale −0.8 to 0.8]
![Page 58: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/58.jpg)
2D Example: Simulated Soft Faults
Inserting one soft fault
Measuring L2-error at the end
Injected fault: $u_i(x_{l_1,j_1}, x_{l_2,j_2}) = u_i(x_{l_1,j_1}, x_{l_2,j_2}) \times 10^{k}$ with $k \in \{5, -0.5, -300\}$
[Figure: log-scale plots of the final error versus the timestep where the fault occurs (0 to 500), one column per fault factor; panel rows labeled "lowest hierarchical subspace" and "highest hierarchical subspace"; series: Full Grid; CT, no SDC; CT, with SDC; CT, recovered]
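The fault model is a single multiplication of one stored value. A minimal sketch of such an injection, with a hypothetical helper name and made-up sample data, could look like this:

```python
def inject_soft_fault(u, index, factor):
    """Return a copy of u in which one entry is multiplied by factor."""
    corrupted = list(u)
    corrupted[index] *= factor
    return corrupted

u = [0.25, 0.5, 0.75, 1.0]           # made-up grid values
for exponent in (5, -0.5, -300):     # the three fault factors 10**exponent
    v = inject_soft_fault(u, 2, 10.0 ** exponent)
    print(exponent, v[2])
```

The three factors cover quite different situations: 10^5 produces a value far outside the range of the solution, 10^-300 effectively sets the value to zero, and 10^-0.5 changes it only by a factor of about three.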
![Page 59: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/59.jpg)
Higher-D: Advection-Diffusion Equation
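$$\partial_t u - \Delta u + \vec{a} \cdot \nabla u = f \quad \text{in } \Omega \times [0, T)$$
$$u(\cdot, t) = 0 \quad \text{on } \partial\Omega$$
$$u(\cdot, 0) = u_0 \quad \text{in } \Omega$$
$$\Omega = [0, 1]^d, \quad \vec{a} = (1, \ldots, 1)^T, \quad u_0 = e^{-100 \sum_{i=1}^{d} (x_i - 0.5)^2}$$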
Implemented in DUNE-pdelab
FVM, explicit time integration
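The initial condition is a narrow Gaussian centred in the unit cube. The short sketch below only evaluates u_0 (it is unrelated to the DUNE-pdelab implementation); the function name `u0` and the chosen sample points are assumptions for illustration.

```python
import math

def u0(x):
    """Gaussian initial condition centred at (0.5, ..., 0.5)."""
    return math.exp(-100.0 * sum((xi - 0.5) ** 2 for xi in x))

for d in (2, 5):
    centre = [0.5] * d
    face = [0.0] + [0.5] * (d - 1)   # midpoint of the boundary face x_1 = 0
    print(d, u0(centre), u0(face))
```

The largest boundary value is e^{-25}, roughly 1.4e-11, attained at the face midpoints, so the initial condition is compatible with the homogeneous Dirichlet boundary condition up to a negligible amount in any dimension.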
![Page 60: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/60.jpg)
Results
Fault in second time step
Relative error w.r.t. full-grid solution (n = 11 in 2D, n = 7 in 5D)
Computations on Hazel Hen (HLRS)
2D, 5D:
[Figure, 2D: l2-error (log scale, 10^-2 to 10^-1) versus n from (6,6) to (10,10); series: no faults, 2 groups (~50% faults), 4 groups (~25% faults), 8 groups (~12% faults), 16 groups (~6% faults)]
[Figure, 5D: l2-error (log scale, 10^-1 to 10^0) versus n from (4,4,4,4,4) to (6,6,6,6,6); series: no faults, 2 groups (~50% faults), 4 groups (~25% faults)]
Again: excellent recovery properties!
![Page 61: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/61.jpg)
Overview
1 Motivation and Numerics
2 Scalability
3 Algorithm-Based Fault Tolerance
  Hard Faults
  Silent/Soft Faults
4 Summary
![Page 62: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/62.jpg)
Summary
Gyrokinetics
  High-dimensional problem with urgent need for compute resources
Sparse grids: "Too Big Data" ⇒ Big Data
Hierarchical multilevel splitting provides novel handles on exa-challenges
Scalability
  2nd level of parallelism
  Numerical decoupling, extrapolation
  Exploit hierarchical splitting for optimal communication
ABFT at low cost
  Exploit hierarchical scheme
  Recombination rather than recomputation
Silent faults
  Exploit underlying hierarchical basis
  Detection and treatment of silent faults possible
![Page 63: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/63.jpg)
![Page 64: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/64.jpg)
Summary
Gyrokinetics
  High-dimensional problem with urgent need for compute resources
Sparse grids: "Too Big Data" ⇒ Big Data
Hierarchical multilevel splitting provides novel handles on exa-challenges
Scalability: reduce data in communication
  2nd level of parallelism
  Numerical decoupling, extrapolation
  Exploit hierarchical splitting for optimal communication
ABFT at low cost: avoid data storage and I/O
  Exploit hierarchical scheme
  Recombination rather than recomputation
Silent faults: limit communicated data
  Exploit underlying hierarchical basis
  Detection and treatment of silent faults possible
![Page 65: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/65.jpg)
Thanks to:
. . . and all others!
Thank you for your interest!
![Page 66: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/66.jpg)
![Page 67: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation](https://reader036.fdocuments.us/reader036/viewer/2022070807/5f05f0717e708231d4157b07/html5/thumbnails/67.jpg)