Reliable and Scalable Checkpointing Systems for Distributed Computing Environments
Final exam of Tanzima Zerin Islam
School of Electrical & Computer Engineering, Purdue University, West Lafayette, IN
Date: April 8, 2013
Reliable & Scalable Checkpointing Systems 2
Distributed Computing Environments
Tanzima Islam (tislam@purdue.edu)
[Figure: grid sites at Purdue, Notre Dame, and Indiana U. connected over the Internet]
High Performance Computing (HPC):
- Projected MTBF of 3-26 minutes at exascale
- Failure: hardware, software
Grid:
- Cycle-sharing system
- Highly volatile environment
- Failure: eviction of guest jobs
Fault-tolerance with Checkpoint-Restart
Checkpoints are execution states
System-level:
- Memory state
- Compressible
Application-level:
- Selected variables
- Hard to compress

struct ToyGrp {
    float Temperature[1024];
    int Pressure[20][30];
};
Challenges in Checkpointing Systems
[Figure: grid sites at Purdue, Notre Dame, and Indiana U. connected over the Internet]
HPC: Scalability of checkpointing systems
Grid: Use of dedicated checkpoint servers
Contributions of This Thesis
Timeline (the slide also marks the prelim exam):
- 2007-2009: FALCON, reliable checkpointing system in Grid [Best Student Paper Nomination, SC'09]
- 2009-2010: Compression on multi-core [2nd Place, ACM Student Research Competition'10]
- 2010-2012: MCRENGINE, scalable checkpointing system in HPC [Best Student Paper Nomination, SC'12]
- 2012-2013: MCRCLUSTER [unpublished]
Agenda
[MCRENGINE] Scalable checkpointing system for HPC
[MCRCLUSTER] Benefit-aware clustering
Future directions
A Scalable Checkpointing System using Data-Aware Aggregation and Compression
Collaborators: Kathryn Mohror, Adam Moody, Bronis de Supinski
Big Picture of HPC
[Figure: compute nodes of the Hera and Atlas clusters reach the parallel file system through gateway nodes. Annotations: network contention, contention for shared file system resources, contention from other clusters]
Checkpointing in HPC
MPI applications
- Take globally coordinated checkpoints asynchronously
Application-level checkpoint
- High-level data format for portability
- HDF5, Adios, netCDF, etc.

[Figure: checkpoint writing path through the application, the I/O library, and the data-format API (NetCDF, HDF5)]

struct ToyGrp {
    float Temperature[1024];
    short Pressure[20][30];
};

HDF5 checkpoint {
  Group "/" {
    Group "ToyGrp" {
      DATASET "Temperature" {
        DATATYPE H5T_IEEE_F32LE
        DATASPACE SIMPLE {(1024) / (1024)}
      }
      DATASET "Pressure" {
        DATATYPE H5T_STD_I16LE
        DATASPACE SIMPLE {(20,30) / (20,30)}
      }
    }
  }
}

Checkpoint transfer to the Parallel File System (PFS):
- N->1 (Funneled): not scalable
- N->N (Direct): easiest, but causes contention on the PFS
- N->M (Grouped): best compromise, but complex
Impact of Load on PFS at Large Scale
IOR
- Direct (N->N): 78MB per process
Observations:
(−) Large average write time, hence less frequent checkpointing
(−) Large average read time, hence poor application performance
[Figure: average write time (s) and average read time (s) versus number of processes (N), for N from 128 to 15,408]
What is the Problem?
Today's checkpoint-restart systems will not scale
- Increasing number of concurrent transfers
- Increasing volume of checkpoint data
Our Contributions
Data-aware aggregation
- Reduces the number of concurrent transfers
- Improves compressibility of checkpoints by using semantic information
Data-aware compression
- Improves compression ratio by up to 115% compared to concatenation and general-purpose compression
Design and develop mcrEngine
- Grouped (N->M) checkpointing system
- Improves checkpointing frequency
- Improves application performance
Naïve Solution: Data-Agnostic Compression
Agnostic scheme: concatenate checkpoints
  C1 C2 -> pGzip -> PFS
Agnostic-block scheme: interleave fixed-size blocks
  C1[1-B] C2[1-B] C1[B+1-2B] C2[B+1-2B] -> pGzip -> PFS
Observations:
(+) Easy
(−) Low compression ratio
[Figure: both schemes apply a single compression pass (the first phase, pGzip) before writing to the PFS]
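The two data-agnostic schemes can be sketched in a few lines. This is only an illustration under assumptions, not the thesis implementation: checkpoints are plain byte strings, `zlib` stands in for pGzip, and the function names and block size are hypothetical.

```python
import zlib


def agnostic(checkpoints):
    """Agnostic scheme: concatenate whole checkpoints in order."""
    return b"".join(checkpoints)


def agnostic_block(checkpoints, block_size):
    """Agnostic-block scheme: interleave fixed-size blocks round-robin."""
    merged = bytearray()
    longest = max(len(c) for c in checkpoints)
    for start in range(0, longest, block_size):
        for c in checkpoints:
            # The slice is empty once a shorter checkpoint is exhausted.
            merged += c[start:start + block_size]
    return bytes(merged)


def compress(buffer):
    """Single general-purpose pass (zlib used here as a stand-in for pGzip)."""
    return zlib.compress(buffer)


c1 = b"AAAABBBB"
c2 = b"aaaabbbb"
print(agnostic([c1, c2]))           # b'AAAABBBBaaaabbbb'
print(agnostic_block([c1, c2], 4))  # b'AAAAaaaaBBBBbbbb'
```

Neither function looks at what the bytes mean, which is the sense in which these schemes are "agnostic".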
Our Solution: [Step 1] Identify Similar Variables Across Processes
P0:
Group ToyGrp {
    float Temperature[1024];
    int Pressure[20][30];
};
P1:
Group ToyGrp {
    float Temperature[100];
    int Pressure[10][50];
};
Meta-data:
1. Name
2. Data-type
3. Class: Array, Atomic

[Step 2] Merging Scheme I: Aware Scheme
Concatenating similar variables:
  C1.T C1.P C2.T C2.P -> C1.T C2.T C1.P C2.P

[Step 2] Merging Scheme II: Aware-Block Scheme
Interleaving similar variables: the first 'B' bytes of Temperature from each checkpoint, then the next 'B' bytes of Temperature, and so on; Pressure is interleaved in the same way.
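A minimal sketch of the two data-aware merging schemes follows. It assumes each checkpoint is a dictionary from variable name to raw bytes and that the list of matching variable names is already known; the function names are hypothetical, not taken from mcrEngine.

```python
def aware(checkpoints, names):
    """Aware scheme: for each variable, concatenate its copies from all checkpoints."""
    return {name: b"".join(c[name] for c in checkpoints) for name in names}


def aware_block(checkpoints, names, block_size):
    """Aware-block scheme: for each variable, interleave B-byte blocks across checkpoints."""
    merged = {}
    for name in names:
        copies = [c[name] for c in checkpoints]
        out = bytearray()
        for start in range(0, max(len(v) for v in copies), block_size):
            for v in copies:
                out += v[start:start + block_size]
        merged[name] = bytes(out)
    return merged


c1 = {"T": b"T1T1", "P": b"P1P1"}
c2 = {"T": b"T2T2", "P": b"P2P2"}
print(aware([c1, c2], ["T", "P"]))           # {'T': b'T1T1T2T2', 'P': b'P1P1P2P2'}
print(aware_block([c1, c2], ["T", "P"], 2))  # {'T': b'T1T2T1T2', 'P': b'P1P2P1P2'}
```

Unlike the agnostic schemes, each output buffer here holds data of a single variable, so it can later be handed to a compressor chosen for that variable's data type.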
[Step 3] Data-Aware Aggregation & Compression
Aware scheme: concatenate similar variables
Aware-block scheme: interleave similar variables
First phase: data-type aware compression (FPC, Lempel-Ziv) of each merged variable (T, P, H, D) into an output buffer
Second phase: pGzip over the output buffer, then write to the PFS
[Figure: merged variables C1.T C2.T, C1.P C2.P, C1.H C2.H, and C1.D C2.D pass through the first and second compression phases]
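The two-phase structure can be illustrated as below. This is a sketch under loud assumptions: FPC, fpzip, and pGzip are not used here. A byte shuffle followed by `zlib` stands in for the floating-point compressors, plain `zlib` stands in for both Lempel-Ziv and pGzip, and the type tags `"f64"`, `"f32"`, `"i32"` are hypothetical labels.

```python
import struct
import zlib


def shuffle(data, width):
    """Group the i-th byte of every element together (byte transposition)."""
    return b"".join(data[i::width] for i in range(width))


def unshuffle(data, width):
    """Inverse of shuffle for inputs whose length is a multiple of width."""
    n = len(data) // width
    return bytes(data[(i % width) * n + i // width] for i in range(len(data)))


def first_phase(dtype, data):
    """Data-type aware pass: pick a compressor based on the variable's type."""
    if dtype == "f64":
        return zlib.compress(shuffle(data, 8))  # stand-in for FPC
    if dtype == "f32":
        return zlib.compress(shuffle(data, 4))  # stand-in for fpzip
    return zlib.compress(data)                  # stand-in for Lempel-Ziv


def second_phase(chunks):
    """General-purpose pass over the length-prefixed output buffer."""
    buffer = b"".join(struct.pack("<I", len(c)) + c for c in chunks)
    return zlib.compress(buffer)                # stand-in for pGzip


temperature = struct.pack("<4f", 1.0, 1.5, 2.0, 2.5) * 2  # merged C1.T and C2.T
pressure = struct.pack("<4i", 7, 7, 8, 8) * 2             # merged C1.P and C2.P
chunks = [first_phase("f32", temperature), first_phase("i32", pressure)]
final = second_phase(chunks)
print(len(temperature) + len(pressure), "bytes in,", len(final), "bytes out")
```

The length prefix in `second_phase` is one possible way to keep the per-variable streams separable on restart; the slides do not specify the actual buffer layout.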
How MCRENGINE Works
CNC: Compute node component
ANC: Aggregator node component
Rank-order groups, Grouped (N->M) transfer
[Figure: three rank-order groups, each with several CNCs and one aggregator. The labels show, in order: meta-data sent to the aggregator; aggregator identifies "similar" variables; request T, P; aggregator applies data-aware aggregation and compression; request H, D; pGzip; write to the PFS]
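Rank-order grouping for N->M transfer amounts to a simple partition of consecutive ranks. The sketch below assumes that the first rank of each group hosts the aggregator; the slides do not say where the aggregator runs, so that choice and the function names are hypothetical.

```python
def rank_order_groups(num_ranks, group_size):
    """Partition ranks 0..num_ranks-1 into consecutive groups of at most group_size."""
    return [list(range(start, min(start + group_size, num_ranks)))
            for start in range(0, num_ranks, group_size)]


def aggregator_of(rank, group_size):
    """Rank hosting the aggregator for `rank` (assumed: first rank of its group)."""
    return (rank // group_size) * group_size


groups = rank_order_groups(10, 4)
print(groups)               # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(aggregator_of(6, 4))  # 4
```

With N processes and group size g, the number of concurrent writers to the PFS drops from N to about N/g.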
Evaluation
Applications
- ALE3D: 4.8GB per checkpoint set
- Cactus: 2.41GB per checkpoint set
- Cosmology: 1.1GB per checkpoint set
- Implosion: 13MB per checkpoint set
Experimental test-bed
- LLNL's Sierra: 261.3 TFLOP/s, Linux cluster
- 23,328 cores, 1.3 Petabyte Lustre file system
Compression algorithms
- FPC [1] for double-precision float
- Fpzip [2] for single-precision float
- Lempel-Ziv for all other data-types
- pGzip for general-purpose compression
Evaluation Metrics
Effectiveness of data-aware compression
- What is the benefit of multiple compression phases?
- How does group size affect compression ratio?
Performance of mcrEngine
- Overhead of the checkpointing phase
- Overhead of the restart phase
Compression ratio = Uncompressed size / Compressed size
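As a worked example of the metric, the snippet below computes the compression ratio and a relative improvement between two ratios. The relative-improvement formula is an assumed definition (the slides report relative improvements but do not spell out the formula), and the sizes are made-up numbers.

```python
def compression_ratio(uncompressed_size, compressed_size):
    """Compression ratio = uncompressed size / compressed size."""
    return uncompressed_size / compressed_size


def relative_improvement(ratio_new, ratio_baseline):
    """Assumed definition: percent gain of one ratio over a baseline ratio."""
    return (ratio_new - ratio_baseline) / ratio_baseline * 100.0


baseline = compression_ratio(1000, 500)          # 2.0
improved = compression_ratio(1000, 250)          # 4.0
print(relative_improvement(improved, baseline))  # 100.0
```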
No Benefit with Data-Agnostic Double Compression
Multiple Phases of Data-Aware Compression are Beneficial
Data-agnostic double compression is not beneficial
- Because the data format is non-uniform and incompressible
Data-type aware compression improves compressibility
- First phase changes the underlying data format
[Figure: compression ratio (0 to 4) after the first and second compression phases for ALE3D, Cactus, Cosmology, and Implosion; series: Data-Agnostic, Data-Aware]
Impact of Group Size on Compression Ratio
Different merging schemes are better for different applications
Larger group size is beneficial for certain applications
- ALE3D: improvement of 8% from group size 2 to 32
[Figure: compression ratio versus group size (1, 2, 4, 8, 16, 32) for ALE3D and Cactus; series: Aware-Block, Aware]
Data-Aware Technique Always Wins over Data-Agnostic
Data-aware technique always yields a better compression ratio than the data-agnostic technique
[Figure: compression ratio versus group size (1, 2, 4, 8, 16, 32) for ALE3D and Cactus; series: Aware-Block, Aware, Agnostic-Block, Agnostic; annotation: 98-115%]
Summary of Effectiveness Study
Data-aware compression always wins
- Reduces gigabytes of data for Cactus
Larger group sizes may improve compression ratio
Different merging schemes for different applications
Compression ratio follows course of simulation
Impact of Data-Aware Compression on Latency
IOR with Grouped (N->M) transfer, groups of 32 processes
- Data-aware: 1.2GB, data-agnostic: 2.4GB
Data-aware compression improves I/O performance at large scale
- Improvement during write: 43% - 70%
- Improvement during read: 48% - 70%
[Figure: average transfer time (sec, 0 to 300) versus number of processes (N), from 128 to 28,672; series: Agnostic-Read, Agnostic-Write, Aware-Read, Aware-Write]
Impact of Aggregation & Compression on Latency
Used IOR
- Direct (N->N): 87MB per process
- Grouped (N->M): group size 32, 1.21GB per aggregator
[Figure: average write time (sec) and average read time (sec) versus number of processes (N), from 128 to 15,408; series: N->N Write, N->M Write, N->N Read, N->M Read]
End-to-End Checkpointing Overhead
15,408 processes
Group size of 32 for N->M schemes
Each process takes a checkpoint
Converts a network-bound operation into a CPU-bound one
Reduction in checkpointing overhead: 87%, 51%
[Figure: total checkpointing overhead (sec, 0 to 350), split into transfer overhead and CPU overhead, for ALE3D and Cactus; bars: Direct (No Comp., Indiv. Comp) and Grouped (No Comp., Agnostic, Aware)]
End-to-End Restart Overhead
Reduced overall restart overhead
Reduced network load and transfer time
Reduction in I/O overhead: 43%, 71%
Reduction in recovery overhead: 62%, 64%
[Figure: total recovery overhead (sec, 0 to 600), split into transfer overhead and CPU overhead, for ALE3D and Cactus; bars: Direct (No Comp., Indiv. Comp) and Grouped (No Comp., Agnostic, Aware)]
Summary of Scalable Checkpointing System
Developed data-aware checkpoint compression technique
- Relative improvement in compression ratio up to 115%
Investigated different merging techniques
- Demonstrated effectiveness using real-world applications
Designed and developed MCRENGINE
- Reduces recovery overhead by more than 62%
- Reduces checkpointing overhead by up to 87%
- Improves scalability of checkpoint-restart systems
Benefit-Aware Clustering of Checkpoints from Parallel Applications
Collaborators: Todd Gamblin, Kathryn Mohror, Adam Moody, Bronis de Supinski
Our Goal & Contributions
Goal:
- Can suitably grouping checkpoints increase compressibility?
Contributions:
- Design a new metric for "similarity" of checkpoints
- Use this metric for clustering checkpoints
- Evaluate the benefit of the clustering on checkpoint storage
Different Clustering Schemes
[Figure: 16 checkpoints grouped under three schemes: Random, Rank-wise, and Data-aware (our solution)]
Research Questions
How to cluster checkpoints?
Does clustering improve compression ratio?
Benefit-Aware Clustering
Similarity metric: improvement in reduction
Goal: minimize the total compressed size
[Figure: benefit matrix β of Cactus]
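One way to make "improvement in reduction" concrete is sketched below. The exact definition of β is not given on the slide, so this assumes β(i, j) is the fraction of compressed bytes saved when checkpoints i and j are compressed together rather than separately; `zlib` stands in for the real compressors and all names are hypothetical.

```python
import random
import zlib


def csize(data):
    """Compressed size in bytes (zlib as a stand-in compressor)."""
    return len(zlib.compress(data))


def benefit(a, b):
    """Assumed β: fraction of compressed bytes saved by compressing a and b together."""
    separate = csize(a) + csize(b)
    return (separate - csize(a + b)) / separate


def benefit_matrix(checkpoints):
    """Pairwise benefit for every pair of checkpoints."""
    n = len(checkpoints)
    return [[benefit(checkpoints[i], checkpoints[j]) for j in range(n)]
            for i in range(n)]


rng = random.Random(0)
ckpt_a = bytes(rng.getrandbits(8) for _ in range(1000))
ckpt_b = bytes(rng.getrandbits(8) for _ in range(1000))
beta = benefit_matrix([ckpt_a, ckpt_a, ckpt_b])
# Identical checkpoints benefit far more from joint compression than unrelated ones.
print(beta[0][1] > beta[0][2])
```

Computing the full matrix needs a compression run per pair, which is presumably why the deck later discusses filtering and sampling to reduce the cost.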
Novel Dissimilarity Metric
Two factors for the dissimilarity between two checkpoints:
$$\Delta(i, j) = \frac{1}{\beta(i, j)} \times \sum_{k=1}^{N} \left[\beta(i, k) - \beta(j, k)\right]^2$$
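The dissimilarity metric can be evaluated directly from a benefit matrix. The sketch below follows the formula Δ(i, j) = (1 / β(i, j)) × Σ_k [β(i, k) − β(j, k)]², with one added assumption: when β(i, j) is zero the function returns infinity instead of dividing by zero. The example matrix is made up.

```python
def dissimilarity(beta, i, j):
    """Δ(i, j) = (1 / β(i, j)) * sum over k of [β(i, k) − β(j, k)]²."""
    profile_gap = sum((beta[i][k] - beta[j][k]) ** 2 for k in range(len(beta)))
    if beta[i][j] == 0:
        return float("inf")  # assumption: no pairwise benefit means maximally dissimilar
    return profile_gap / beta[i][j]


# Hypothetical 3x3 benefit matrix: checkpoints 0 and 1 resemble each other, 2 does not.
beta = [
    [0.5, 0.4, 0.1],
    [0.4, 0.5, 0.1],
    [0.1, 0.1, 0.5],
]
d01 = dissimilarity(beta, 0, 1)
d02 = dissimilarity(beta, 0, 2)
print(d01 < d02)  # True
```

The sum captures how differently i and j relate to every other checkpoint, and the 1/β(i, j) factor shrinks the distance when the pair compresses well together.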
How Benefit-Aware Clustering Works
Pipeline: Filter -> Order -> Similarity -> Cluster
[Figure: example with processes P1 to P5. The variable list double T[3000]; double V[10]; double P[5000]; double D[4000]; double R[100]; is reduced to T, P, D and reordered as D, P, T. The variables are sampled (chunking or wavelet) to compute β. Result: Cluster 1 = {P1, P3, P4}, Cluster 2 = {P2, P5}]
Structure of MCRCLUSTER
[Figure: each compute-node process (P1 to P5) runs the Filter (F), Order (O), Similarity (S), and Cluster (C) stages; aggregators A1 and A2 write to the PFS]
Evaluation
Applications
- IOR (synthetic checkpoints)
- Cactus
Experimental test-bed
- LLNL's Sierra: 261.3 TFLOP/s, Linux cluster
- 23,328 cores, 1.3 Petabyte Lustre file system
Evaluation metrics:
- Macro benchmark: effectiveness of clustering
- Micro benchmark: effectiveness of sampling
Effectiveness of MCRCLUSTER
IOR: 32 checkpoints
- Odd processes write 0
- Even processes write: <rank> | 1234567
29% more compression compared to rank-wise grouping, 22% more compared to random grouping
Effectiveness of Sampling
X axis: each variable
Y axis: range of benefit values
Takeaway:
- Chunking method preserves benefit relationships the closest
[Figure: two panels, Chunking and Wavelet Transform]
Contributions of MCRCLUSTER
Design similarity and distance metric
Demonstrate significant result on synthetic data
- 22% and 29% improvement compared to random and rank-wise clustering, respectively
Future directions for a first-year Ph.D. student
- Study impact on real applications
- Design scalable clustering technique
Applicability of My Research
Condor systems
Compression for scientific data
Conclusions
This thesis addresses:
- Reliability of checkpointing-based recovery in large-scale computing
Proposed three novel systems:
- FALCON: distributed checkpointing system for Grids
- MCRENGINE: "Data-Aware Compression" and scalable checkpointing system for HPC
- MCRCLUSTER: "Benefit-Aware Clustering"
Provides a good foundation for further research in this field
Questions?
Future Directions: Reliability
Reliability: similarity-based process grouping for better compression
- Group processes based on similarity instead of rank [ongoing]
- Analytical solution to group size selection
- Variable streaming
Integrating mcrEngine with SCR
Future Directions: Performance
Cache usage analysis and optimization
- Developed user-level tool for analyzing cache utilization [Summer'12]
Short-term goals:
- Apply to real applications
- Automate analysis
Long-term goals:
- Suggest potential code optimizations
- Automate application tuning
Contact Information
Tanzima Islam (tislam@purdue.edu)
Website: web.ics.purdue.edu/~tislam
Effectiveness of mcrCluster
Backup Slides
[Backup Slide] Failures in HPC
“A Large-scale Study of Failures in High-performance Computing Systems”, by Bianca Schroeder, Garth Gibson
[Figure: two panels, breakdown of root causes of failures and breakdown of downtime into root causes]
[Backup Slide] Failures in HPC
“Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm”, by Laxmikant Kalé et al.
Disparity between network bandwidth and memory size
[Backup Slides] Falcon
[Backup Slide] Breakdown of Overheads
Performance scales with checkpoint sizes
Lower network transfer overhead
[Figure: checkpointing overhead (sec) and recovery overhead (sec), 0 to 180, for F and D at checkpoint sizes 500MB, 946MB, and 1677MB]
[Backup Slide] Parallel Falcon
67% improvement in CPU time
[Figure: recovery overhead (sec) and checkpoint storing overhead (sec), 0 to 180, for F, PF, and D at checkpoint sizes 500MB, 946MB, and 1677MB]
[Backup Slides] mcrEngine
[Backup Slide] How to Find Similarity
Inside source code: variables are represented as members of a group. A group can be thought of as the construct "struct" in C.

P0:
Group ToyGrp {
    float Temperature[1024];
    short Pressure[20][30];
    int Humidity;
};

P1:
Group ToyGrp {
    float Temperature[50];
    short Pressure[2][6];
    double Unit;
    int Humidity;
};

Inside a checkpoint: variables are annotated with metadata, from which a hash key is generated for matching.

P0:
Var: "ToyGrp/Temperature", Type: F32LE, Array1D [1024] -> ToyGrp/Temperature_F32LE_Array1D
Var: "ToyGrp/Pressure", Type: S8LE, Array2D [20][30] -> ToyGrp/Pressure_S8LE_Array2D
Var: "ToyGrp/Humidity", Type: I32LE, Atomic -> ToyGrp/Humidity_I32LE_Atomic

P1:
Var: "ToyGrp/Temperature", Type: F32LE, Array1D [50] -> ToyGrp/Temperature_F32LE_Array1D
Var: "ToyGrp/Pressure", Type: S8LE, Array2D [2][6] -> ToyGrp/Pressure_S8LE_Array2D
Var: "ToyGrp/Unit", Type: F64LE, Atomic -> ToyGrp/Unit_F64LE_Atomic (no match)
Var: "ToyGrp/Humidity", Type: I32LE, Atomic -> ToyGrp/Humidity_I32LE_Atomic
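The matching step on this slide can be sketched as a key comparison. The key format `name_type_class` mirrors the keys shown on the slide; the tuple representation of the metadata and the function names are assumptions made for illustration.

```python
def hash_key(name, dtype, var_class):
    """Build a matching key from name, data-type, and class (dimensions are excluded)."""
    return f"{name}_{dtype}_{var_class}"


def similar_variables(*processes):
    """Return the keys present in every process, in the order of the first process."""
    key_lists = [[hash_key(*var) for var in proc] for proc in processes]
    common = set(key_lists[0]).intersection(*key_lists[1:])
    return [key for key in key_lists[0] if key in common]


p0 = [("ToyGrp/Temperature", "F32LE", "Array1D"),
      ("ToyGrp/Pressure", "S8LE", "Array2D"),
      ("ToyGrp/Humidity", "I32LE", "Atomic")]
p1 = [("ToyGrp/Temperature", "F32LE", "Array1D"),
      ("ToyGrp/Pressure", "S8LE", "Array2D"),
      ("ToyGrp/Unit", "F64LE", "Atomic"),
      ("ToyGrp/Humidity", "I32LE", "Atomic")]

print(similar_variables(p0, p1))
# ['ToyGrp/Temperature_F32LE_Array1D', 'ToyGrp/Pressure_S8LE_Array2D', 'ToyGrp/Humidity_I32LE_Atomic']
```

Because array dimensions are left out of the key, Temperature[1024] on P0 and Temperature[50] on P1 still match, while "ToyGrp/Unit" exists only on P1 and finds no partner.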
[Backup Slide] Compression Ratio Follows Course of Simulation
Data-aware technique always yields better compression
[Figure: compression ratio versus simulation time-steps for Cactus, Cosmology, and Implosion; series: Aware-Block, Aware, Agnostic-Block, Agnostic]
[Backup Slide] Relative Improvement in Compression Ratio Compared to Data-Agnostic Scheme
Data types: DF = double-precision float, SF = single-precision float, Int = integer
Application | Total Size (GB) | DF (%) | SF (%) | Int (%) | Aware-Block (%) | Aware (%)
ALE3D       | 4.8             | 88.8   | ~0     | 11.2    | 6.6 - 27.7      | 6.6 - 12.7
Cactus      | 2.41            | 33.94  | 0      | 66.06   | 10.7 - 11.9     | 98 - 115
Cosmology   | 1.1             | 24.3   | 67.2   | 8.5     | 20.1 - 25.6     | 20.6 - 21.1
Implosion   | 0.013           | 0      | 74.1   | 25.9    | 36.3 - 38.4     | 36.3 - 38.8
References
1. M. Burtscher and P. Ratanaworabhan, “FPC: A High-speed Compressor for Double-Precision Floating-Point Data”.
2. P. Lindstrom and M. Isenburg, “Fast and Efficient Compression of Floating-Point Data”.
3. L. Reinhold, “QuickLZ”.
Reliable and Efficient System for Storing Checkpoints in Grid
Execution Environment: Grid
State-of-the-Art: Checkpointing in Grid with Dedicated Storage
[Figure: a submitter and grid sites at Purdue, Notre Dame, and Indiana U. send checkpoints over the Internet to a dedicated storage server]
Problems:
(−) High transfer latency
(−) Contention on servers
(−) Stress on shared network resources
Research Question
Can we improve the performance of applications by storing checkpoints on the grid resources?
Overview of Our Solution: Checkpointing in Grid with Distributed Storage
[Figure: a submitter and grid sites at Purdue, Notre Dame, and Indiana U. connected over the Internet, with checkpoints stored on the grid resources]
Q1. Which storage nodes?
Q2. How to balance load?
Q3. How to efficiently store & retrieve?
Constraint: all components must be user-level
Answer to Q1: Storage Host Selection
Build failure model for storage resources
- Compute correlated temporal reliability
- Based on historical data
Rank machines
- Based on: reliability, load, and network overhead
- Objective function: checkpoint storing overhead – benefit from restart
- Addresses Q2
Output:
- (m+k) storage hosts
[Figure: timelines of a compute host and storage hosts 1 and 2 with their down periods]
Checkpoint-Recovery Scheme
Checkpoint storing phase:
  Original checkpoint -> Compression -> Compressed -> Erasure encoding (m+k) -> Fragments 1, 2, ..., m+k -> Disk on each storage host
Recovery phase:
  Fragments from the storage hosts' disks -> Erasure decoding (m) -> Compressed -> Decompression -> Original checkpoint
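The store and recover phases can be illustrated with the simplest possible erasure code. This sketch is not FALCON's encoder: it assumes a single XOR parity fragment (k = 1), so it survives the loss of any one of the m+1 fragments, and it uses `zlib` for the compression step. Function names are hypothetical.

```python
import zlib


def encode(checkpoint, m):
    """Compress, split into m equal fragments, and append one XOR parity fragment."""
    compressed = zlib.compress(checkpoint)
    frag_len = -(-len(compressed) // m)  # ceiling division
    padded = compressed.ljust(frag_len * m, b"\0")
    fragments = [padded[i * frag_len:(i + 1) * frag_len] for i in range(m)]
    parity = bytes(frag_len)
    for frag in fragments:
        parity = bytes(x ^ y for x, y in zip(parity, frag))
    return fragments + [parity], len(compressed)


def decode(fragments, compressed_len):
    """Rebuild the checkpoint when at most one fragment (given as None) is missing."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost fragment")
    if missing:
        frag_len = len(next(f for f in fragments if f is not None))
        rebuilt = bytes(frag_len)
        for f in fragments:
            if f is not None:
                rebuilt = bytes(x ^ y for x, y in zip(rebuilt, f))
        fragments = list(fragments)
        fragments[missing[0]] = rebuilt
    data = b"".join(fragments[:-1])[:compressed_len]
    return zlib.decompress(data)


original = b"checkpoint state " * 100
frags, n = encode(original, m=4)
frags[2] = None  # one storage host is down
print(decode(frags, n) == original)  # True
```

A code with k > 1, such as Reed-Solomon, follows the same store and recover flow but tolerates up to k lost fragments.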
Evaluation Setup
2 different applications with 4 input sets
- MCF (SPEC CPU 2006)
- TIGR (BioBench)
System-level checkpoints
Macro benchmark experiment
- Average job makespan
Micro benchmark experiments
- Efficiency of checkpoint and restart
- Efficiency in handling simultaneous clients
- Efficiency in handling multiple failures
Checkpoint Storing & Recovery Overhead
Performance scales with checkpoint sizes
Lower network transfer overhead
[Figure: recovery overhead (sec) and checkpointing overhead (sec), 0 to 180, split into transfer overhead and CPU overhead, for D and F at checkpoint sizes 500MB, 946MB, and 1677MB]
Overall Performance Comparison
Performance improvement between 11% and 44%
[Figure: average makespan time (min, 0 to 120) for benchmark applications mcf and tigr; series: Remote Dedicated Server, Local Dedicated Server, Falcon with Distributed Server]
Summary of Reliable Checkpointing System
Developed a reliable checkpoint-recovery system: FALCON
- Select reliable storage hosts
- Prefer lightly loaded ones
- Compress and encode
- Store and retrieve efficiently
Ran experiments with FALCON in DiaGrid
- Performance improvement between 11% and 44%
Checkpointing in HPC
[Figure: compute nodes of the Hera and Atlas clusters reach the parallel file system through gateway nodes. Annotations: network contention, contention for shared file system resources, contention from other clusters for the file system]
2-D vs N-D Compression
Challenge in Extreme-Scale: Increase in Failure-Rate
[Figure: performance growth on a scale from 1 Gflop/s to 1 Eflop/s; series: N=1, N=500]
Towards Online Clustering
Reduce dimension of β
- Reduce the number of variables
Representative data-type
- Number of elements greater than a threshold [Example: 100 variables of double type cover 80% of data]
Reduce the amount of data
- Sampling: random, chunking, and wavelet