Reliable and Scalable Checkpointing Systems for Distributed Computing Environments

Tanzima Zerin Islam

School of Electrical & Computer EngineeringPurdue UniversityWest Lafayette, INDate: April 8, 2013

Final exam of

Reliable & Scalable Checkpointing Systems 2

Distributed Computing Environments

Tanzima Islam (tislam@purdue.edu)

@Purdue

@Notre Dame

@Indiana U.

Internet

High Performance Computing (HPC):

Projected MTBF 3-26 minutes in exascaleFailure: hardware, software

Grid:Cycle sharing systemHighly volatile environmentFailure: eviction of guest jobs

Fault-tolerance with Checkpoint-Restart

Checkpoints are execution statesSystem-level

Memory stateCompressible

Application-levelSelected variablesHard to compress

Struct ToyGrp{1. float Temperature[1024];2. int Pressure[20][30];};

Challenges in Checkpointing Systems

@Purdue

@Notre Dame

@Indiana U.

Internet

HPC:Scalability of checkpointing systems

Grid:Use of dedicated checkpoint servers

Contributions of This Thesis

FALCON

Reliable Checkpointing System in Grid

[Best Student Paper Nomination, SC’09]

Scalable Checkpointing System in HPC

[Best Student Paper Nomination, SC’12]

Unpublished

Compression on Multi-core

2nd Place, ACM Student Research Competition’10

MCRENGINE

MCRCLUSTER

2007 - 2009 2009-2010 2010-2012 2012-2013

Prelim

Agenda

[MCRENGINE] Scalable checkpointing system for HPC

[MCRCLUSTER] Benefit-aware clustering

Future directions

A Scalable Checkpointing System using Data-Aware Aggregation and Compression

Collaborators: Kathryn Mohror, Adam Moody, Bronis de Supinski

Reliable & Scalable Checkpointing Systems 8Tanzima Islam (tislam@purdue.edu)

Compute Nodes

Gateway Nodes

Network Contention

Parallel File System

Hera Hera

Big Picture of HPC

Contention for Shared File System Resources

Contention for Other Clusters

Checkpointing in HPC

MPI applicationsTake globally coordinated checkpoints asynchronously

Application-level checkpointHigh-level data format for portability

HDF5, Adios, netCDF etc.

Checkpoint writing

Application

I/O Library

Data-Format API

Struct ToyGrp{1. float Temperature[1024];2. short Pressure[20][30];};

NetCDFHDF5

1. HDF5 checkpoint{2. Group “/”{3. Group “ToyGrp”{

DATASET “Temperature”{DATATYPE H5T_IEEE_F32LEDATASPACE SIMPLE {(1024) / (1024)}}DATASET “Pressure” {DATATYPE H5T_STD_U8LEDATASPACE SIMPLE {(20,30) / (20,30)}}}}}

Parallel File System (PFS)

N1 (Funneled)

NN (Direct)

NM (Grouped)

Not scalable Best compromise but complex

Easiest butcontention on PFS

Impact of Load on PFS at Large Scale

IORDirect (NN): 78MB per process

Observations:(−) Large average write time less frequent checkpointing(−) Large average read time poor application performance

0200400600800

100012001400

# of Processes (N)

20406080

100120140

# of Processes (N)

What is the Problem?

Today’s checkpoint-restart systems will not scaleIncreasing number of concurrent transfersIncreasing volume of checkpoint data

Our Contributions

Data-aware aggregationReduces the number of concurrent transfersImproves compressibility of checkpoints by using semantic information

Data-aware compressionImproves compression ratio by 115% compared to concatenation and general-purpose compression

Design and develop mcrEngineGrouped (NM) checkpointing systemImproves checkpointing frequencyImproves application performance

Agnostic scheme – concatenate checkpoints

Agnostic-block scheme – interleave fixed-size blocks

Observations:(+) Easy(−) Low compression ratio

Naïve Solution: Data-Agnostic Compression

[1-B]C2

[B+1-2B]

[1-B]C1

[B+1-2B]

C1 C2 pGzip PFS

pGzip PFS

First Phase

Our Solution: [Step 1] Identify Similar Variables Across Processes

C1.T C1.P C2.T C2.P

Group ToyGrp{ float Temperature[1024]; int Pressure[20][30];};

Group ToyGrp{ float Temperature[100]; int Pressure[10][50];};

Meta-data:1. Name2. Data-type3. Class:

-- Array, Atomic

Concatenating similar variables C2.TC1.T C2.PC1.P

[Step 2] Merging Scheme I: Aware Scheme

First ‘B’ bytes of TemperatureNext ‘B’ bytes of TemperatureInterleavePressure

C1.T C1.P C2.PC2.T

Interleaving similar variables

[Step 2] Merging Scheme II: Aware-Block Scheme

[Step 3] Data-Aware Aggregation & Compression

Aware scheme – concatenate similar variablesAware-block scheme – interleave similar variables

Output buffer

Data-type aware compression FPC Lempel-Ziv

T P H D

C2.TC1.T C2.PC1.P

First Phase

Second Phase

C2.HC1.H C2.DC1.D

How MCRENGINE Works

CNCCNC

CNCAggregator

ComputeComponent

CNCCNC

CNCCompute Component

Aggregator

Meta-data

Identifies “similar” variables

Request T, P

Applies data-aware aggregation and compression

CNC : Compute node componentANC: Aggregator node componentRank-order groups, Grouped (NM) transfer

CNCCNC

CNCComputeComponent

Request H, D

HT P D

Evaluation

ApplicationsALE3D – 4.8GB per checkpoint setCactus – 2.41GB per checkpoint setCosmology – 1.1GB per checkpoint setImplosion – 13MB per checkpoint set

Experimental test-bedLLNL’s Sierra: 261.3 TFLOP/s, Linux cluster23,328 cores, 1.3 Petabyte Lustre file system

Compression algorithmFPC [1] for double-precision floatFpzip [2] for single-precision floatLempel-Ziv for all other data-typespGzip for general-purpose compression

Evaluation Metrics

Effectiveness of data-aware compressionWhat is the benefit of multiple compression phases?How does group size affect compression ratio?

Performance of mcrEngineOverhead of the checkpointing phaseOverhead of the restart phase

Compression ratio = Uncompressed size

Compressed size

First Second First Second First Second First SecondALE3D Cactus Cosmology Implosion

No Benefit with Data-Agnostic Double Compression

Data-agnostic double compression is not beneficialBecause, data-format is non-uniform and uncompressible

Data-type aware compression improves compressibilityFirst phase changes underlying data format

Data-Agnostic

Data-Aware

Multiple Phases of Data-Aware Compressionare Beneficial

1 2 4 8 16 322.5

1 2 4 8 16 320.5

Different merging schemes better for different applicationsLarger group size beneficial for certain applications

ALE3D: Improvement of 8% from group size 2 to 32

ALE3D Cactus

Group size

Aware-Block

Impact of Group Size on Compression Ratio

Data-Aware Technique Always Wins over Data-Agnostic

Group size

Aware-Block

Agnostic-Block

Agnostic

98-115%

Data-aware technique always yields better compression ratio than Data-Agnostic technique

1 2 4 8 16 322.5

1 2 4 8 16 320.5

ALE3D Cactus

Summary of Effectiveness Study

Data-aware compression always winsReduces gigabytes of data for Cactus

Larger group sizes may improve compression ratio

Different merging schemes for different applications

Compression ratio follows course of simulation

Impact of Data-Aware Compression on Latency

IOR with Grouped(NM) transfer, groups of 32 processesData-aware: 1.2GB, data-agnostic: 2.4GB

Data-aware compression improves I/O performance at large scaleImprovement during write 43% - 70%Improvement during read 48% - 70%

128256

5121024

20484096

286720

# of Processes (N)

Agnostic

Agnostic-Read

Agnostic-Write

Aware-Read

Aware-Write

Impact of Aggregation & Compression on Latency

Used IORDirect (NN): 87MB per processGrouped (NM): Group size 32, 1.21GB per aggregator

128256

5121024

20484096

8192 10

20406080

100120140

128256

5121024

20484096

154080

200400600800

100012001400

# of Processes (N)

N->N Write

N->M Write

N->N Read

N->M Read

No Comp.

Indiv. Comp

No Comp.

Agnostic Aware No Comp.

Indiv. Comp

No Comp.

Agnostic Aware

Direct Grouped Direct GroupedALE3D Cactus

End-to-End Checkpointing Overhead

15,408 processesGroup size of 32 for NM schemesEach process takes a checkpoint

Converts network bound operation into CPU bound one

Transfer Overhead

CPU Overhead

) Reduction in Checkpointing Overhead

87%51%

No Comp.

Indiv. Comp

No Comp.

Agnostic Aware No Comp.

Indiv. Comp

No Comp.

Agnostic Aware

Direct Grouped Direct GroupedALE3D Cactus

End-to-End Restart Overhead

Reduced overall restart overheadReduced network load and transfer time

43% 71%

Reduction in I/O Overhead

62% 64%

Reduction in Recovery Overhead

Transfer Overhead

CPU Overhead

Summary of Scalable Checkpointing System

Developed data-aware checkpoint compression technique Relative improvement in compression ratio up to 115%

Investigated different merging techniquesDemonstrated effectiveness using real-world applications

Designed and developed MCRENGINEReduces recovery overhead by more than 62%Reduces checkpointing overhead by up to 87%Improves scalability of checkpoint-restart systems

Benefit-Aware Clustering of Checkpoints from Parallel Applications

Collaborators: Todd Gamblin, Kathryn Mohror, Adam Moody, Bronis de Supinski

Our Goal & Contributions

Goal:Can suitably grouping checkpoints increase compressibility?

Contributions: Design new metric for “similarity” of checkpoints

Use this metric for clustering checkpoints

Evaluate the benefit of the clustering on checkpoint storage

Different Clustering Schemes

5 6 7 8

9 10 11 12

Random Rank-wiseData-aware

Our Solution

Research Questions

How to cluster checkpoints?

Does clustering improve compression ratio?

Benefit-Aware Clustering

Similarity metric: Improvement in reductionGoal: Minimize the total compressed size

Benefit matrix of Cactus

Novel Dissimilarity Metric

Two factors for the dissimilarity between two checkpoints

Δ(i, j) = Σ [(i, k) – β(j, k)]2

×β(i, j)

Filter Order Similarity Cluster

How Benefit-Aware Clustering Works

double T[3000];double V[10];double P[5000];double D[4000];double R[100];

double D[4000];double P[5000];double T[3000];

double T[3000];double P[5000];double D[4000];

Chunking

WaveletSample

β(14 )

Cluster 1

Cluster 2

P1 P2 P3 P4 P5

Structure of MCRCLUSTER

F O S C

Compute Node

F O S CF O S C

Aggregator

AggregatorP1

Evaluation

ApplicationIOR (synthetic checkpoints)Cactus

Experimental test-bedLLNL’s Sierra: 261.3 TFLOP/s, Linux cluster23,328 cores, 1.3 Petabyte Lustre file system

Evaluation metric:Macro benchmark: Effectiveness of clusteringMicro benchmark: Effectiveness of sampling

Effectiveness of MCRCLUSTER

IOR: 32 checkpointsOdd processes write 0Even processes write: <rank> | 123456729% more compression compared to rank-wise, 22% more compared to random grouping

Effectiveness of Sampling

X axis: Each variableY axis: Range of benefit valuesTake away:

Chunking method preserves benefit relationships the closest

Chunking Wavelet Transform

Contributions of MCRCLUSTER

Design similarity and distance metric

Demonstrate significant result on synthetic data22% and 29% improvement compared to random and rank-wise clustering, respectively

Future directions for a first year Ph.D. studentStudy impact on real applicationsDesign scalable clustering technique

Applicability of My Research

Condor systems

Compression for scientific data

Conclusions

This thesis addresses:Reliability of checkpointing-based recovery in large-scale computing

Proposed three novel systems:FALCON: Distributed checkpointing system for GridsMCRENGINE: “Data-Aware Compression” and scalable checkpointing system for HPCMCRCLUSTER: “Benefit-Aware Clustering”

Provides a good foundation for further research in this field

Questions?

Future Directions: Reliability

Reliability: Similarity-based process grouping for better compression

Group processes based on similarity instead of rank [On going]Analytical solution to group size selectionVariable streaming

Integrating mcrEngine with SCR

Future Directions: Performance

Cache usage analysis and optimizationDeveloped user-level tool for analyzing cache utilization [Summer’12]

Short term goals:Apply to real-applicationsAutomate analysis

Long-term goals:Suggest potential code optimizationsAutomate application tuning

Contact Information

Tanzima Islam (tislam@purdue.edu)Website: web.ics.purdue.edu/~tislam

Effectiveness of mcrCluster

Backup Slides

[Backup Slide] Failures in HPC

“A Large-scale Study of Failures in High-performance Computing Systems”, by Bianca Schroeder, Garth Gibson

Breakdown of root causes of failures Breakdown of downtime into root causes

[Backup Slide] Failures in HPC

“Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm”, by Laxmikant Kalé et. al.

Disparity between network bandwidth and memory size

[Backup Slides] Falcon

[Backup Slide] Breakdown of Overheads

Performance scales with checkpoint sizesLower network transfer overhead

F D F D F D500MB 946MB 1677MB

[Backup Slide] Parallel Falcon

67% improvement in CPU time

F PF D F PF D F PF D500MB 946MB 1677MB

[Backup Slides] mcrEngine

[Backup Slide] How to Find Similarity

Group ToyGrp{ float Temperature[1024]; short Pressure[20][30]; int Humidity;};

Group ToyGrp{ float Temperature[50]; short Pressure[2][6]; double Unit; int Humidity;};

Var: “ToyGrp/Temperature”Type: F32LE, Array1[1024]

Var: “ToyGrp/Pressure”Type: S8LE, Array2D [20][30]

Var: “ToyGrp/Temperature”Type: F32LE, Array1D [50]

Var: “ToyGrp/Pressure”Type: S8LE, Array2D [2][6]

Inside a checkpoint: Variables annotated with metadata

Inside source code: Variables represented as members of a group in actual source code. A group can be thought of the construct “Struct” in C

Generated hash key for matching

Var: “ToyGrp/Unit”Type: F64LE, Atomic

Var: “ToyGrp/Humidity”Type: I32LE, Atomic

ToyGrp/Temperature_F32LE_Array1D

ToyGrp/Pressure_S8LE_Array2D

ToyGrp/Humidity_I32LE_Atomic

ToyGrp/Temperature_F32LE_Array1D

ToyGrp/Pressure_S8LE_Array2D

ToyGrp/Unit_F64LE_Atomic

Var: “ToyGrp/Humidity”Type I32LE, Atomic ToyGrp/Humidity_I32LE_Atomic

No match

[Backup Slide] Compression Ratio Follows Course of Simulation

Data-aware technique always yields better compression

Cactus

Aware-Block

Agnostic-Block

Agnostic

1.21.41.61.8

Simulation Time-steps

Cosmology Implosion

[Backup Slide] Relative Improvement in Compression Ratio Compared to Data-Agnostic Scheme

Application Total Size (GB)

Data Types(%)D F S F Int

Aware-Block (%)

Aware (%)

ALE3D 4.8 88.8 ~0 11.2 6.6 - 27.7 6.6 - 12.7

Cactus 2.41 33.94

0 66.06 10.7 – 11.9 98 - 115

Cosmology 1.1 24.3 67.2 8.5 20.1 – 25.6 20.6 – 21.1

Implosion 0.013 0 74.1 25.9 36.3 – 38.4 36.3 – 38.8

References

1. M. Burtscher and P. Ratanaworabhan, “FPC: A High-speed Compressor for Double-Precision Floating-Point Data”.

2. P. Lindstrom and M. Isenburg, “Fast and Efficient Compression of Floating-Point Data”.

3. L. Reinhold, “QuickLZ”.

Reliable and Efficient System for Storing Checkpoints in Grid

Execution Environment: Grid

State-of-the-Art: Checkpointing in Grid with Dedicated Storage

Submitter

@Purdue

@Notre Dame

@Indiana U.

Internet

Problems:(−) High transfer latency(−) Contention on servers(−) Stress on shared network resources

Dedicated Storage Server

Research Question

Can we improve the performance of applications by storing checkpoints on the

grid resources?

Overview of Our Solution: Checkpointing in Grid with Distributed Storage

SubmitterTanzima Islam (tislam@purdue.edu)

@Purdue

@Notre Dame

@Indiana U.

Internet

Q1. Which storage nodes?Q2. How to balance load?Q3. How to efficiently storage & retrieve?Constraint: -- All components must be user-level

Answer to Q1: Storage Host Selection

Build failure model for storage resourcesCompute correlated temporal reliability

Based on historical data

Rank machines Based on: reliability, load, and network overhead

Output:(m+k) storage hosts

Compute Host

Storage Host 1Storage Host 2

downdown

Objective function: checkpoint storing overhead – benefit from restart

Addresses Q2

Checkpoint-Recovery Scheme

Compression

Erasure Encoding

21 m+k

Original Checkpoint

Compressed

Fragments

Storage Host

Checkpoint Storing Phase

21 m+k

Decompression

Erasure Decoding (m)

Compressed

Original Checkpoint

Recovery Phase

Fragments

Storage Host

Evaluation Setup

2 different applications with 4 input setsMCF (SPEC CPU 2006)TIGR (BioBench)

System-level checkpointsMacro benchmark experiment

Average job makespan

Micro benchmark experimentsEfficiency of checkpoint and restartEfficiency in handling simultaneous clientsEfficiency in handling multiple failures

D F D F D F500MB 946MB 1677MB

Checkpoint Storing & Recovery Overhead

Performance scales with checkpoint sizesLower network transfer overhead

D F D F D F500MB 946MB 1677MB

Transfer Overhead

CPU Overhead

Overall Performance Comparison

Performance improvement between 11% and 44%

mcf tigr0

Benchmark Applications

Remote Dedicated Server

Local Dedicated Server

Falcon with Distributed Server

Summary of Reliable Checkpointing System

Developed a reliable checkpoint-recovery system FALCON

Select reliable storage hostsPrefer lightly loaded onesCompress and encodeStore and retrieve efficiently

Ran experiments with FALCON in DiaGridPerformance improvement between 11% and 44%

Compute Nodes

Gateway Nodes

Network Contention

Parallel File System

Hera Hera

Checkpointing in HPC

Contention for Shared File System Resources

Contention for Other Clusters for File System

2-D vs N-D Compression

Challenge in Extreme-Scale: Increase in Failure-Rate

1 Eflop/s

1 Gflop/s

1 Tflop/s

100 Gflop/s

100 Tflop/s

10 Gflop/s

10 Tflop/s

1 Pflop/s

100 Pflop/s

10 Pflop/s

Towards Online Clustering

Reduce dimension of βReduce the number of variables

Representative data-typeNumber of elements greater than a threshold [Example: 100 variables double-type covers 80% of data]

Reduce the amount of dataSampling: Random, chunking and wavelet

Reliable and Scalable Checkpointing Systems for Distributed Computing Environments

Documents

Transcript of Reliable and Scalable Checkpointing Systems for Distributed Computing Environments

An Efficient Distributed Burst Buffer for Linuxopensfs.org/wp-content/uploads/2014/04/D2_S24_AnEfficientDistribu… · 6 SOS: The Scalable Object Store. Checkpointing Parameters •

Evaluating Overheads of Integrated Multilevel Checkpointing

Scalable Engaged Learning Environments

Applications of DMTCP for Checkpointing · What is Checkpointing? Checkpointing is the action of saving the state of a running process to a checkpoint image ﬁle. Checkpointing supports

Parallel Checkpointing - Sathish Vadhiyar. Introduction Checkpointing? storing application’s state in order to resume later.

Checkpointing & Rollback Recovery

Scape project presentation - Scalable Preservation Environments

Fault Tolerance and Checkpointing - Sathish Vadhiyar.

Checkpointing 2.0

SRDS’03 Performance and Effectiveness Analysis of Checkpointing in Mobile Environments Xinyu Chen and Michael R. Lyu The Chinese Univ. of Hong Kong Hong.

Ch13 Checkpointing and Recovery

Checkpointing 2.0 Compiler-Assisted Checkpointing Uncoordinated Checkpointing.

Scalable Compression and Replay of Communication Traces in Massively Parallel Environments

Watch This: Scalable Cost-Function Learning for Path Planning in Urban Environments · 2016-07-11 · Watch This: Scalable Cost-Function Learning for Path Planning in Urban Environments

SCAPE Scalable Preservation Environments. 2 Its all about scalability! Scalable services for planning and execution of institutional preservation strategies.

Speculative Memory Checkpointing

Diskless Checkpointing

Rebound: Scalable Checkpointing for Coherent Shared ...iacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca11_2.pdf · • Processors synchronize and resume execution • Hardware automatically

D6.1 A scalable data management architecture for Smart ... A Scalable Data... · Almanac D6.1 A scalable data management architecture for Smart City environments ... Example of KairosDB

Rebound: Scalable Checkpointing for Coherent Shared Memory