Post on 29-Jan-2016
http://ftg.lbl.gov/checkpoint
checkpoint@lbl.gov
An Overview of
Berkeley Lab’s
Linux Checkpoint/Restart
(BLCR)
Paul Hargrove with Jason Duell and Eric Roman
January 13th, 2004
(Based on slides by Jason Duell)
Linux checkpoint/restart
Outline
Project goals
System design
Extension interface
Current status
Future work
Uses of Checkpoint/Restart
Gang scheduling
● No queue drain for maintenance or policy changes
● Higher utilization and/or more flexible scheduling
Process migration
● Save job if node failure is imminent
● Pack jobs for optimal network performance
Periodic backup
● Not our main focus
● Applications can always do this more efficiently
● But may be useful for systems with long jobs, fast I/O, and/or high node failure rates
Implementation Strategies
Application-based checkpointing
● Efficient: save only needed data as each step completes
● Good for fault tolerance, bad for preemption
● Requires per-application effort by programmer
Library-based checkpointing
● Portable across operating systems
● Transparent to application (but may require relink, etc.)
● Can't (generally) restore all resources (e.g., process IDs)
● Can't checkpoint shell scripts
Kernel-based checkpointing
● Not portable, and harder to implement
● Can save/restore all resources
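To make the application-based strategy concrete, here is a minimal sketch of a program that saves only the data it needs as each step completes. It is not part of BLCR; the function names (`save_state`, `load_state`, `run`), the JSON file format, and the `stop_after` parameter used to simulate a failure are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write `state` to `path` atomically, so a crash never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_state(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, stop_after=None):
    """Sum 0..total_steps-1, checkpointing after every completed step.

    `stop_after` (hypothetical) simulates a node failure after that many
    steps in this invocation; the function then returns None.
    """
    state = load_state(path, {"step": 0, "total": 0})
    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            return None
        state["total"] += state["step"]
        state["step"] += 1
        save_state(path, state)  # only the two values that matter are saved
        executed += 1
    return state["total"]
```

The sketch shows both sides of the trade-off listed above: the checkpoint is tiny, but it can only be taken at a step boundary the programmer chose, which is why this approach suits fault tolerance better than preemption.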
Design Goals
Target: parallel scientific applications
● MPI is a must
● But allow support for other programs/models, too
● Esoteric features (ptrace, Unix domain sockets) have lower implementation priority
Implementation: Linux kernel module
● Lower barrier to adoption than a kernel patch
● Allows upgrades and bug fixes without reboot
Design Goals II
Provide ‘toolkit’ for distributed C/R
● We provide single node checkpoint/restart
● We don’t support distributed operating system features
  • No built-in support for TCP sockets, bproc namespaces, etc.
● We provide hooks to allow parallel runtimes/libraries to implement distributed checkpoint/restart
  • So the MPI library needs to know about checkpointing, but user applications don’t
Extension Interface
Callback functions
● Registered at startup (or as needed)
● Run at checkpoint time, then resume at restart/continue
● Handle parallel coordination and/or unsupported objects
Two types of callbacks
● Signal handler context
  • Run with same PID (LinuxThreads); no thread-safety needed
  • But callback limited to calling signal-safe functions (small subset of POSIX)
● Separate thread context
  • Can call any function
  • But code needs to be thread-safe, and runs with a separate PID (LinuxThreads)
Critical sections
● Use to protect uncheckpointable sections of code
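The callback and critical-section ideas above can be modeled with a small toy. This is a conceptual sketch only and does not use or reproduce the BLCR library interface: the class `CheckpointHooks`, its methods, the `CONTINUE`/`RESTART` constants, and `make_network_callback` are hypothetical names. Each callback is written as a generator that runs up to the point where the checkpoint is taken, then resumes with the outcome; a checkpoint requested inside a critical section is deferred until the section ends.

```python
from contextlib import contextmanager

CONTINUE, RESTART = "continue", "restart"


class CheckpointHooks:
    """Toy model of checkpoint callbacks and critical sections (not the BLCR API)."""

    def __init__(self, take_checkpoint):
        self._take_checkpoint = take_checkpoint  # returns CONTINUE or RESTART
        self._callbacks = []
        self._cs_depth = 0     # nesting depth of active critical sections
        self._pending = False  # a request arrived inside a critical section

    def register(self, callback):
        """Register a generator function that yields once, at the checkpoint."""
        self._callbacks.append(callback)

    @contextmanager
    def critical_section(self):
        """Protect code that must not be interrupted by a checkpoint."""
        self._cs_depth += 1
        try:
            yield
        finally:
            self._cs_depth -= 1
            if self._cs_depth == 0 and self._pending:
                self._pending = False
                self._run()  # perform the deferred checkpoint now

    def request_checkpoint(self):
        """Checkpoint now, or defer (returning None) inside a critical section."""
        if self._cs_depth > 0:
            self._pending = True
            return None
        return self._run()

    def _run(self):
        started = []
        for callback in self._callbacks:
            gen = callback()
            next(gen)  # run the part that executes at checkpoint time
            started.append(gen)
        outcome = self._take_checkpoint()
        for gen in reversed(started):
            try:
                gen.send(outcome)  # resume at restart/continue
            except StopIteration:
                pass
        return outcome


def make_network_callback(log):
    """Example callback: quiesce before the checkpoint, recover afterwards."""
    def callback():
        log.append("quiesce network")
        outcome = yield  # the checkpoint is taken here
        log.append("reconnect peers" if outcome == RESTART else "resume traffic")
    return callback
```

A parallel runtime would register something like `make_network_callback(log)` once at startup and wrap uncheckpointable operations in `with hooks.critical_section():`. The restart branch is where the runtime would rebuild objects the checkpointer does not save.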
Current Status
Support LAM-MPI jobs
● Both TCP and Myrinet supported
● Infrastructure in place for Infiniband, Quadrics
● Process migration: currently must restart whole job
Simple semantics for open files
● Reopen and seek to original position
● Must be regular files (pipe support coming soon)
● Files must exist in same location on filesystem
Single- and multi-threaded processes
● Checkpoint of ‘mpirun’ checkpoints whole MPI job
● Will support process groups, sessions in future
● Restore original PID
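The "reopen and seek" semantics for open files can be illustrated in user space. The sketch below is an assumption-laden model, not BLCR's kernel implementation: `record_open_file` and `restore_open_file` are hypothetical helpers, and the caller is assumed to know the path and flags the file was opened with.

```python
import os


def record_open_file(fd, path, flags):
    """Capture what is needed to recreate `fd` later: path, flags, and offset."""
    return {
        "path": path,
        "flags": flags,
        "offset": os.lseek(fd, 0, os.SEEK_CUR),  # current position, unchanged
    }


def restore_open_file(record):
    """Reopen the file at the same path and seek to the original position.

    Raises FileNotFoundError if the file no longer exists at that location,
    matching the constraint that files must be in the same place on restart.
    """
    # Creation and truncation flags must not be replayed on reopen.
    flags = record["flags"] & ~(os.O_CREAT | os.O_EXCL | os.O_TRUNC)
    fd = os.open(record["path"], flags)
    os.lseek(fd, record["offset"], os.SEEK_SET)
    return fd
```

Note that nothing here saves the file's contents, so a file modified between checkpoint and restart is reopened silently with its new contents.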
Current status II
Work with a wide variety of 2.4 kernels
● kernel.org versions 2.4.3 onwards
● Red Hat: 7.2 through 9
● SuSE: 7.2 through 9.0
● Autoconf feature probing, so support for custom-patched kernels is likely to be automatic
● We’ll maintain 2.4 support once 2.6 comes out
Support both new and old pthreads
● I.e., old “LinuxThreads”, plus new 2.6 pthreads (backported to 2.4 by Red Hat)
Future Work
Support for sessions & process groups
● Including pipes, mmaps, etc., shared within group
● Full restoration of parent/child tree, with original PIDs
More semantics for files
● Allow checksum of file, with restart error if it has changed
● Allow saving contents of file (restore either clobbers, or opens anonymously)
● Support files that are not open at checkpoint time, but are specified as being part of the checkpoint
Laundry list of other resources to support
● Page 4 of “Design and Implementation” paper
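The proposed "checksum of file, with restart error if it has changed" option could look roughly like the following. This is a sketch of the idea only; the choice of SHA-256 and the names `file_digest`, `verify_unchanged`, and `FileChangedError` are assumptions, since the slides do not say how the checksum would be computed or reported.

```python
import hashlib


class FileChangedError(RuntimeError):
    """Raised at restart when a file differs from its checkpoint-time state."""


def file_digest(path, chunk_size=1 << 20):
    """Return a SHA-256 hex digest of the file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_unchanged(path, expected_digest):
    """Fail the restart if the file's contents changed since the checkpoint."""
    if file_digest(path) != expected_digest:
        raise FileChangedError(f"{path} changed since checkpoint")
```

At checkpoint time the digest from `file_digest` would be stored alongside the saved process state; at restart, `verify_unchanged` would run before the file is reopened.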
Future Work II
Integration with parallel job systems
● Funded to work within the suite from DOE Scalable Systems Software SciDAC; work is in progress
● Possibility of OpenPBS, PBSPro support
● Interested in others (LSF, SGE, SLURM, etc.)
More MPI implementations
● MPICH 2 support anticipated
● Vendor support (Quadrics)?
● LAM/MPI support for partial/live migration
Conclusion
http://ftg.lbl.gov/checkpoint
Papers (available from website):
● “Design and Implementation of BLCR”: high-level system design, including description of user API
● “Requirements for Linux Checkpoint/Restart”: exhaustive list of Unix features we will support (or not)
● “A Survey of Checkpoint/Restart Implementations”: focusing on open source versions that run on Linux
● “The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing”: implementation with LAM/MPI