Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Hobbes: OS and Runtime Support for Application Composition
Ron Brightwell, Coordinating PI
X-Stack/OSR PI Meeting, December 7-8, 2015
Hobbes Team

Institution | Person | Role
Georgia Institute of Technology | Ada Gavrilovska | PI
Indiana University | Thomas Sterling | PI
Los Alamos National Lab | Mike Lang | PI
Lawrence Berkeley National Lab | Costin Iancu | PI
North Carolina State University | Frank Mueller | PI
Northwestern University | Peter Dinda | PI
Oak Ridge National Laboratory | David Bernholdt | PI
Oak Ridge National Laboratory | Arthur B. Maccabe | Chief Scientist
Sandia National Laboratories | Ron Brightwell | Coordinating PI
University of Arizona | David Lowenthal | PI
University of California – Berkeley | Eric Brewer | PI
University of New Mexico | Patrick Bridges | PI
University of Pittsburgh | Jack Lange | PI
Project Goals
§ Deliver prototype OS/R environment for R&D in extreme-scale scientific computing
§ Focus on application composition as a fundamental driver
  § Develop necessary OS/R interfaces and system services required to support resource isolation and sharing
  § Support complex simulation and analysis workflows
§ Provide a lightweight OS/R environment with flexibility to build custom runtimes
§ Compose applications from a collection of enclaves
§ Leverage Kitten lightweight kernel and Palacios lightweight virtual machine monitor
§ Enable high-risk, high-impact research in virtualization, energy/power, scheduling, and resilience
Kitten Lightweight Kernel
§ Monolithic, C code, GNU toolchain, Kbuild configuration
§ Supports x86-64 and ARM-64
§ Boots on standard PC architecture, Cray XT, and in virtual machines
§ Boots identically to Linux (Kitten bzImage and init_task)
§ Repurposes basic functionality from Linux
  § Hardware bootstrap
  § Basic OS kernel primitives (lists, locks, wait queues, etc.)
  § Directory structure similar to Linux, arch dependent/independent dirs
§ Custom address space management and task management
  § User-level API for managing physical memory, building virtual address spaces
  § User-level API for creating tasks, which run in virtual address spaces (see the sketch after this list)
§ Small, highly reliable code base
  § Focused on scalable HPC applications
  § Low noise
  § Small memory footprint
§ Open source and freely available
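To make the shape of these user-level interfaces concrete, here is a minimal sketch of launching a task through a Kitten-style API. The lwk_* names and signatures are simplified illustrations assumed for this example, not Kitten's actual liblwk interface.

```c
/* Hypothetical sketch of launching a task through a Kitten-style
 * user-level API. The lwk_* names and signatures are simplified
 * illustrations, not Kitten's actual liblwk interface. */
#include <stddef.h>
#include <stdint.h>

typedef int aspace_id_t;
typedef int task_id_t;

/* Assumed primitives: create an address space, map physical memory
 * into it at a caller-chosen virtual address, start a task in it. */
extern int lwk_aspace_create(const char *name, aspace_id_t *id);
extern int lwk_aspace_map(aspace_id_t id, uintptr_t vaddr,
                          uintptr_t paddr, size_t len, int prot);
extern int lwk_task_create(aspace_id_t id, uintptr_t entry,
                           uintptr_t stack_top, task_id_t *tid);

#define APP_BASE  0x400000UL     /* where the app image is mapped    */
#define APP_SIZE  (16UL << 20)   /* 16 MB of pre-reserved phys pages */

int launch(uintptr_t entry, uintptr_t phys_base)
{
    aspace_id_t as;
    task_id_t   task;

    if (lwk_aspace_create("app", &as) != 0)
        return -1;
    /* In a lightweight kernel, user level controls physical memory
     * directly, so the mapping is explicit rather than demand-paged. */
    if (lwk_aspace_map(as, APP_BASE, phys_base, APP_SIZE, /*rwx*/ 7) != 0)
        return -1;
    return lwk_task_create(as, entry, APP_BASE + APP_SIZE, &task);
}
```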
Palacios Virtual Machine Monitor
§ OS-independent embeddable virtual machine monitor
§ Can be combined with Kitten or Linux
§ Full system virtualization
  § No need to modify guest OS
§ Supports running multiple guests concurrently
§ Makes extensive use of virtualization extensions in modern Intel and AMD x86 processors
§ Passthrough resource partitioning
§ Extensive configurability
§ Low noise
§ Open source and freely available
§ Small, highly reliable code base
Systems Are Converging to Reduce Data Movement
§ External parallel file system is being subsumed
  § Near-term capability systems using NVRAM-based burst buffers
  § Future extreme-scale systems will continue to exploit persistent memory technologies
§ In-situ and in-transit approaches for visualization and analysis
  § Can't afford to move data to separate systems for processing
  § GPUs and many-core processors are ideal for visualization and some analysis functions
§ Less differentiation between advanced technology and commodity technology systems
  § On-chip integration of processing, memory, and network
  § Summit/Sierra using InfiniBand
[Figure: exascale system alongside capability system, analytics cluster, visualization cluster, capacity cluster, and parallel file system.]
Applications and Usage Models are Diverging
§ Application composition becoming more important
  § Ensemble calculations for uncertainty quantification
  § Multi-{material, physics, scale} simulations
  § In-situ analysis and graph analytics
  § Performance and correctness analysis tools
§ Applications may be composed of multiple programming models
§ More complex workflows are driving need for advanced OS services and capability
  § "Workflow" has overtaken "Co-Design" as the most popular DOE buzzword :-)
§ Desire to support "Big Data" applications
  § Significant software stack comes along with this
§ Support for more interactive workloads
§ Requirements are independent of programming model and hardware
Multiphysics Example
[Three figure-only slides illustrating a multiphysics composition example.]
ENCLAVE COMPOSITION
A Deeper Look at Composition

Intra-Node Composition
§ Components co-located on same set of nodes
§ Isolate NOS environments on each node
§ Composition (coupling) takes place via shared memory

Inter-Node Composition
§ Components deployed to separate sets of nodes
§ Composition (coupling) takes place via network
Composition of Enclaves
§ An enclave provides a single OS/R environment to the application
§ Hobbes approach is to provide the minimum "amount" of OS/R required by the application (do what is necessary and get out of the way!)
§ Modern, complex applications are increasingly created by assembling (often substantial) software components
  § E.g., analytics connected to applications, code coupling, application frameworks, …
§ Components may have distinct requirements for OS/R support
§ Two options:
  § Assemble an all-in-one OS/R stack that satisfies all component needs
    § Potential challenges at both OS and RTS levels
    § Requires integration work for every combination supported
  § Provide each component the OS/R it needs, and provide efficient, low-level mechanisms to connect the components (and the OS/Rs)
Composition Examples
§ SNAP + Analytics
  § "SNAP calculates synonymous and non-synonymous substitution rates based on a set of codon-aligned nucleotide sequences." (HIV related)
  § Proxy app from LANL used for example
§ GTC-P + Analytics
  § Fusion simulation testing/proxy app used to test new hardware and algorithm integration into the PIC model. (PPPL)
  § Analytics generate statistics on particles (histograms), sorts, and filters on bounding boxes
§ LAMMPS + Analytics
  § Full, production molecular dynamics application from Sandia
  § Analytics look for crack formation by calculating atomic spacing in output data, to change the simulation from coarse- to fine-grained.
Composition Approach
§ ADIOS or TCASM as user-facing data exchange API
§ XEMEM (cross-enclave shared memory) as low-level transport (a usage sketch follows the figure below)
  § XEMEM extends XPMEM with a global name service
  § Cross-enclave signaling under development
§ Kitten lightweight kernel
§ Palacios high-performance virtual machine monitor
§ Pisces co-kernel architecture
§ Composition demonstrated with different OS combinations
  § Cray Compute Node Linux (CNL) and stock Linux
  § CNL and Kitten
  § Kitten and Kitten
[Figure: intra-node composition example. A simulation application running on Linux and an analytics application running on a Kitten co-kernel (Pisces) share the same hardware; the components are coupled through ADIOS/TCASM over XEMEM within the Hobbes runtime.]
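As a usage sketch of the transport described above: XPMEM's make/get/attach model is real, but the xemem_* names, the name-service calls, and the flag value below are assumptions based on the description of XEMEM as XPMEM plus a global name service, not a verified API.

```c
/* Sketch of cross-enclave sharing in the XPMEM/XEMEM style. The
 * xemem_* declarations below are assumptions modeled on XPMEM's
 * xpmem_make/xpmem_get/xpmem_attach, extended with the global name
 * service the slide describes; not a verified XEMEM API. */
#include <stddef.h>
#include <string.h>

typedef long xemem_segid_t;
typedef long xemem_apid_t;
struct xemem_addr { xemem_apid_t apid; long offset; };

/* Assumed primitives: export a region under a global name, look it
 * up from another enclave, and map it into the local enclave. */
extern xemem_segid_t xemem_make(void *vaddr, size_t size, const char *name);
extern xemem_segid_t xemem_search(const char *name);
extern xemem_apid_t  xemem_get(xemem_segid_t segid, int flags);
extern void         *xemem_attach(struct xemem_addr addr, size_t size, void *at);

/* Producer enclave: export a buffer under a well-known name. */
static char buf[4096];
void producer(void)
{
    xemem_make(buf, sizeof(buf), "sim.output"); /* register with name service */
    strcpy(buf, "coupled data");                /* now visible to peers */
}

/* Consumer enclave: discover the segment by name and map it. */
const char *consumer(void)
{
    struct xemem_addr a = { .offset = 0 };
    xemem_segid_t seg = xemem_search("sim.output"); /* global name lookup */
    a.apid = xemem_get(seg, /* assumed RDWR flag */ 1);
    return xemem_attach(a, sizeof(buf), NULL);      /* map into this enclave */
}
```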
Composition Working Group Current Focus Areas
§ Composition use cases
  § Identifying use case categories and specific examples
  § Analyzing coupling requirements of specific use cases
  § Identifying abstractions necessary for user- and system-level composition
§ "Hobbes composition language"
  § How to describe a composite application and its mapping onto system resources ("job control language" for composite applications)
§ Gathering and understanding "related work"
  § E.g., workflow languages, cloud configuration/deployment tools, CCA, ADIOS & TCASM APIs, constraint programming languages, …
Composition Working Group Goals
§ Working towards three basic API specifications…
§ Application-level composition API
  § Basic user-level abstractions for cross-enclave data sharing
  § May be several distinct (sets of) abstractions
  § Not necessarily implemented by/in Hobbes (e.g., ADIOS, TCASM, …)
§ System-level composition API
  § Used to write cross-enclave "transports" to be used by application-level composition libraries
  § Cross-enclave shared memory (example), cross-enclave signaling, node-level name service
§ Node-level enclave management API
  § Set and query configuration of enclaves on node
  § Resource allocation, VMM & OS, …
Enabling Multi-OS/R Stack Application Composition
In-situ Simulation + Analytics composition in single Linux OS vs. Multiple Enclaves
• Problem
  – HPC applications evolving to more compositional approach; overall application is a composition of coupled simulation, analysis, and tool components
  – Each component may have different OS/R requirements; no "one-size-fits-all" OS/R stack
• Solution
  – Partition node-level resources into "enclaves", run a different OS/R instance in each enclave
    (Pisces Co-kernel Architecture: http://www.prognosticlab.org/pisces/)
  – Provide tools for creating and managing enclaves, launching applications into enclaves
    (Leviathan Node Manager: http://www.prognosticlab.org/leviathan/)
  – Provide mechanisms for cross-enclave application composition and synchronization
    (XEMEM Shared Memory: http://www.prognosticlab.org/xemem/)
• Recent results
  – Demonstrated Multi-OS/R approach provides excellent performance isolation; better-than-native performance possible
  – Demonstrated drop-in compatibility with both commodity and Cray Linux environments
• Impact
  – Application components with differing OS/R requirements can be composed together efficiently within a compute node, minimizing off-node data movement
  – Compatible with unmodified vendor-provided OS/R environments, simplifying deployment
• Problem
  – HPC applications increasingly comprise multiple distinct components with different requirements for OS, software stack, and system resources
  – E.g., simulation+analytics, coupled multiphysics, scalable performance analysis and debugging
• Solution
  – Instantiate "enclaves" for each application component using high-performance virtualization technology
  – Provide OS and software stack tailored for the application component within each enclave
  – Provide mechanisms for controlled interaction between enclaves (components)
    – Selective sharing of memory regions (data exchange)
    – Name service (discovery and rendezvous)
• Recent results
  – Proof-of-principle for XEMEM cross-enclave memory API
  – Use XEMEM as "transport" in ADIOS, TCASM coupling tools
  – Demonstrate composite simulation+analytics applications using XEMEM
• Impact
  – Composition can be made transparent at the application level (no changes, performance neutral)
  – Allows detailed resource management and scheduling among enclaves (other Hobbes R&D areas)
System-Level Support for Composition of Applications
NODE VIRTUALIZATION LAYER
Recent NVL Accomplishments
§ Demonstrated Node Virtualization Layer (NVL) functionality
  § Improved inter-enclave isolation (HPDC'15)
  § High-performance inter-enclave memory sharing (HPDC'15)
  § Cross-enclave code coupling (ROSS'15)
§ Hybrid VMM basic functionality
  § Booting on all cores of Knights Corner Phi (228 cores)
  § Demonstrated Hybrid VMM "boot" faster than native Linux fork/exec
§ NVL next steps in development
  § Libhobbes – library combining NVL client functionality under "one roof"
  § Multi-node support – NVL driver for Cray Gemini network
  § Integration of TCASM with XPMEM to enable another form of cross-enclave code coupling
NVL Architecture
[Figure: Pisces partitions the hardware between Linux and two Kitten co-kernels: Kitten co-kernel (1) running an isolated application, and Kitten co-kernel (2) running applications + virtual machines in an isolated virtual machine under the Palacios VMM.]
§ Co-kernel architecture: multiple OS kernels run side-by-side on the same node in different enclaves
§ Pisces infrastructure used to launch and manage enclaves and bind enclaves together
§ XEMEM mechanism developed to enable cross-enclave memory sharing
[Figure: co-kernel architecture. Two hardware partitions, one running Linux and one running the Kitten co-kernel, each with user and kernel contexts. Control processes in each partition exchange cross-kernel messages over a shared-memory control channel, while Linux-compatible workloads and isolated processes + virtual machines communicate over shared-memory communication channels.]
1. Co-Kernel Architecture, Three Enclave Example
2. Cross-enclave communication used for enclave control and for cross-enclave app code coupling
NVL Provides Excellent Inter-Enclave Isolation
[Figure panels: "Native Linux, Single Linux Image" (top) vs. "Kitten Enclave with Same Competing Workload" (bottom)]
§ Co-kernel architecture nearly eliminates OS-induced interference
§ When using a single Linux OS (top), a competing workload induces noise on other processes, even when they are pinned to disjoint cores and memory
§ Isolating processes in a separate Kitten enclave (bottom) eliminates this interference
Excellent Isolation of NVL/Co-Kernel Arch Leads to Increased Performance and Reduced Run-to-run Variability
Comparison of Mini-app/Benchmark Performance With and Without a Competing Background Workload (Kernel Compile)
Hybrid VMM: Overview
§ Terminology
  § Hybrid Run-Time (HRT) is a run-time + application that run entirely at kernel level
  § Regular Operating System (ROS) is a full-blown OS stack (e.g., Linux)
  § Hybrid Virtual Machine (HVM) allows a single VM to contain both a ROS and an HRT simultaneously, giving each a distinct view of (and access to) the resources, as well as distinct interfaces to the VMM
§ We have developed the core framework for enabling HRTs (with and without HVM) as well as an initial implementation of the HVM
Hybrid VMM: Model
[Figure: (a) Current model: parallel app and parallel run-time in user mode on a general kernel over node hardware. (b) Hybrid run-time model: parallel app and run-time fused into a kernel-mode Hybrid Run-time (HRT) directly on node hardware. (c) Hybrid run-time model within a Hybrid Virtual Machine: a general kernel stack (legacy path, general virtualization model) and an HRT (performance path, specialized virtualization model) share one VM.]
Hybrid VMM: Nautilus Kernel Framework
§ A new kernel developed specifically to enable creation and porting of HRTs
  § Kyle Hale's thesis work
§ Currently running on:
  § x64 bare metal (64 cores)
  § Intel Xeon Phi (3120A, 228 cores)
  § Palacios HVM
§ With ports of:
  § Legion run-time and circuit simulation application
  § NESL run-time (VCODE engine)
  § NDPC (home-grown nested data parallel language)
§ ~24K SLOC (C, C++, assembly)
§ <1K SLOC needed to enable RT ports
Hybrid VMM: Nautilus Initial Performance (x64)
[Figure 4: Average (a), minimum (b), and maximum (c) time in cycles to create 2-64 threads in sequence, Nautilus vs. Linux. Average (d), minimum (e), and maximum (f) speedup of Nautilus over Linux for multiple thread creations.]
single nodes as core counts continue to scale up. The introduction of variance by OS noise (not just by asynchronous paging events) not only limits the performance and predictability of existing run-times, but also limits the kinds of run-times that can take advantage of the machine. For example, run-times that need tasks to execute in synchrony (e.g., in order to support a bulk-synchronous parallel application or a run-time that uses an abstract vector model) will experience serious degradation if OS noise comes into play.

The use of a single unified address space also allows very fast communication between threads, and eliminates much of the overhead of context switches when Nautilus boots with preemption enabled. The only preemption is between kernel threads, so no page table switch ever occurs. This is especially useful when Nautilus runs virtualized, as a large portion of VM exits come from paging-related faults and dynamic mappings initiated by the OS, particularly using shadow paging. A shadow-paged Nautilus exhibits the minimum possible shadow page faults, and shadow paging can be more efficient than nested paging, except when shadow page faults are common.
Events. Events are a common abstraction that run-time systems often use to distribute work to execution units, or workers. The Legion run-time makes heavy use of them, so we wanted to make sure that Nautilus provided an efficient implementation of them. In Legion, the events are used to notify logical processors (Legion threads) when there are tasks ready to execute. To help show the potential of Legion + Nautilus as an HRT, we measured the performance of these "wakeup" events.

Figure 6 shows the average latency between an event notification and the subsequent wakeup. Here, we had a single thread on one core go to sleep and wait for an event notification from a thread running on the adjacent physical core. The latency is measured in cycles and the average is taken over 100 runs. The first box on the left shows the latency of a common mechanism used in Linux for event notification, the pthread implementation of condition variables.

[Figure 6: Average event wakeup latency in cycles for Linux pthread condition variables vs. three Nautilus mechanisms (N. MWAIT, N. condvar, N. w/kick); annotations note that MONITOR/MWAIT is not available in user space and that user-space overhead is too high.]

In this case, the measurement is the time between calling pthread_cond_signal and the subsequent wakeup from pthread_cond_wait. A wakeup takes about 25000 cycles. The following three boxes show various Nautilus implementations of event notification. "N. MWAIT" shows the latency when using the newer MONITOR/MWAIT extensions provided by modern processors. These instructions allow one thread to go to sleep on a range of memory, waiting for a write to that memory by another thread. Note that the MONITOR/MWAIT instructions are not available in user space. While the latency improves considerably over pthread's condition variables in Linux user space, we suspect it is limited by the hardware latency incurred when waking up from a sleep state. The MONITOR/MWAIT extensions provide optional hints to enter lower sleep states and allow faster wakeups, but our machine does not support this feature. The final two boxes ("N. condvar" and "N. w/kick") show the Nautilus implementation of condition variables. They are very lightweight, and a signal will essentially just enqueue the waiting thread on a processor's run queue.
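The MONITOR/MWAIT mechanism behind the "N. MWAIT" result can be sketched as follows. This is a generic kernel-mode illustration of the technique, not Nautilus's actual implementation.

```c
/* Minimal sketch of kernel-level event wait using MONITOR/MWAIT,
 * the mechanism behind the "N. MWAIT" measurement above. Generic
 * x86-64, kernel mode (the instructions fault in user space);
 * illustrative only, not the Nautilus code. */
#include <stdint.h>

static volatile uint64_t event_flag;  /* waiter sleeps on this cache line */

/* Waiter: arm the monitor on the flag's cache line, then mwait until
 * another core writes to it. Re-checking after MONITOR avoids a lost
 * wakeup; the loop guards against spurious wakeups. */
static void wait_for_event(void)
{
    while (!event_flag) {
        __asm__ volatile("monitor"
                         :: "a"(&event_flag), "c"(0UL), "d"(0UL));
        if (event_flag)      /* write landed between check and monitor */
            break;
        __asm__ volatile("mwait" :: "a"(0UL), "c"(0UL));
    }
}

/* Notifier (on another core): a plain store wakes the monitoring
 * core; no interrupt, IPI, or kernel/user transition is required. */
static void signal_event(void)
{
    event_flag = 1;
}
```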
§ Microbenchmarks
  § Thread creation 11x faster than Linux/pthreads
    § Including thread fork capability
  § Event wakeup 5x faster than Linux/pthreads (right)
§ Macrobenchmarks
  § Legion circuit simulation benchmark already 5% faster at 64 cores
§ Both due to leveraging kernel-mode-only features and avoiding all kernel/user transitions

Latency of Waking up a Thread on a Remote Core
Hybrid VMM: HVM in Palacios
§ Partitions VM's cores into ROS cores and HRT cores
§ HRT core view
  § Specialized fast bootstrap of HRT ELF (in microseconds)
    § No BIOS, etc.
    § Immediate startup in preconfigured long-mode environment (co-designed with Nautilus)
    § Extended Multiboot2 model
  § Environment leads to few VM exits during typical bootup/execution
    § Possible prebuilt VM structures (e.g., page tables) and selective direct hardware access further reduce VM exits
    § Not a target of typical interrupts, etc.
  § All memory of VM accessible
  § All APICs of VM accessible
  § Separately rebootable from ROS cores
§ ROS core view
  § Traditional VM's view (BIOS boot, ACPI, devices, etc.), just smaller
  § Not all memory of VM accessible/visible (HRT can hide memory)
  § Not all APICs accessible (HRT cores' APICs cannot be targeted)
§ ~2K SLOC additions to Palacios (so far)
Hybrid VMM: HVM in Palacios
§ Initial bootstrap timings show HRT core reboot time is similar to fork() and exec()
Item | Cycles (exits) | Time (AMD 4122)
HRT core boot of Nautilus to main() | ~135K (7 exits) | 61 µs
Linux fork() | ~320K | 145 µs
Linux exec() | ~1M | 476 µs
Linux fork()+exec() | ~1.5M | 714 µs
HRT core boot of Nautilus to idle thread | ~37M (~2300 exits) | 17 ms
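For context on how cycle counts like those above can be collected, here is a generic RDTSC-based measurement sketch, timing fork() on Linux. The actual Hobbes/Palacios instrumentation is not described in these slides, so this is only illustrative.

```c
/* Generic sketch: timing an operation in cycles with RDTSC, the kind
 * of measurement behind tables like the one above. Not the actual
 * Hobbes/Palacios instrumentation. x86-64, GCC/Clang, Linux. */
#include <stdint.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t start = rdtsc();
    pid_t pid = fork();            /* the operation under test */
    if (pid == 0)
        _exit(0);                  /* child exits immediately */
    waitpid(pid, NULL, 0);
    /* Note: this includes reaping the child, not just fork() itself. */
    printf("fork()+reap: ~%llu cycles\n",
           (unsigned long long)(rdtsc() - start));
    return 0;
}
```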
Legion on Kitten
§ Successfully integrated a shared-memory version of the Legion runtime into the Kitten environment with minimal effort
§ Successfully ran LGNCG, a Legion port of the HPCG benchmark code, on Kitten (single node)
§ Distributed LGNCG with Legion on the Kitten environment to Peter Dinda for his Legion investigation
GLOBAL INFORMATION BUS
GIB Activities
§ GIB "design-a-thon" meeting 3/18/15, hosted by Matthew Wolf at GaTech
  § Representation from both ARGO and Hobbes
  § Status updates from project participants, including:
    § BEACON over EVPath
    § LAMMPS composition use case from Hobbes ROSS paper
§ Discussion of potential GIB data store foundations
  § Sandia's Kelpie, several open source packages, and Proactive Data Store (PDS)/Drift
  § Reports about preliminary experience and experimentation with some open source packages, e.g., Redis and MongoDB
  § Decided to pursue design using PDS as foundation
§ Detailed discussion of GIB use cases
  § Boot
  § System monitoring
  § Composed application (launch, normal operation, response to failures/faults, graceful shutdown)
GIB Activities (II)
§ Post-meeting actions
  § Created repositories for meeting documents, data store source code/documentation
  § [ongoing] Converting whiteboard snapshots describing use case discussions into electronic documents for dissemination and review
§ ARGO/Hobbes telecon 4/13/15
  § Outbrief for people who weren't in attendance at GaTech meeting
  § More discussion regarding data store interface and implementation
    § Suggestion to consider Riak, leading to some exploration with Riak open source version
    § Questions about PDS and its integration into EVPath
§ Ongoing activities
  § Design and preliminary implementation of data store
    § Exploration of using Riak
    § Exploration of integrating PDS with BEACON
  § Development of GIB use case document
Support for extreme-scale OS/R monitoring and control
• Problem
  – Operating system/runtime (OS/R) components running throughout the system must be monitored and controlled, but extreme system scale makes it difficult to do so (too much data, and/or too many "hops" to get data from one part of the system to another)
• Solution
  – Integrate scalable, distributed data store with publish and subscribe service in a Global Information Bus (GIB)
  – Interface with Hobbes Leviathan node-level resource manager
• Recent progress
  – Defined important GIB use cases
    – System boot
    – Launch application
    – Respond to application failure
    – Respond to application termination
  – Designed and began pilot implementation integrating a distributed data store based on the Riak open source database, the BEACON publish-subscribe software from the ARGO project, and Leviathan
• Impact
  – Supports monitoring and control of a large number of system software components without excessive application intrusion
  – Usable by both Hobbes and ARGO projects
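To make the monitoring-and-control flow concrete, the sketch below shows the shape of a publish/subscribe interaction a GIB client might perform. The gib_* functions and topic strings are hypothetical illustrations invented for this sketch; they are not the BEACON, Riak, or Leviathan APIs.

```c
/* Hypothetical sketch of a GIB-style monitoring client: publish node
 * health into the distributed data store and subscribe to enclave
 * events. The gib_* API and topic strings are invented for
 * illustration; not BEACON, Riak, or Leviathan interfaces. */
#include <stdio.h>

typedef void (*gib_cb_t)(const char *topic, const char *value);

/* Assumed primitives: durable put into the data store, plus
 * publish/subscribe notifications layered on top of it. */
extern int gib_put(const char *key, const char *value);
extern int gib_publish(const char *topic, const char *value);
extern int gib_subscribe(const char *topic, gib_cb_t cb);

static void on_enclave_event(const char *topic, const char *value)
{
    /* e.g., Leviathan reporting an application failure in an enclave */
    printf("event on %s: %s\n", topic, value);
}

int main(void)
{
    /* Record state durably, then notify any subscribers. */
    gib_put("/node/42/health", "ok");
    gib_publish("/node/42/health", "ok");

    /* Watch for application lifecycle events on this node; a real
     * client would then enter an event loop. */
    gib_subscribe("/node/42/enclave/*", on_enclave_event);
    return 0;
}
```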
[Figure: GIB data store and publish/subscribe components. Each node runs Leviathan, application(s), and BEACON; BEACON instances connect the nodes to a set of replicated data stores. Dashed lines indicate potential notifications from publishers to subscribers.]
RESILIENCE
Hobbes Resilience Effort
§ OS/R data structure resilience
  § Hashes, redundant data, self checks, and self repairs
  § HR/HI component: rejuvenation/migration
§ OS/R resilience building blocks (reuse existing solutions)
  § Membership management protocols with different consistency
  § Persistent state management (resilient distributed key/value stores)
  § Reliable/unreliable publish/subscribe event notification APIs
§ Tunable resilience and cross-layer/-enclave coordination
  § Cost model of different cross-layer/-enclave resilience choices
  § Inter-enclave workflow coordination that annotates data
  § HR/HI component: interfaces for autonomic management
§ OS/R support for fault sensitivity and coverage analysis
  § HR/HI component: support for fault injection and controlled experiments
Hobbes Resilience Update
§ OS/R data structure resilience
  § Tracking memory allocations within the Linux OS to record supplemental location info on generic allocations, including the requesting module and source file/line
  § Protecting a basic slab memory allocator from corruption using data corruption detection and correction based on already-existing pointer redundancy (see the sketch after this list)
  § Transparently store resilience metadata alongside dynamically allocated OS data to provide data structure resilience on an as-needed basis
§ OS/R support for fault sensitivity and coverage analysis
  § Review of the current status of the Linux fault injection support revealed great potential for extension to permit injection of errors in specific Hobbes OS subsystems
  § Initial work focused on injecting fatal failures in the guest OS to identify the isolation capabilities of the virtualization environments KVM, QEMU, and Palacios
§ Testbed system for Hobbes resilience and other Hobbes efforts
  § Deployed Kitten and Palacios (with Linux as host OS) on bare hardware and on QEMU on a 960-core computer science research cluster at ORNL
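As referenced in the list above, here is a minimal sketch of detection and repair using the pointer redundancy already present in a circular doubly linked list. It illustrates the general technique, not the project's slab-allocator implementation.

```c
/* Sketch of data-structure self-check/self-repair using pointer
 * redundancy already present in a circular doubly linked list:
 * node->next->prev must point back at node, so a single corrupted
 * back link can be detected and restored from the forward link.
 * Illustrative of the technique, not the project's slab allocator. */
struct list_node {
    struct list_node *next, *prev;
};

/* Walk at most max_nodes links (bounds the walk in case corruption
 * breaks circularity). Returns the number of links repaired, or -1
 * if the damage is ambiguous. */
int check_and_repair(struct list_node *head, int max_nodes)
{
    int repaired = 0;
    struct list_node *n = head;
    for (int i = 0; i < max_nodes && n->next != head; i++, n = n->next) {
        if (n->next->prev != n) {
            /* If the forward chain beyond n->next is still consistent,
             * the back link is the corrupt one; restore it from 'n'. */
            if (n->next->next && n->next->next->prev == n->next) {
                n->next->prev = n;
                repaired++;
            } else {
                return -1;  /* conflicting evidence: flag, don't guess */
            }
        }
    }
    return repaired;
}
```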
mini-ckpts: Surviving OS Failures in Persistent Memory
• Problem
  – A failure of the operating system (OS) causes a failure of an otherwise healthy HPC application
• Solution
  – Execute the application in persistent memory (PRAMFS in DRAM) that is able to survive OS failures and reboots
  – Track OS state used by the application and MPI for recovery
  – Rejuvenate (warm reboot) the OS in case of a failure
  – Restore tracked OS state used by the application and MPI
  – Transparently continue to execute the application in persistent memory without loss of state/progress
• Recent results
  – Prototype implementation supports OpenMP and MPI applications with certain limitations
  – OS rejuvenation and recovery takes 3-6 seconds
  – Failure-free runtime overhead is 3-5% for a number of key HPC workloads
• Impact
  – First solution that transparently offers OS failure tolerance without loss of state/progress
  – Transparently handling OS failures locally reduces the need for global checkpoint/restart
  – Latent OS errors that have not resulted in a failure can be cleared by rejuvenating the OS
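The persistent-memory idea in the solution above can be sketched as follows. The PRAMFS mount point and file name are assumptions for illustration, and the real mini-ckpts system additionally tracks and restores kernel and MPI state.

```c
/* Sketch of the persistent-memory idea behind mini-ckpts: back the
 * application's state with a file in a RAM-resident persistent
 * filesystem (PRAMFS) so the pages survive a warm OS reboot. The
 * mount point /mnt/pramfs and file name are assumptions; the real
 * mini-ckpts system also tracks and restores kernel/MPI state. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define STATE_SIZE (1 << 20)

int main(void)
{
    int fd = open("/mnt/pramfs/app_state", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, STATE_SIZE) != 0)
        return 1;

    /* MAP_SHARED: stores go to the PRAMFS-backed pages, which live in
     * DRAM that a warm-rebooted kernel leaves untouched. */
    char *state = mmap(NULL, STATE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (state == MAP_FAILED)
        return 1;

    if (state[0] == '\0')
        strcpy(state, "iteration 0");   /* first run: initialize */
    printf("recovered state: %s\n", state);

    munmap(state, STATE_SIZE);
    close(fd);
    return 0;
}
```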
[Figures: runtime in seconds of NPB CG, LU, EP, and IS, PENNANT, and CloverLeaf under Open MPI, librlmpi, librlmpi with PRAMFS remap, and librlmpi with mini-ckpts & PRAMFS; and additional runtime in seconds versus number of kernel panic injections (0-4) for CG, IS, and LU, with same and alternating injection targets.]
http://xstack.sandia.gov/hobbes