Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Hobbes: OS and Runtime Support for Application Composition
Ron Brightwell, Coordinating PI
X-Stack/OSR PI Meeting, December 7-8, 2015
Hobbes Team

Institution | Person | Role
Georgia Institute of Technology | Ada Gavrilovska | PI
Indiana University | Thomas Sterling | PI
Los Alamos National Lab | Mike Lang | PI
Lawrence Berkeley National Lab | Costin Iancu | PI
North Carolina State University | Frank Mueller | PI
Northwestern University | Peter Dinda | PI
Oak Ridge National Laboratory | David Bernholdt | PI
Oak Ridge National Laboratory | Arthur B. Maccabe | Chief Scientist
Sandia National Laboratories | Ron Brightwell | Coordinating PI
University of Arizona | David Lowenthal | PI
University of California – Berkeley | Eric Brewer | PI
University of New Mexico | Patrick Bridges | PI
University of Pittsburgh | Jack Lange | PI
Project Goals
§ Deliver prototype OS/R environment for R&D in extreme-scale scientific computing
§ Focus on application composition as a fundamental driver
  § Develop necessary OS/R interfaces and system services required to support resource isolation and sharing
  § Support complex simulation and analysis workflows
§ Provide a lightweight OS/R environment with flexibility to build custom runtimes
§ Compose applications from a collection of enclaves
§ Leverage Kitten lightweight kernel and Palacios lightweight virtual machine monitor
§ Enable high-risk, high-impact research in virtualization, energy/power, scheduling, and resilience
Kitten Lightweight Kernel
§ Monolithic, C code, GNU toolchain, Kbuild configuration
§ Supports x86-64 and ARM-64
§ Boots on standard PC architecture, Cray XT, and in virtual machines
§ Boots identically to Linux (Kitten bzImage and init_task)
§ Repurposes basic functionality from Linux
  § Hardware bootstrap
  § Basic OS kernel primitives (lists, locks, wait queues, etc.)
  § Directory structure similar to Linux, arch dependent/independent dirs
§ Custom address space management and task management
  § User-level API for managing physical memory, building virtual address spaces
  § User-level API for creating tasks, which run in virtual address spaces (see the sketch after this list)
§ Small, highly reliable code base
  § Focused on scalable HPC applications
  § Low noise
  § Small memory footprint
§ Open source and freely available
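To make the shape of these user-level interfaces concrete, here is a minimal sketch of launching a task through a Kitten-style API. The lwk_* names and signatures are simplified illustrations assumed for this example, not Kitten's actual liblwk interface.

```c
/* Hypothetical sketch of launching a task through a Kitten-style
 * user-level API. The lwk_* names and signatures are simplified
 * illustrations, not Kitten's actual liblwk interface. */
#include <stddef.h>
#include <stdint.h>

typedef int aspace_id_t;
typedef int task_id_t;

/* Assumed primitives: create an address space, map physical memory
 * into it at a caller-chosen virtual address, start a task in it. */
extern int lwk_aspace_create(const char *name, aspace_id_t *id);
extern int lwk_aspace_map(aspace_id_t id, uintptr_t vaddr,
                          uintptr_t paddr, size_t len, int prot);
extern int lwk_task_create(aspace_id_t id, uintptr_t entry,
                           uintptr_t stack_top, task_id_t *tid);

#define APP_BASE  0x400000UL     /* where the app image is mapped    */
#define APP_SIZE  (16UL << 20)   /* 16 MB of pre-reserved phys pages */

int launch(uintptr_t entry, uintptr_t phys_base)
{
    aspace_id_t as;
    task_id_t   task;

    if (lwk_aspace_create("app", &as) != 0)
        return -1;
    /* In a lightweight kernel, user level controls physical memory
     * directly, so the mapping is explicit rather than demand-paged. */
    if (lwk_aspace_map(as, APP_BASE, phys_base, APP_SIZE, /*rwx*/ 7) != 0)
        return -1;
    return lwk_task_create(as, entry, APP_BASE + APP_SIZE, &task);
}
```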
Palacios Virtual Machine Monitor
§ OS-independent embeddable virtual machine monitor
§ Can be combined with Kitten or Linux
§ Full system virtualization
  § No need to modify guest OS
§ Supports running multiple guests concurrently
§ Makes extensive use of virtualization extensions in modern Intel and AMD x86 processors
§ Passthrough resource partitioning
§ Extensive configurability
§ Low noise
§ Open source and freely available
§ Small, highly reliable code base
Systems Are Converging to Reduce Data Movement
§ External parallel file system is being subsumed
  § Near-term capability systems using NVRAM-based burst buffers
  § Future extreme-scale systems will continue to exploit persistent memory technologies
§ In-situ and in-transit approaches for visualization and analysis
  § Can't afford to move data to separate systems for processing
  § GPUs and many-core processors are ideal for visualization and some analysis functions
§ Less differentiation between advanced technology and commodity technology systems
  § On-chip integration of processing, memory, and network
  § Summit/Sierra using InfiniBand
[Figure: exascale system alongside capability system, analytics cluster, visualization cluster, capacity cluster, and parallel file system.]
Applications and Usage Models are Diverging
§ Application composition becoming more important
  § Ensemble calculations for uncertainty quantification
  § Multi-{material, physics, scale} simulations
  § In-situ analysis and graph analytics
  § Performance and correctness analysis tools
§ Applications may be composed of multiple programming models
§ More complex workflows are driving need for advanced OS services and capability
  § "Workflow" has overtaken "Co-Design" as the most popular DOE buzzword :-)
§ Desire to support "Big Data" applications
  § Significant software stack comes along with this
§ Support for more interactive workloads
§ Requirements are independent of programming model and hardware
Multiphysics Example
[Three figure-only slides illustrating a multiphysics composition example.]
ENCLAVE COMPOSITION
A Deeper Look at Composition

Intra-Node Composition
§ Components co-located on same set of nodes
§ Isolate NOS environments on each node
§ Composition (coupling) takes place via shared memory

Inter-Node Composition
§ Components deployed to separate sets of nodes
§ Composition (coupling) takes place via network
Composition of Enclaves
§ An enclave provides a single OS/R environment to the application
§ Hobbes approach is to provide the minimum "amount" of OS/R required by the application (do what is necessary and get out of the way!)
§ Modern, complex applications are increasingly created by assembling (often substantial) software components
  § E.g., analytics connected to applications, code coupling, application frameworks, …
§ Components may have distinct requirements for OS/R support
§ Two options:
  § Assemble an all-in-one OS/R stack that satisfies all component needs
    § Potential challenges at both OS and RTS levels
    § Requires integration work for every combination supported
  § Provide each component the OS/R it needs, and provide efficient, low-level mechanisms to connect the components (and the OS/Rs)
Composition Examples
§ SNAP + Analytics
  § "SNAP calculates synonymous and non-synonymous substitution rates based on a set of codon-aligned nucleotide sequences." (HIV related)
  § Proxy app from LANL used for example
§ GTC-P + Analytics
  § Fusion simulation testing/proxy app used to test new hardware and algorithm integration into the PIC model. (PPPL)
  § Analytics generate statistics on particles (histograms), sorts, and filters on bounding boxes
§ LAMMPS + Analytics
  § Full, production molecular dynamics application from Sandia
  § Analytics look for crack formation by calculating atomic spacing in output data, to change the simulation from coarse- to fine-grained.
Composition Approach
§ ADIOS or TCASM as user-facing data exchange API
§ XEMEM (cross-enclave shared memory) as low-level transport (a usage sketch follows the figure below)
  § XEMEM extends XPMEM with a global name service
  § Cross-enclave signaling under development
§ Kitten lightweight kernel
§ Palacios high-performance virtual machine monitor
§ Pisces co-kernel architecture
§ Composition demonstrated with different OS combinations
  § Cray Compute Node Linux (CNL) and stock Linux
  § CNL and Kitten
  § Kitten and Kitten
[Figure: intra-node composition example. A simulation application running on Linux and an analytics application running on a Kitten co-kernel (Pisces) share the same hardware; the components are coupled through ADIOS/TCASM over XEMEM within the Hobbes runtime.]
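As a usage sketch of the transport described above: XPMEM's make/get/attach model is real, but the xemem_* names, the name-service calls, and the flag value below are assumptions based on the description of XEMEM as XPMEM plus a global name service, not a verified API.

```c
/* Sketch of cross-enclave sharing in the XPMEM/XEMEM style. The
 * xemem_* declarations below are assumptions modeled on XPMEM's
 * xpmem_make/xpmem_get/xpmem_attach, extended with the global name
 * service the slide describes; not a verified XEMEM API. */
#include <stddef.h>
#include <string.h>

typedef long xemem_segid_t;
typedef long xemem_apid_t;
struct xemem_addr { xemem_apid_t apid; long offset; };

/* Assumed primitives: export a region under a global name, look it
 * up from another enclave, and map it into the local enclave. */
extern xemem_segid_t xemem_make(void *vaddr, size_t size, const char *name);
extern xemem_segid_t xemem_search(const char *name);
extern xemem_apid_t  xemem_get(xemem_segid_t segid, int flags);
extern void         *xemem_attach(struct xemem_addr addr, size_t size, void *at);

/* Producer enclave: export a buffer under a well-known name. */
static char buf[4096];
void producer(void)
{
    xemem_make(buf, sizeof(buf), "sim.output"); /* register with name service */
    strcpy(buf, "coupled data");                /* now visible to peers */
}

/* Consumer enclave: discover the segment by name and map it. */
const char *consumer(void)
{
    struct xemem_addr a = { .offset = 0 };
    xemem_segid_t seg = xemem_search("sim.output"); /* global name lookup */
    a.apid = xemem_get(seg, /* assumed RDWR flag */ 1);
    return xemem_attach(a, sizeof(buf), NULL);      /* map into this enclave */
}
```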
Composition Working Group Current Focus Areas
§ Composition use cases
  § Identifying use case categories and specific examples
  § Analyzing coupling requirements of specific use cases
  § Identifying abstractions necessary for user- and system-level composition
§ "Hobbes composition language"
  § How to describe a composite application and its mapping onto system resources ("job control language" for composite applications)
§ Gathering and understanding "related work"
  § E.g., workflow languages, cloud configuration/deployment tools, CCA, ADIOS & TCASM APIs, constraint programming languages, …
Composition Working Group Goals
§ Working towards three basic API specifications…
§ Application-level composition API
  § Basic user-level abstractions for cross-enclave data sharing
  § May be several distinct (sets of) abstractions
  § Not necessarily implemented by/in Hobbes (e.g., ADIOS, TCASM, …)
§ System-level composition API
  § Used to write cross-enclave "transports" to be used by application-level composition libraries
  § Cross-enclave shared memory (example), cross-enclave signaling, node-level name service
§ Node-level enclave management API
  § Set and query configuration of enclaves on node
  § Resource allocation, VMM & OS, …
Enabling Multi-OS/R Stack Application Composition
In-situ Simulation + Analytics composition in single Linux OS vs. Multiple Enclaves
• Problem
  – HPC applications evolving to more compositional approach; overall application is a composition of coupled simulation, analysis, and tool components
  – Each component may have different OS/R requirements; no "one-size-fits-all" OS/R stack
• Solution
  – Partition node-level resources into "enclaves", run a different OS/R instance in each enclave
    (Pisces Co-kernel Architecture: http://www.prognosticlab.org/pisces/)
  – Provide tools for creating and managing enclaves, launching applications into enclaves
    (Leviathan Node Manager: http://www.prognosticlab.org/leviathan/)
  – Provide mechanisms for cross-enclave application composition and synchronization
    (XEMEM Shared Memory: http://www.prognosticlab.org/xemem/)
• Recent results
  – Demonstrated Multi-OS/R approach provides excellent performance isolation; better-than-native performance possible
  – Demonstrated drop-in compatibility with both commodity and Cray Linux environments
• Impact
  – Application components with differing OS/R requirements can be composed together efficiently within a compute node, minimizing off-node data movement
  – Compatible with unmodified vendor-provided OS/R environments, simplifying deployment
• Problem
  – HPC applications increasingly comprise multiple distinct components with different requirements for OS, software stack, and system resources
  – E.g., simulation+analytics, coupled multiphysics, scalable performance analysis and debugging
• Solution
  – Instantiate "enclaves" for each application component using high-performance virtualization technology
  – Provide OS and software stack tailored for the application component within each enclave
  – Provide mechanisms for controlled interaction between enclaves (components)
    – Selective sharing of memory regions (data exchange)
    – Name service (discovery and rendezvous)
• Recent results
  – Proof-of-principle for XEMEM cross-enclave memory API
  – Use XEMEM as "transport" in ADIOS, TCASM coupling tools
  – Demonstrate composite simulation+analytics applications using XEMEM
• Impact
  – Composition can be made transparent at the application level (no changes, performance neutral)
  – Allows detailed resource management and scheduling among enclaves (other Hobbes R&D areas)
System-Level Support for Composition of Applications
NODE VIRTUALIZATION LAYER
Recent NVL Accomplishments
§ Demonstrated Node Virtualization Layer (NVL) functionality
  § Improved inter-enclave isolation (HPDC'15)
  § High-performance inter-enclave memory sharing (HPDC'15)
  § Cross-enclave code coupling (ROSS'15)
§ Hybrid VMM basic functionality
  § Booting on all cores of Knights Corner Phi (228 cores)
  § Demonstrated Hybrid VMM "boot" faster than native Linux fork/exec
§ NVL next steps in development
  § Libhobbes – library combining NVL client functionality under "one roof"
  § Multi-node support – NVL driver for Cray Gemini network
  § Integration of TCASM with XPMEM to enable another form of cross-enclave code coupling
NVL Architecture
[Figure: Pisces partitions the hardware between Linux and two Kitten co-kernels: Kitten co-kernel (1) running an isolated application, and Kitten co-kernel (2) running applications + virtual machines in an isolated virtual machine under the Palacios VMM.]
§ Co-kernel architecture: multiple OS kernels run side-by-side on the same node in different enclaves
§ Pisces infrastructure used to launch and manage enclaves and bind enclaves together
§ XEMEM mechanism developed to enable cross-enclave memory sharing
[Figure: co-kernel architecture. Two hardware partitions, one running Linux and one running the Kitten co-kernel, each with user and kernel contexts. Control processes in each partition exchange cross-kernel messages over a shared-memory control channel, while Linux-compatible workloads and isolated processes + virtual machines communicate over shared-memory communication channels.]
1. Co-Kernel Architecture, Three Enclave Example
2. Cross-enclave communication used for enclave control and for cross-enclave app code coupling
NVL Provides Excellent Inter-Enclave Isolation
[Figure panels: "Native Linux, Single Linux Image" (top) vs. "Kitten Enclave with Same Competing Workload" (bottom)]
§ Co-kernel architecture nearly eliminates OS-induced interference
§ When using a single Linux OS (top), a competing workload induces noise on other processes, even when they are pinned to disjoint cores and memory
§ Isolating processes in a separate Kitten enclave (bottom) eliminates this interference
Excellent Isolation of NVL/Co-Kernel Arch Leads to Increased Performance and Reduced Run-to-run Variability
Comparison of Mini-app/Benchmark Performance With and Without a Competing Background Workload (Kernel Compile)
Hybrid VMM: Overview
§ Terminology
  § Hybrid Run-Time (HRT) is a run-time + application that run entirely at kernel level
  § Regular Operating System (ROS) is a full-blown OS stack (e.g., Linux)
  § Hybrid Virtual Machine (HVM) allows a single VM to contain both a ROS and an HRT simultaneously, giving each a distinct view of (and access to) the resources, as well as distinct interfaces to the VMM
§ We have developed the core framework for enabling HRTs (with and without HVM) as well as an initial implementation of the HVM
Hybrid VMM: Model
[Figure: (a) Current model: parallel app and parallel run-time in user mode on a general kernel over node hardware. (b) Hybrid run-time model: parallel app and run-time fused into a kernel-mode Hybrid Run-time (HRT) directly on node hardware. (c) Hybrid run-time model within a Hybrid Virtual Machine: a general kernel stack (legacy path, general virtualization model) and an HRT (performance path, specialized virtualization model) share one VM.]
Hybrid VMM: Nautilus Kernel Framework
§ A new kernel developed specifically to enable creation and porting of HRTs
  § Kyle Hale's thesis work
§ Currently running on:
  § x64 bare metal (64 cores)
  § Intel Xeon Phi (3120A, 228 cores)
  § Palacios HVM
§ With ports of:
  § Legion run-time and circuit simulation application
  § NESL run-time (VCODE engine)
  § NDPC (home-grown nested data parallel language)
§ ~24K SLOC (C, C++, assembly)
§ <1K SLOC needed to enable RT ports
Hybrid VMM: Nautilus Initial Performance (x64)
[Figure 4: Average (a), minimum (b), and maximum (c) time in cycles to create 2-64 threads in sequence, Nautilus vs. Linux. Average (d), minimum (e), and maximum (f) speedup of Nautilus over Linux for multiple thread creations.]
single nodes as core counts continue to scale up. The introduction of variance by OS noise (not just by asynchronous paging events) not only limits the performance and predictability of existing run-times, but also limits the kinds of run-times that can take advantage of the machine. For example, run-times that need tasks to execute in synchrony (e.g., in order to support a bulk-synchronous parallel application or a run-time that uses an abstract vector model) will experience serious degradation if OS noise comes into play.

The use of a single unified address space also allows very fast communication between threads, and eliminates much of the overhead of context switches when Nautilus boots with preemption enabled. The only preemption is between kernel threads, so no page table switch ever occurs. This is especially useful when Nautilus runs virtualized, as a large portion of VM exits come from paging-related faults and dynamic mappings initiated by the OS, particularly using shadow paging. A shadow-paged Nautilus exhibits the minimum possible shadow page faults, and shadow paging can be more efficient than nested paging, except when shadow page faults are common.
Events. Events are a common abstraction that run-time systems often use to distribute work to execution units, or workers. The Legion run-time makes heavy use of them, so we wanted to make sure that Nautilus provided an efficient implementation of them. In Legion, the events are used to notify logical processors (Legion threads) when there are tasks ready to execute. To help show the potential of Legion + Nautilus as an HRT, we measured the performance of these "wakeup" events.

Figure 6 shows the average latency between an event notification and the subsequent wakeup. Here, we had a single thread on one core go to sleep and wait for an event notification from a thread running on the adjacent physical core. The latency is measured in cycles and the average is taken over 100 runs. The first box on the left shows the latency of a common mechanism used in Linux for event notification, the pthread implementation of condition variables.

[Figure 6: Average event wakeup latency in cycles for Linux pthread condition variables vs. three Nautilus mechanisms (N. MWAIT, N. condvar, N. w/kick); annotations note that MONITOR/MWAIT is not available in user space and that user-space overhead is too high.]

In this case, the measurement is the time between calling pthread_cond_signal and the subsequent wakeup from pthread_cond_wait. A wakeup takes about 25000 cycles. The following three boxes show various Nautilus implementations of event notification. "N. MWAIT" shows the latency when using the newer MONITOR/MWAIT extensions provided by modern processors. These instructions allow one thread to go to sleep on a range of memory, waiting for a write to that memory by another thread. Note that the MONITOR/MWAIT instructions are not available in user space. While the latency improves considerably over pthread's condition variables in Linux user space, we suspect it is limited by the hardware latency incurred when waking up from a sleep state. The MONITOR/MWAIT extensions provide optional hints to enter lower sleep states and allow faster wakeups, but our machine does not support this feature. The final two boxes ("N. condvar" and "N. w/kick") show the Nautilus implementation of condition variables. They are very lightweight, and a signal will essentially just enqueue the waiting thread on a processor's run queue.
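The MONITOR/MWAIT mechanism behind the "N. MWAIT" result can be sketched as follows. This is a generic kernel-mode illustration of the technique, not Nautilus's actual implementation.

```c
/* Minimal sketch of kernel-level event wait using MONITOR/MWAIT,
 * the mechanism behind the "N. MWAIT" measurement above. Generic
 * x86-64, kernel mode (the instructions fault in user space);
 * illustrative only, not the Nautilus code. */
#include <stdint.h>

static volatile uint64_t event_flag;  /* waiter sleeps on this cache line */

/* Waiter: arm the monitor on the flag's cache line, then mwait until
 * another core writes to it. Re-checking after MONITOR avoids a lost
 * wakeup; the loop guards against spurious wakeups. */
static void wait_for_event(void)
{
    while (!event_flag) {
        __asm__ volatile("monitor"
                         :: "a"(&event_flag), "c"(0UL), "d"(0UL));
        if (event_flag)      /* write landed between check and monitor */
            break;
        __asm__ volatile("mwait" :: "a"(0UL), "c"(0UL));
    }
}

/* Notifier (on another core): a plain store wakes the monitoring
 * core; no interrupt, IPI, or kernel/user transition is required. */
static void signal_event(void)
{
    event_flag = 1;
}
```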
§ Microbenchmarks
  § Thread creation 11x faster than Linux/pthreads
    § Including thread fork capability
  § Event wakeup 5x faster than Linux/pthreads (right)
§ Macrobenchmarks
  § Legion circuit simulation benchmark already 5% faster at 64 cores
§ Both due to leveraging kernel-mode-only features and avoiding all kernel/user transitions

Latency of Waking up a Thread on a Remote Core
Hybrid VMM: HVM in Palacios
§ Partitions VM's cores into ROS cores and HRT cores
§ HRT core view
  § Specialized fast bootstrap of HRT ELF (in microseconds)
    § No BIOS, etc.
    § Immediate startup in preconfigured long-mode environment (co-designed with Nautilus)
    § Extended Multiboot2 model
  § Environment leads to few VM exits during typical bootup/execution
    § Possible prebuilt VM structures (e.g., page tables) and selective direct hardware access further reduce VM exits
    § Not a target of typical interrupts, etc.
  § All memory of VM accessible
  § All APICs of VM accessible
  § Separately rebootable from ROS cores
§ ROS core view
  § Traditional VM's view (BIOS boot, ACPI, devices, etc.), just smaller
  § Not all memory of VM accessible/visible (HRT can hide memory)
  § Not all APICs accessible (HRT cores' APICs cannot be targeted)
§ ~2K SLOC additions to Palacios (so far)
Hybrid VMM: HVM in Palacios
§ Initial bootstrap timings show HRT core reboot time is similar to fork() and exec()
Item | Cycles (exits) | Time (AMD 4122)
HRT core boot of Nautilus to main() | ~135K (7 exits) | 61 µs
Linux fork() | ~320K | 145 µs
Linux exec() | ~1M | 476 µs
Linux fork()+exec() | ~1.5M | 714 µs
HRT core boot of Nautilus to idle thread | ~37M (~2300 exits) | 17 ms
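For context on how cycle counts like those above can be collected, here is a generic RDTSC-based measurement sketch, timing fork() on Linux. The actual Hobbes/Palacios instrumentation is not described in these slides, so this is only illustrative.

```c
/* Generic sketch: timing an operation in cycles with RDTSC, the kind
 * of measurement behind tables like the one above. Not the actual
 * Hobbes/Palacios instrumentation. x86-64, GCC/Clang, Linux. */
#include <stdint.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t start = rdtsc();
    pid_t pid = fork();            /* the operation under test */
    if (pid == 0)
        _exit(0);                  /* child exits immediately */
    waitpid(pid, NULL, 0);
    /* Note: this includes reaping the child, not just fork() itself. */
    printf("fork()+reap: ~%llu cycles\n",
           (unsigned long long)(rdtsc() - start));
    return 0;
}
```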
Legion on Kitten
§ Successfully integrated a shared-memory version of the Legion runtime into the Kitten environment with minimal effort
§ Successfully ran LGNCG, a Legion port of the HPCG benchmark code, on Kitten (single node)
§ Distributed LGNCG with Legion on the Kitten environment to Peter Dinda for his Legion investigation
GLOBAL INFORMATION BUS
GIB Activities
§ GIB "design-a-thon" meeting 3/18/15, hosted by Matthew Wolf at GaTech
  § Representation from both ARGO and Hobbes
  § Status updates from project participants, including:
    § BEACON over EVPath
    § LAMMPS composition use case from Hobbes ROSS paper
§ Discussion of potential GIB data store foundations
  § Sandia's Kelpie, several open source packages, and Proactive Data Store (PDS)/Drift
  § Reports about preliminary experience and experimentation with some open source packages, e.g., Redis and MongoDB
  § Decided to pursue design using PDS as foundation
§ Detailed discussion of GIB use cases
  § Boot
  § System monitoring
  § Composed application (launch, normal operation, response to failures/faults, graceful shutdown)
GIB Activities (II)
§ Post-meeting actions
  § Created repositories for meeting documents, data store source code/documentation
  § [ongoing] Converting whiteboard snapshots describing use case discussions into electronic documents for dissemination and review
§ ARGO/Hobbes telecon 4/13/15
  § Outbrief for people who weren't in attendance at GaTech meeting
  § More discussion regarding data store interface and implementation
    § Suggestion to consider Riak, leading to some exploration with Riak open source version
    § Questions about PDS and its integration into EVPath
§ Ongoing activities
  § Design and preliminary implementation of data store
    § Exploration of using Riak
    § Exploration of integrating PDS with BEACON
  § Development of GIB use case document
Support for extreme-scale OS/R monitoring and control
• Problem
  – Operating system/runtime (OS/R) components running throughout the system must be monitored and controlled, but extreme system scale makes it difficult to do so (too much data, and/or too many "hops" to get data from one part of the system to another)
• Solution
  – Integrate scalable, distributed data store with publish and subscribe service in a Global Information Bus (GIB)
  – Interface with Hobbes Leviathan node-level resource manager
• Recent progress
  – Defined important GIB use cases
    – System boot
    – Launch application
    – Respond to application failure
    – Respond to application termination
  – Designed and began pilot implementation integrating a distributed data store based on the Riak open source database, the BEACON publish-subscribe software from the ARGO project, and Leviathan
• Impact
  – Supports monitoring and control of a large number of system software components without excessive application intrusion
  – Usable by both Hobbes and ARGO projects
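To make the monitoring-and-control flow concrete, the sketch below shows the shape of a publish/subscribe interaction a GIB client might perform. The gib_* functions and topic strings are hypothetical illustrations invented for this sketch; they are not the BEACON, Riak, or Leviathan APIs.

```c
/* Hypothetical sketch of a GIB-style monitoring client: publish node
 * health into the distributed data store and subscribe to enclave
 * events. The gib_* API and topic strings are invented for
 * illustration; not BEACON, Riak, or Leviathan interfaces. */
#include <stdio.h>

typedef void (*gib_cb_t)(const char *topic, const char *value);

/* Assumed primitives: durable put into the data store, plus
 * publish/subscribe notifications layered on top of it. */
extern int gib_put(const char *key, const char *value);
extern int gib_publish(const char *topic, const char *value);
extern int gib_subscribe(const char *topic, gib_cb_t cb);

static void on_enclave_event(const char *topic, const char *value)
{
    /* e.g., Leviathan reporting an application failure in an enclave */
    printf("event on %s: %s\n", topic, value);
}

int main(void)
{
    /* Record state durably, then notify any subscribers. */
    gib_put("/node/42/health", "ok");
    gib_publish("/node/42/health", "ok");

    /* Watch for application lifecycle events on this node; a real
     * client would then enter an event loop. */
    gib_subscribe("/node/42/enclave/*", on_enclave_event);
    return 0;
}
```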
[Figure: GIB data store and publish/subscribe components. Each node runs Leviathan, application(s), and BEACON; BEACON instances connect the nodes to a set of replicated data stores. Dashed lines indicate potential notifications from publishers to subscribers.]
RESILIENCE
Hobbes Resilience Effort
§ OS/R data structure resilience
  § Hashes, redundant data, self checks, and self repairs
  § HR/HI component: rejuvenation/migration
§ OS/R resilience building blocks (reuse existing solutions)
  § Membership management protocols with different consistency
  § Persistent state management (resilient distributed key/value stores)
  § Reliable/unreliable publish/subscribe event notification APIs
§ Tunable resilience and cross-layer/-enclave coordination
  § Cost model of different cross-layer/-enclave resilience choices
  § Inter-enclave workflow coordination that annotates data
  § HR/HI component: interfaces for autonomic management
§ OS/R support for fault sensitivity and coverage analysis
  § HR/HI component: support for fault injection and controlled experiments
Hobbes Resilience Update
§ OS/R data structure resilience
  § Tracking memory allocations within the Linux OS to record supplemental location info on generic allocations, including the requesting module and source file/line
  § Protecting a basic slab memory allocator from corruption using data corruption detection and correction based on already-existing pointer redundancy (see the sketch after this list)
  § Transparently store resilience metadata alongside dynamically allocated OS data to provide data structure resilience on an as-needed basis
§ OS/R support for fault sensitivity and coverage analysis
  § Review of the current status of the Linux fault injection support revealed great potential for extension to permit injection of errors in specific Hobbes OS subsystems
  § Initial work focused on injecting fatal failures in the guest OS to identify the isolation capabilities of the virtualization environments KVM, QEMU, and Palacios
§ Testbed system for Hobbes resilience and other Hobbes efforts
  § Deployed Kitten and Palacios (with Linux as host OS) on bare hardware and on QEMU on a 960-core computer science research cluster at ORNL
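As referenced in the list above, here is a minimal sketch of detection and repair using the pointer redundancy already present in a circular doubly linked list. It illustrates the general technique, not the project's slab-allocator implementation.

```c
/* Sketch of data-structure self-check/self-repair using pointer
 * redundancy already present in a circular doubly linked list:
 * node->next->prev must point back at node, so a single corrupted
 * back link can be detected and restored from the forward link.
 * Illustrative of the technique, not the project's slab allocator. */
struct list_node {
    struct list_node *next, *prev;
};

/* Walk at most max_nodes links (bounds the walk in case corruption
 * breaks circularity). Returns the number of links repaired, or -1
 * if the damage is ambiguous. */
int check_and_repair(struct list_node *head, int max_nodes)
{
    int repaired = 0;
    struct list_node *n = head;
    for (int i = 0; i < max_nodes && n->next != head; i++, n = n->next) {
        if (n->next->prev != n) {
            /* If the forward chain beyond n->next is still consistent,
             * the back link is the corrupt one; restore it from 'n'. */
            if (n->next->next && n->next->next->prev == n->next) {
                n->next->prev = n;
                repaired++;
            } else {
                return -1;  /* conflicting evidence: flag, don't guess */
            }
        }
    }
    return repaired;
}
```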
mini-ckpts: Surviving OS Failures in Persistent Memory
• Problem
  – A failure of the operating system (OS) causes a failure of an otherwise healthy HPC application
• Solution
  – Execute the application in persistent memory (PRAMFS in DRAM) that is able to survive OS failures and reboots
  – Track OS state used by the application and MPI for recovery
  – Rejuvenate (warm reboot) the OS in case of a failure
  – Restore tracked OS state used by the application and MPI
  – Transparently continue to execute the application in persistent memory without loss of state/progress
• Recent results
  – Prototype implementation supports OpenMP and MPI applications with certain limitations
  – OS rejuvenation and recovery takes 3-6 seconds
  – Failure-free runtime overhead is 3-5% for a number of key HPC workloads
• Impact
  – First solution that transparently offers OS failure tolerance without loss of state/progress
  – Transparently handling OS failures locally reduces the need for global checkpoint/restart
  – Latent OS errors that have not resulted in a failure can be cleared by rejuvenating the OS
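The persistent-memory idea in the solution above can be sketched as follows. The PRAMFS mount point and file name are assumptions for illustration, and the real mini-ckpts system additionally tracks and restores kernel and MPI state.

```c
/* Sketch of the persistent-memory idea behind mini-ckpts: back the
 * application's state with a file in a RAM-resident persistent
 * filesystem (PRAMFS) so the pages survive a warm OS reboot. The
 * mount point /mnt/pramfs and file name are assumptions; the real
 * mini-ckpts system also tracks and restores kernel/MPI state. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define STATE_SIZE (1 << 20)

int main(void)
{
    int fd = open("/mnt/pramfs/app_state", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, STATE_SIZE) != 0)
        return 1;

    /* MAP_SHARED: stores go to the PRAMFS-backed pages, which live in
     * DRAM that a warm-rebooted kernel leaves untouched. */
    char *state = mmap(NULL, STATE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (state == MAP_FAILED)
        return 1;

    if (state[0] == '\0')
        strcpy(state, "iteration 0");   /* first run: initialize */
    printf("recovered state: %s\n", state);

    munmap(state, STATE_SIZE);
    close(fd);
    return 0;
}
```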
[Figures: runtime in seconds of NPB CG, LU, EP, and IS, PENNANT, and CloverLeaf under Open MPI, librlmpi, librlmpi with PRAMFS remap, and librlmpi with mini-ckpts & PRAMFS; and additional runtime in seconds versus number of kernel panic injections (0-4) for CG, IS, and LU, with same and alternating injection targets.]
http://xstack.sandia.gov/hobbes