Nautilus: A Case for Transforming Parallel Runtimes into Operating System Kernels

Kyle C. Hale and Peter Dinda | {k-hale,pdinda}@northwestern.edu

We present the Hybrid Runtime (HRT), a transformation of a traditional parallel runtime into a specialized operating system kernel, and the Hybrid Virtual Machine (HVM), which makes it possible to create VMs that are internally partitioned between a “regular OS” and an HRT. HRTs enjoy unfettered access to the hardware and determine their own abstractions to that hardware. There are no longer hidden avenues that allow the OS to get in the way. We introduce Nautilus, a minimal OS kernel framework for creating these HRTs. Nautilus runs both on commodity x86_64 hardware and on the Xeon Phi coprocessor. We evaluate the Nautilus framework itself and describe two runtimes (NESL and Legion) that we transformed into HRTs and one (NDPC) that we built from scratch to support the HRT model.

Introducing Hybrid Runtimes

Nautilus Kernel Framework

•  HRTs can be created very quickly (we transformed all three runtimes into HRTs in ~4 months)

•  HRTs do not have to suffer from OS noise and general non-determinism: they support only the OS constructs that they need

•  With an HRT-based kernel framework like Nautilus, HRTs can start out with a simple, minimal skeleton for the execution environment, and complexity can be added by runtime developers

•  HRTs do not suffer from the “feature creep” that handicaps many general-purpose OSes

•  More complex functionality, especially Linux compatibility, can be facilitated with the HVM, as in the figure below

The power of HRTs and the HVM

[1] K. C. Hale and P. Dinda. A Case for Transforming Parallel Runtimes into Operating System Kernels. In Proceedings of the 24th International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC 2015).

[2] J. Lange, P. Dinda, K. Hale, and L. Xia. An Introduction to the Palacios Virtual Machine Monitor Release 1.3. Tech. Rep. NWU-EECS-11-10, Dept. of EECS, Northwestern Univ. (2011).

[3] D. Engler and M. Kaashoek. Exterminate all Operating System Abstractions. In Proceedings of the 5th Workshop on Hot Topics in Operating Systems (HotOS 1995).

References

This project is made possible by support from Sandia National Laboratories through the Hobbes Project, which is funded by the 2013 Exascale Operating and Runtime Systems Program under the Office of Advanced Scientific Computing Research in the DOE Office of Science. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Acknowledgements

•  Legion is a complicated parallel runtime that is gaining traction in the HPC community

•  It shares many of the same design goals as an OS: we received several complaints from the developers that OS abstractions were getting in the way of the hardware

•  We ported Legion to the HRT model on top of Nautilus and, with a very reasonable effort, observed promising results (below)

The Legion runtime as an HRT

•  Nautilus is designed to be simple, fast, and extensible

•  Left: Nautilus comprises a very thin layer on top of the hardware: the HRT has ultimate control over mechanism and policy

•  Nautilus is in many ways similar to a libOS and is heavily influenced by exokernel

[Figure: layers labeled Parallel Application, Runtime, Nautilus, and Hardware, with User Mode, Kernel Mode, HRT, and Kernel annotations; Nautilus components shown: Paging, Threads, Bootstrap, Timers, IRQs, Console, Topology, Sync., Events]

[Figure: three panels. (a) Current Model. (b) Hybrid Run-time Model. (c) Hybrid Run-time Model Within a Hybrid Virtual Machine (HVM). Box labels: Parallel App, Parallel Runtime, Hybrid Runtime (HRT), General Kernel, Node HW, User Mode, Kernel Mode, General Virtualization Model, Specialized Virtualization Model, Performance Path, Legacy Path]

HRT structure with Nautilus

NESL and NDPC as HRTs

•  We designed and built Nautilus, a kernel framework for creating hybrid runtimes (HRTs)

•  We showed how HRTs can be extended to interact with a legacy environment using the hybrid virtual machine (HVM)

•  We showed the feasibility and simplicity of creating HRTs using Nautilus, with three proofs-of-concept, one of which is a large and complex parallel runtime: Legion

•  We showed that the HRT model can produce promising performance just by virtue of the HRT’s structure and with modifications that leverage the direct access to hardware

Summary

[Figure: y-axis: Cycles (0 to 30000); bars: Linux, N. MWAIT, N. condvar, N. w/kick]

Specialized event wakeup (not possible in userspace)

Extremely light-weight threads in Nautilus

thread creation in Linux and Nautilus

•  Nautilus has a simple set of primitives that HRTs can adopt and extend for their own needs

•  Primitives such as thread creation can be simpler given their intended use in specialized environments

•  Nautilus has no notion of a process—only threads. The intent of this choice—and many others—is to expose a simple interface to HW and avoid the imposition of policy and cumbersome OS mechanisms
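For a rough sense of the userspace baseline that such thread primitives are compared against, the following is a minimal sketch that times thread creation with POSIX threads on Linux. It is not the poster's benchmark: the function names are hypothetical, it reports nanoseconds rather than cycles, and it times a `pthread_create()` plus `pthread_join()` pair, which may differ from what the authors measured.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdint.h>
#include <time.h>

/* Thread body: does nothing, so the measurement is dominated by
 * thread creation and teardown rather than by useful work. */
static void *empty_thread(void *arg)
{
    return arg;
}

/* Nanoseconds elapsed between two CLOCK_MONOTONIC samples. */
static int64_t elapsed_ns(const struct timespec *start,
                          const struct timespec *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
         + (int64_t)(end->tv_nsec - start->tv_nsec);
}

/* Hypothetical helper: mean cost in nanoseconds of one
 * pthread_create() + pthread_join() pair over `iterations` runs.
 * Returns -1 on invalid input or if a pthread call fails. */
int64_t mean_thread_create_join_ns(int iterations)
{
    struct timespec start, end;

    if (iterations <= 0)
        return -1;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++) {
        pthread_t tid;
        if (pthread_create(&tid, NULL, empty_thread, NULL) != 0)
            return -1;
        if (pthread_join(tid, NULL) != 0)
            return -1;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    return elapsed_ns(&start, &end) / iterations;
}
```

Calling `mean_thread_create_join_ns(1000)` from a small `main()` and printing the result gives a per-thread cost for the machine at hand; the number depends heavily on the kernel, the C library, and the hardware.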

Language        SLOC
C               22697
C++             133
x86 Assembly    428
Scripting       706

Language        SLOC
C++             133
C               636

Component   Language        SLOC
Compiler    Lisp            11005
Runtime     C               8853
Runtime     lex             230
Runtime     yacc            461

Component   Language        SLOC
Compiler    Perl            2000
Compiler    yacc            236
Compiler    lex             82
Runtime     C               2900
Runtime     C++             93
Runtime     x86 Assembly    477

complexity of the Nautilus kernel framework

Item                                        Cycles (and exits)     Time
HRT core boot of Nautilus to main()         ~135 K (7 exits)       61 µs
Linux fork()                                ~320 K                 145 µs
Linux exec()                                ~1 M                   476 µs
Linux fork() + exec()                       ~1.5 M                 714 µs
HRT core boot of Nautilus to idle thread    ~37 M (~2300 exits)    17 ms

booting Nautilus in an HVM + HRT setup quicker than a Linux exec()
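The cycle and time columns of the boot-time table are related by the clock rate, which the poster does not state. As a small worked sketch, the conversion can be written as below; the 2.2 GHz figure is an assumption inferred from the table's own cycle/time pairs, and the function and macro names are hypothetical.

```c
#include <stdio.h>

/* ASSUMPTION: nominal clock rate in GHz. The poster does not give it;
 * roughly 2.2 GHz is what the table's cycle/time pairs imply. */
#define ASSUMED_CLOCK_GHZ 2.2

/* Convert a cycle count to microseconds at the given clock rate.
 * 1 GHz corresponds to 1000 cycles per microsecond. */
double cycles_to_us(double cycles, double clock_ghz)
{
    return cycles / (clock_ghz * 1000.0);
}

/* Print the approximate cycle counts from the table together with
 * the times they imply under the assumed clock rate. */
void print_boot_comparison(void)
{
    static const struct {
        const char *item;
        double cycles;   /* rounded counts as listed in the table */
    } rows[] = {
        { "HRT core boot of Nautilus to main()",      135e3 },
        { "Linux fork()",                             320e3 },
        { "Linux exec()",                             1e6   },
        { "Linux fork() + exec()",                    1.5e6 },
        { "HRT core boot of Nautilus to idle thread", 37e6  },
    };

    for (size_t i = 0; i < sizeof rows / sizeof rows[0]; i++)
        printf("%-42s %12.0f cycles  %10.1f us\n", rows[i].item,
               rows[i].cycles,
               cycles_to_us(rows[i].cycles, ASSUMED_CLOCK_GHZ));
}
```

Under this assumption, ~135 K cycles comes to roughly 61 µs, in line with the first row. The rounded ~1 M and ~1.5 M entries land further from their listed times, which is expected since the table gives the cycle counts only approximately.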

code added to Nautilus to support Legion, NDPC, and NESL as HRTs

Left: by simply porting Legion to the HRT model and with very little optimization, we achieved parity with Linux—a heavily optimized and tuned OS

Right: to see how a very simple modification allowing direct HW access might affect performance, we turned off interrupts during the Legion task selection loop. The results show promise for exploring further, more significant optimizations tailored to direct hardware access and the HRT model

latency of event wakeups for several implementations
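To make the userspace side of this comparison concrete, here is a minimal sketch of measuring a condition-variable wakeup with POSIX threads. It is an illustration only, not the poster's measurement code: the names are hypothetical, the result is in nanoseconds rather than cycles, and the MWAIT and kick-based variants are not shown because they rely on kernel-mode facilities.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int waiting;               /* waiter has taken the lock */
static int signaled;              /* event has been raised */
static struct timespec woke_at;   /* when the waiter resumed */

/* Waiter: blocks on the condition variable and timestamps its wakeup. */
static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    waiting = 1;
    while (!signaled)
        pthread_cond_wait(&cond, &lock);
    clock_gettime(CLOCK_MONOTONIC, &woke_at);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Hypothetical helper: nanoseconds from signalling the event to the
 * waiter running again. Returns -1 if the thread cannot be created. */
int64_t condvar_wakeup_ns(void)
{
    pthread_t tid;
    struct timespec sent_at;

    waiting = 0;
    signaled = 0;
    if (pthread_create(&tid, NULL, waiter, NULL) != 0)
        return -1;

    /* Once `waiting` is seen under the lock, the waiter must already
     * be blocked inside pthread_cond_wait(), which released the lock. */
    for (;;) {
        pthread_mutex_lock(&lock);
        int ready = waiting;
        pthread_mutex_unlock(&lock);
        if (ready)
            break;
        sched_yield();
    }

    pthread_mutex_lock(&lock);
    signaled = 1;
    clock_gettime(CLOCK_MONOTONIC, &sent_at);
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);
    return (int64_t)(woke_at.tv_sec - sent_at.tv_sec) * 1000000000LL
         + (int64_t)(woke_at.tv_nsec - sent_at.tv_nsec);
}
```

A single call gives one noisy sample; repeating it many times and looking at the distribution is more informative than any individual value.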

lines of code for NDPC and NESL, respectively, as HRTs

[Figure: y-axis: Cycles (0 to 7000000); x-axis ticks: 2, 4, 8, 16, 32, 64; series: Nautilus, Linux (pthreads)]

•  NESL is a nested data-parallel language that allows programs with nested parallelism to be flattened for operation on vector machines

•  NDPC is an implementation of a subset of NESL that compiles to C++ and uses fork/join parallelism instead of NESL’s flattened parallelism

•  As a proof-of-concept, we ported both NESL and NDPC to the HRT model with Nautilus in a very reasonable amount of time

[Figure: y-axis: Execution Time (s), 0 to 6; x-axis: Legion Processor Count (Cores), ticks 1, 2, 4, 8, 16, 32, 64, 128, 200, 220; series: Nautilus, Linux]

[Figure: y-axis: Speedup over Linux, 0% to 25%; x-axis: Legion Processor Count (Cores), ticks 1, 2, 4, 8, 16, 32, 64, 128, 200, 220]
