Faculty of Computer Science, Institute for System Architecture, Operating Systems Group
SCALABILITY AND HETEROGENEITY
Nils Asmussen
Dresden, 11/29/2016
Layers
[Figure: layer diagram. Labels: Application; Runtime, Services, ...; OS Kernel; Core, Core; Mem, Mem; Coherency; Interconnect]
Scalability and Heterogeneity Slide 2 of 39
Commodity System with GPU
[Figure: system diagram. Labels: Application; Runtime; OS Kernel; Driver; Core, Core; CC; Mem; Interconnect; GPU; Mem; Compute Kernel; OpenCL; HLSL]
Current Trend
[Figure: system diagram. Labels: Application; RT; Kernel; Core; Mem; Acc; ??]
Why?
• More cores can (for some use cases) deliver more performance
• Specialization is the next step
• Cache coherency gets more expensive (performance, complexity and energy) with more (and heterogeneous) cores
Commodity Hardware
NUMA
Non-Uniform Memory Access
• Core-to-RAM distance differs
• Various interconnect topologies: bus, star, ring, mesh, ...
• The good: all memory can be directly addressed
• The bad: different access latencies
• Consider placement of data
NUMA Machine
Measuring NUMA effects on: [figure of the test machine not preserved]
Daniel Muller: Memory and Thread Management on NUMA Systems, Diploma Thesis, 2013
NUMA Effects
Operation      Access Time   NUMA Factor
read local     37.420s       1.000
read remote    53.223s       1.422
write local    23.555s       1.000
write remote   23.976s       1.018
Daniel Muller: Memory and Thread Management on NUMA Systems, Diploma Thesis, 2013
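The NUMA factor column is the access time divided by the local access time for the same operation. A minimal C++ sketch of that arithmetic, using the timings from the table; the function names `numa_factor` and `round3` are illustrative and not taken from the thesis:

```cpp
#include <cassert>
#include <cmath>

// NUMA factor: access time relative to the local access time for the
// same operation. `numa_factor` is an illustrative name, not from the thesis.
double numa_factor(double access_seconds, double local_seconds) {
    assert(local_seconds > 0.0);
    return access_seconds / local_seconds;
}

// Round to three decimals, the precision used in the table.
double round3(double x) {
    return std::round(x * 1000.0) / 1000.0;
}
```

For example, `round3(numa_factor(53.223, 37.420))` gives 1.422: remote reads took about 42% longer than local reads in this measurement, while remote writes cost under 2% extra.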
NUMA Mechanisms
Daniel Muller: Memory and Thread Management on NUMA Systems, Diploma Thesis, 2013
NUMA Policies
• fundamental options: migrate thread vs. migrate data
• use performance counters to decide
• dynamic management shows > 10% performance benefit compared to best static placement
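One way to picture the thread-versus-data decision is a heuristic fed by per-node access counts, such as might be derived from performance counters. The sketch below is illustrative only: the `AccessCounts` structure, the `kMaxPagesToMove` threshold, and the decision rule itself are assumptions, not the policy evaluated in the thesis.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

enum class Action { Stay, MigrateThread, MigrateData };

// Hypothetical per-thread sample, e.g. aggregated from performance counters.
struct AccessCounts {
    std::size_t current_node;              // node the thread runs on
    std::vector<std::uint64_t> per_node;   // memory accesses served by each node
    std::uint64_t working_set_pages;       // pages the thread touched
};

// Assumed tuning knob: above this many pages, moving the thread is
// considered cheaper than moving the data.
constexpr std::uint64_t kMaxPagesToMove = 1024;

Action decide(const AccessCounts& c) {
    assert(c.current_node < c.per_node.size());
    const std::uint64_t total =
        std::accumulate(c.per_node.begin(), c.per_node.end(), std::uint64_t{0});
    if (total == 0) return Action::Stay;

    // Node that serves most of the thread's accesses.
    const auto hottest = static_cast<std::size_t>(
        std::max_element(c.per_node.begin(), c.per_node.end()) - c.per_node.begin());

    // Mostly local already, or no node clearly dominates: leave things alone.
    if (hottest == c.current_node || c.per_node[hottest] * 2 <= total)
        return Action::Stay;

    // A remote node dominates: move whichever side is cheaper to move.
    return c.working_set_pages <= kMaxPagesToMove ? Action::MigrateData
                                                  : Action::MigrateThread;
}
```

A dynamic manager would re-run such a decision periodically as the counters change, which is what distinguishes it from a static placement chosen once at start-up.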
HPC
[Figure: HPC system. Labels: Core, Core; Mem, Mem; Interconnect; K, RT, App (twice); MPI]
Research Prototypes
Barrelfish
Andrew Baumann et al.: The Multikernel: A new OS architecture for scalable multicore systems, SOSP 2009
Barrelfish
• Concept: multikernel, implementation: Barrelfish
• Treat the machine as cores with a network
• “CPU driver” plus exokernel-ish structure
• No inter-core sharing at the lower levels
• Monitors coordinate system-wide state via replication and synchronization
Andrew Baumann et al.: The Multikernel: A new OS architecture for scalable multicore systems, SOSP 2009
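The replication idea can be illustrated with a toy model in which each core keeps a private replica of some system state and changes travel only as explicit messages. This is a simplified sketch under stated assumptions: the `Monitor`, `Update` and `broadcast` names are hypothetical, a single-threaded loop stands in for inter-core channels, and Barrelfish's actual agreement protocols are not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <vector>

// A state change, sent as a message instead of written to shared memory.
struct Update {
    std::string key;
    int value;
};

// Hypothetical per-core monitor: owns a private replica and an inbox.
class Monitor {
public:
    void deliver(const Update& u) { inbox_.push(u); }

    // Apply all pending updates to the local replica.
    void poll() {
        while (!inbox_.empty()) {
            const Update& u = inbox_.front();
            replica_[u.key] = u.value;
            inbox_.pop();
        }
    }

    // Returns true and fills `out` if the key is present in the replica.
    bool lookup(const std::string& key, int& out) const {
        auto it = replica_.find(key);
        if (it == replica_.end()) return false;
        out = it->second;
        return true;
    }

private:
    std::map<std::string, int> replica_;  // never accessed by other cores
    std::queue<Update> inbox_;            // stands in for a message channel
};

// Send the update to every monitor, including the sender's own.
void broadcast(std::vector<Monitor>& monitors, const Update& u) {
    assert(!monitors.empty());
    for (Monitor& m : monitors) m.deliver(u);
}
```

Until a monitor polls its inbox, its replica may lag behind the others; the point of the model is that consistency is reached by exchanging messages rather than by locking a shared data structure.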
Cosh
• Based on Barrelfish
• Introduces abstractions for non-CC systems
• Takes advantage of CC, if possible
• Otherwise, data transfers via, e.g., DMA units
• Used to implement OS services (net, fs, ...)
• Evaluated for Intel i7 CPU + Intel Knights Ferry
Andrew Baumann et al.: Cosh: clear OS data sharing in an incoherent world, TRIOS 2014
Barrelfish + Cosh
[Figure: system diagram. Labels: Core, Core; Mem, Mem; Interconnect; K, Mon (twice); Messages (between monitors); App, App; Messages (between apps); Acc; Mem]
Barrelfish Scalability
• Driven by scalability issues of shared kernel designs and cache coherence
• This might not be a pressing issue today
Andrew Baumann et al.: The Multikernel: A new OS architecture for scalable multicore systems, SOSP 2009
Factored Operating System
David Wentzlaff, Anant Agarwal: Factored Operating Systems (fos): The Case for a Scalable Operating System for Multicores, SIGOPS OSR 2009
Popcorn Linux
• Idea: multiple Linux kernels on one system
• Provide the illusion of a POSIX SMP system
• Kernels communicate to sync/exchange state
• Does not rely on global shared memory
• Distributed shared memory, if necessary
• Processes can migrate between kernels
Barbalace et al.: Popcorn: Bridging the Programmability Gap in Heterogeneous-ISA Platforms, EuroSys 2015
Popcorn Linux
Barbalace et al.: Popcorn: Bridging the Programmability Gap in Heterogeneous-ISA Platforms, EuroSys 2015
Popcorn Linux
[Figure: system diagram. Labels: Application; Runtime; K, K; Messages; Core, Core; Mem, Mem; Interconnect; Acc; Mem]
Helios
• Idea: heterogeneous ISA systems need some kind of compiler support
• ISA-specific kernels: “satellite kernels”
• Provide uniform OS abstractions
• Memory management, scheduling
• Bootstrap: first kernel becomes coordinator, boots other cores
Nightingale et al.: Helios: Heterogeneous Multiprocessing with Satellite Kernels, SOSP 2009
Helios
• Share-nothing, even on ccNUMA
• Processes cannot span across kernels
• Implementation based on Singularity
• Applications compiled into intermediate code
• 2nd stage compilation to native code of all available ISAs at install time
• Placement based on affinity hints
Nightingale et al.: Helios: Heterogeneous Multiprocessing with Satellite Kernels, SOSP 2009
Helios
Nightingale et al.: Helios: Heterogeneous Multiprocessing with Satellite Kernels, SOSP 2009
Helios
[Figure: system diagram. Labels: Application; Runtime; OS Kernel; Core, Core; CC; Mem; IC; Chan; Acc; Mem; K; RT; App]
Our Own Work
M3
Approach
[Figure, two build steps: eight compute units (Intel Xeon, Intel Xeon, ARM big, ARM LITTLE, DSP, DSP, Audio Decoder, FPGA), each with its own Mem and DTU. In the second step each unit is labelled PE; seven carry an App and one carries the Kernel.]
Data Transfer Unit
• Supports memory access and message passing
• Provides a number of endpoints
• Each endpoint can be configured for:
  1. Accessing memory (contiguous range, byte granular)
  2. Receiving messages into a ring buffer
  3. Sending messages to a receiving endpoint
• Configuration only by kernel, usage by application
• Direct reply on received messages
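The three endpoint kinds and the split between configuration and usage can be modelled in a few lines of software. This is a toy sketch, not the DTU's hardware interface: every type and method name is illustrative, send and receive endpoints live on the same modelled DTU for brevity (real ones would sit on different PEs), and the direct-reply feature is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Toy software model of DTU endpoints; all names are illustrative.
struct Unconfigured {};
struct MemoryEp  { std::size_t base, size; };            // contiguous, byte granular
struct ReceiveEp { std::size_t slots; std::deque<std::string> ring; };
struct SendEp    { std::size_t target_ep; };             // a receiving endpoint

using Endpoint = std::variant<Unconfigured, MemoryEp, ReceiveEp, SendEp>;

class Dtu {
public:
    Dtu(std::size_t num_eps, std::vector<std::uint8_t>& memory)
        : eps_(num_eps), mem_(memory) {
        assert(num_eps > 0);
    }

    // Configuration: in the real system only the kernel may do this.
    void configure(std::size_t ep, Endpoint cfg) { eps_.at(ep) = std::move(cfg); }

    // Usage: the application can only act through configured endpoints.
    bool read(std::size_t ep, std::size_t off, std::uint8_t& out) const {
        const auto* m = std::get_if<MemoryEp>(&eps_.at(ep));
        if (!m || off >= m->size || m->base + off >= mem_.size())
            return false;                          // outside the granted range
        out = mem_[m->base + off];
        return true;
    }

    bool send(std::size_t ep, const std::string& msg) {
        const auto* s = std::get_if<SendEp>(&eps_.at(ep));
        if (!s) return false;
        auto* r = std::get_if<ReceiveEp>(&eps_.at(s->target_ep));
        if (!r || r->ring.size() >= r->slots) return false;  // ring buffer full
        r->ring.push_back(msg);
        return true;
    }

    bool fetch(std::size_t ep, std::string& out) {
        auto* r = std::get_if<ReceiveEp>(&eps_.at(ep));
        if (!r || r->ring.empty()) return false;
        out = r->ring.front();
        r->ring.pop_front();
        return true;
    }

private:
    std::vector<Endpoint> eps_;
    std::vector<std::uint8_t>& mem_;
};
```

Because an unconfigured endpoint rejects every operation, an application in this model can reach only the memory ranges and receivers that were set up for it, which mirrors the "configuration only by kernel, usage by application" rule.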
M3
System Call
[Figure: system call between two PEs. Labels: ARM big; DSP; App; Kernel; Mem and DTU (one pair per PE); S; R]
M3
= L4 ±1
• Microkernel-based system for het. manycores
• Implemented from scratch
• Mechanisms for PEs, memory and communication
• Drivers, filesystems, . . . are implemented on top
• Kernel manages permissions, using capabilities
• DTU enforces permissions (communication, memory access)
• Kernel is independent of other cores in the system
Virtual PEs
• Creating VPE yields a VPE cap. and memory cap.
• Library provides primitives like fork and exec
Execute function on different PE
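// Create a virtual PE named "test"
VPE vpe("test");
// Start the lambda on the new VPE without blocking the caller
vpe.run_async([]() {
    Serial::get() << "Hello World!\n";
    return 0;
});
// Wait for the VPE to finish and collect its exit code
int exitcode = vpe.wait();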
Filesystem: m3fs
• FS service is implemented outside of kernel
• m3fs is (currently) an in-memory filesystem
[Figure, several build steps: m3fs, App and kernel on PE1, PE2 and PE3, plus a Mem holding the FS. The successive steps are labelled: open, obtain, obtain, read/write.]
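The step labels on this slide sequence (open, obtain, obtain, read/write) suggest that the application first opens a file at the service, then obtains access to the memory holding the file, and finally reads or writes that memory directly. The sketch below is one plausible reading of that flow, not the m3fs API: all types and names are illustrative, and the kernel's part in handing over capabilities is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy model; all types and names are illustrative, not the m3fs API.
struct MemCap { std::size_t offset, length; };   // grants access to a memory range

class FsService {
public:
    explicit FsService(std::vector<char>& memory) : mem_(memory) {}

    // Place a file's bytes in the memory and remember where they are.
    void add_file(const std::string& path, const std::string& data) {
        files_[path] = MemCap{mem_.size(), data.size()};
        mem_.insert(mem_.end(), data.begin(), data.end());
    }

    // "open": returns a file id, or -1 if the path is unknown.
    int open(const std::string& path) {
        auto it = files_.find(path);
        if (it == files_.end()) return -1;
        open_.push_back(it->second);
        return static_cast<int>(open_.size()) - 1;
    }

    // "obtain": hand out a capability for the file's memory range.
    bool obtain(int fd, MemCap& out) const {
        if (fd < 0 || static_cast<std::size_t>(fd) >= open_.size()) return false;
        out = open_[static_cast<std::size_t>(fd)];
        return true;
    }

private:
    std::vector<char>& mem_;
    std::map<std::string, MemCap> files_;
    std::vector<MemCap> open_;
};

// "read/write": the application accesses the memory directly, within the
// range of the capability, without involving the service.
std::string read_all(const std::vector<char>& mem, const MemCap& cap) {
    assert(cap.offset + cap.length <= mem.size());
    return std::string(mem.data() + cap.offset, cap.length);
}
```

The design point this illustrates is that the service is involved only in setting up access; the data path itself bypasses it.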
M3
[Figure: system diagram. Labels: Core, Core; Mem, Mem; Interconnect; K; Srv, Srv; Messages (between services); App, App; Messages (between apps); Acc; Mem]
Tomahawk
[Figure: Tomahawk chip. Labels: Xtensa LX4; Instr. SPM; Data SPM; DTU; PE (x8); R (x9); DRAM; Mem Ctrl.]
• Cores attached to NoC with DTU
• No privileged mode
• No MMU, no caches, but SPM
• Only simple DTU + SW emulation
Linux
• M3 runs on Linux using it as a virtual machine
• A process simulates a PE, having two threads (CPU + DTU)
• DTUs communicate over UNIX domain sockets
• No accuracy because
  – Programs are directly executed on host
  – Data transfers have huge overhead compared to HW
• Very useful for debugging and early prototyping
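The socket-based transport can be sketched with a connected pair of UNIX domain datagram sockets, one end per simulated DTU. This is a minimal illustration under stated assumptions: the `SimDtu` class and `make_socket_pair` helper are hypothetical names, both ends live in one process and one thread here, and the separate CPU and DTU threads of the real setup are not modelled.

```cpp
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a simulated DTU: it owns one end of a
// UNIX domain socket pair and exchanges whole messages over it.
class SimDtu {
public:
    explicit SimDtu(int fd) : fd_(fd) { assert(fd >= 0); }
    ~SimDtu() { ::close(fd_); }
    SimDtu(const SimDtu&) = delete;
    SimDtu& operator=(const SimDtu&) = delete;

    // Sends the message as a single datagram.
    bool send_msg(const std::string& msg) {
        ssize_t n = ::send(fd_, msg.data(), msg.size(), 0);
        return n == static_cast<ssize_t>(msg.size());
    }

    // Receives one datagram (up to 256 bytes in this sketch).
    bool recv_msg(std::string& out) {
        char buf[256];
        ssize_t n = ::recv(fd_, buf, sizeof(buf), 0);
        if (n < 0) return false;
        out.assign(buf, static_cast<std::size_t>(n));
        return true;
    }

private:
    int fd_;
};

// Create a connected pair of datagram sockets, one per simulated DTU.
bool make_socket_pair(int fds[2]) {
    return ::socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) == 0;
}
```

Every message in such a setup passes through the host kernel's socket layer, which is one reason the timing of data transfers says nothing about real hardware.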
gem5
• Modular platform for computer-system architecture research
• Supports various ISAs (x86, ARM, Alpha, ...)
• Cycle-accurate simulation
• Has an out-of-order CPU model
• We built a DTU for gem5
• Support for caches and virtual memory
gem5 – Example Configuration
[Figure: example gem5 configuration with several PEs. Labels: PE; ...; x86; L1, L2; SPM; DTU; TLB; PT; Ctrl; ME; DRAM]
Summary and Outlook
• Various different approaches
• Not clear yet how to handle heterogeneity
• Memory will get heterogeneous as well (NVM)
• Reconfigurable hardware will emerge