High Performance Data Center Operating Systemshomes.cs.washington.edu/~tom/talks/os.pdfRedis HW 13%...

High Performance Data Center Operating Systems

TomAndersonAntoineKaufmann,YoungjinKwon,NaveenKr.Sharma,ArvindKrishnamurthy,SimonPeter,MothyRoscoe,and

EmmettWitchel

An OS for the Data Center

•  ServerI/Operformancematters•  Key-valuestores,web&fileservers,databases,mailservers,machinelearning,…

• Canwedeliverperformanceclosetohardware?•  Examplesystem:DellPowerEdgeR520

IntelX52010GNIC

IntelRS3RAID1GBflash-backedcache

SandyBridgeCPU6cores,2.2GHz

Today’sI/Odevicesarefastandgettingfaster

2us/1KBpacket 25us/1KBwrite

40GNIC:500ns/1KBpacketNVDIMM:500ns/1KBwrite

Can’t we just use Linux?

Kernel

Linux I/O Performance

RedisHW13%

Kernel84%

Kernel62%

App20%

%OF1KBREQUESTTIMESPENT

API Multiplexing

Naming Resourcelimits

Accesscontrol I/OScheduling

I/OProcessing Copying

Protection

DataPath

10GNIC2us/1KBpacket

RAIDStorage25us/1KBwrite

Kernelmediationistooheavyweight

Arrakis: Separate the OS control and data plane

KernelNaming Resourcelimits

Accesscontrol

How to skip the kernel?

I/ODevices

API Multiplexing

I/OScheduling

I/OProcessing Copying

Protection

DataPath

Kernel

Naming

Resourcelimits

Accesscontrol

Kernel

Naming

Resourcelimits

Accesscontrol

Arrakis I/O Architecture

I/ODevices

Multiplexing

I/OScheduling

I/OProcessing

Protection

DataPath

ControlPlane DataPlane

•  StreamlinenetworkandstorageI/O•  EliminateOSmediationinthecommoncase•  Application-specificcustomizationvs.kernelonesizefitsall

• KeepOSfunctionality•  Process(container)isolationandprotection•  Resourcearbitration,enforceableresourcelimits•  Globalnaming,sharingsemantics

• POSIXcompatibilityattheapplicationlevel•  AdditionalperformancegainsfromrewritingtheAPI

Design Goals

This Talk

Arrakis(OSDI14)•  OSarchitecturethatseparatesthecontrolanddataplane,forbothnetworkingandstorage

Strata(SOSP17)•  Filesystemdesignforlowlatencypersistence(NVM)andmulti-tierstorage(NVM,SSD,HDD)

TCPasaService/FlexNIC/Floem(ASPLOS15,OSDI18)•  OS,NIC,andapplibrarysupportforfast,agile,secureprotocolprocessing

Storage diversification

Betterperform

Highercapacity

Largeerasureblock:hardwareGCoverheadRandomwritescause5-6xslowdownbyGC

Byte-addressable:cache-linegranularityIODirectaccesswithload/storeinstructions

Latency Throughput $/GBDRAM 80ns 200GB/s 10.8NVDIMM 200ns 20GB/s 2SSD 10us 2.4GB/s 0.26HDD 10ms 0.25GB/s 0.02

Let’s Build a Fast Server

Keyvaluestore,database,fileserver,mailserver,…Requirements:

•  Smallupdatesdominate• Datasetscalesuptomanyterabytes• Updatesmustbecrashconsistent

A fast server on today’s file system

Small,randomIOisslow!

• Smallupdates(1Kbytes)dominate• Datasetscalesupto100TB• Updatesmustbecrashconsistent

Evenwithanoptimizedkernelfilesystem,NVMistoofast,kernelisthebottleneck

Application

Kernelfilesystem

NVM 0 1.5 3 4.5 6

IO latency (us)

Write to device Kernel code 91%

Kernelfilesystem:NOVA[FAST16,SOSP17]

UsingonlyNVMistooexpensive!$200Kfor100TB

Kernelfilesystem

Application

Tosavecost,needawaytousemultipledevicetypes:NVM,SSD,HDD

Block-levelcaching

NVMSSDHDD

Block-levelcachingmanagesdatainblocks,butNVMisbyte-addressable!

Kernelfilesystem

0 3 6 9 12 15

IO latency (us)

Byte-addressed NVM

Better

Application

Forlow-costcapacitywithhighperformance,mustleveragemultipledevicetypes

Pillaietal.,OSDI2014

0 2 4 6 8 10 12

SQLite ZooKeeper

HSQLDB Git

Crash vulnerabilities

Applicationsstruggleforcrashconsistency

Today’s file systems: Limited by old design assumptions

KernelmediateseveryoperationNVMistoofast,kernelisthebottleneckTiedtoasingletypeofdeviceForlow-costcapacitywithhighperformance,mustleveragemultipledevicetypes(NVM,SSD,HDD)AggressivecachinginDRAM,onlywritetodevicewhenyoumust(fsync)Applicationsstruggleforcrashconsistency

Strata: A Cross Media File System

Performance:Especiallysmall,randomIO• Fastuser-leveldeviceaccess

Capacity:leverageNVM,SSD&HDDforlowcost• Transparentdatamigrationacrossdifferentmedia• EfficientlyhandledeviceIOproperties

Simplicity:intuitivecrashconsistencymodel• In-order,synchronousIO• Nofsync()required

Strata: main design principle

LogoperationstoNVMatuser-level

Intuitivecrashconsistency Kernelbypass,butprivate

Coordinatemulti-processaccesses

DigestandmigratedatainkernelApplylogoperationstoshareddata

KernelFS

Performance,simplicity:

Capacity:

Strata

•  LibFS:logoperationstoNVMatuser-level•  Fastuser-levelaccess•  In-order,synchronousIO

•  KernelFS:Digestandmigratedatainkernel•  Asynchronousdigest•  Transparentdatamigration•  Sharedfileaccess

Log operations to NVM at user-level

•  Fastwrites• DirectlyaccessfastNVM

•  Sequentiallyappenddata• Cache-linegranularity• Blindwrites

• Crashconsistency• Oncrash,kernelreplayslog

Application

Kernel-bypass

NVMPrivateoperationlog

creatsyscall renamesyscall …Fileoperations(data&metadata)madebyasinglesystemcall

Intuitive crash consistency

Application

Kernel-bypass

SynchronousIO

• Wheneachsystemcallreturns:• Data/metadataisdurable•  In-orderupdate• Atomicwrite• Limitedsize(logsize)

FastsynchronousIO:NVMandkernel-bypass

Privateoperationlog

fsync()isno-op

KernelFS

•  Visibility:makeprivatelogvisibletootherapplications

•  Datalayout:turnwrite-optimizedtoread-optimizedformat

•  Large,batchedIO•  Coalescelog

Digest data in kernel

Digest

SharedareaPrivateoperationlog

Application

Digest optimization: Log coalescing SQLite,Mailserver:crashconsistentupdateusingwriteaheadlogging

Create journal fileWrite data to journal fileWrite data to database fileDelete journal file

Digesteliminatesunneededwork

Write data to database file

Removetemporarydurablewrites

Privateoperationlog

Application

LibFS KernelFS

Throughputoptimization:LogcoalescingsavesIOwhiledigesting

NVMSharedarea

ApplicationStrata: LibFS

Strata: KernelFS NVMSharedareaPrivateoperationlog

Digest and migrate data in kernel

ApplicationStrata: LibFS

Strata: KernelFS NVMSharedareaPrivateoperationlog

SSDSharedarea

HDDSharedarea

Low-costcapacityKernelFSmigratesdatatolowerlayer

Digest and migrate data in kernel

NVMdataLogs

Digest

Resembleslog-structuredmerge(LSM)tree

HandledeviceIOproperties

Write1GBsequentiallyfromNVMtoSSD

AvoidSSDgarbagecollectionoverhead

Device management overhead

0.1 0.25 0.5 0.6 0.7 0.8 0.9 1

SSD utilization

64 MB 128 MB 256 MB 512 MB 1024 MB

UseNVMlayeraspersistentwritebuffer

SSDpreferslargesequentialIO

Read: hierarchical search

Application

KernelFS

NVMSharedareaPrivateOPlog

SSDSharedarea

HDDSharedarea

NVMdata

SSDdata

HDDdata

Logdata21

Searchorder

Shared file access

Leasesgrantaccessrightstoapplications[SOSP’89]RequiredforfilesanddirectoriesFunctionlikelock,butrevocableExclusivewriter,sharedreaders

Onrevocation,LibFSdigestsleaseddataLeasesserializeconcurrentupdates

Shared file access

LeasesgrantaccessrightstoapplicationsAppliedtoadirectoryorafileExclusivewriter,sharedreaders

Application1 LibFS

KernelFS

Application2 LibFS

RequestwriteleasetofileA

LWritefileAdata

RequestwriteleasetofileA

Revokethewritelease

WritefileAdata

Example:concurrentwritestothesamefileA

OPlog1 OPlog2 Sharedarea

Experimental setup

•  2xIntelXeonE5-2640CPU,64GBDRAM•  400GBNVMeSSD,1TBHDD•  Ubuntu16.04LTS,Linuxkernel4.8.12

•  EmulatedNVM•  Use40GBofDRAM•  Performancemodel[Y.Zhangetal.MSST2015]

•  Throttlelatency&throughputinsoftware

• CompareStratavs.•  PMFS,Nova,ext4-DAX:NVMkernelfilesystems•  Nova:atomicupdate,in-ordersynchI/O•  PMFS,ext4-DAX:noatomicwrite

Latency: LevelDB

Write sync.

Write rand.

Delete rand.

Strata PMFS NOVA EXT4-DAX

35.249.2 37.7

Better25%better

17%better

Latency(us)

LevelDB(NVM)

Keysize:16B

Valuesize:1KB

300,000objects

Levelcompactioncausesasynchronousdigests

Fastuser-levellogging

Randomwrite25%betterthanPMFS

Overwrite17%betterthanPMFS

Throughput: Varmail

MailserverworkloadfromFilebench•  UsingonlyNVM•  10000files•  Read/Writeratiois1:1• Write-aheadlogging

Logcoalescingeliminates86%oflogrecords,saving14GBofIO

0K 100K 200K 300K 400K

Strata

Throughput (op/s)

Better

29%better

Create journal fileWrite data to journalWrite data to database fileDelete journal file

Digest eliminates unneeded work

Write data to database file

Removes temporary durable writes

KernelFSApplication

Log coalescing

This Talk

Keyvaluestore,database,mailserver,ML,…Requirements:

• MostlysmallRPCsoverTCP•  40Gbpsnetworklinks(100+Gbpssoon)•  Enforceableresourcesharing(multi-tenant)• Agileprotocoldevelopment:kernelandapp•  Taillatency,costefficienthardware

Application

Linuxnetworkstack

40GbpsNIC

Kernelmediationtooslow

•  SmallRPCsdominate•  Enforceableresourcesharing•  Agileprotocoldevelopment•  Cost-efficienthardware

Overhead: 97%

• Directaccesstodeviceatuser-level• Multiplexing

•  SR-IOV:VirtualPCIdevicesw/ownregisters,queues,INTs

• Protection•  IOMMU:DMAto/fromappvirtualmemory

•  Packetfilters:ex:legalsourceIPheader

• mTCP:2-3xfasterthanLinux• Whoenforcescongestioncontrol?

Hardware I/O Virtualization

SR-IOVNIC

Packetfilters

Network

Ratelimiters

User-levelVNIC1

User-levelVNIC2

Remote DMA (RDMA)

Programmingmodel:read/writeto(limited)regionofremoteservermemory

•  Modeldatestothe80’s(AlfredSpector)•  HPCcommunityrevivedforcommunicationwithinarack•  ExtendedtodatacenteroverEthernet(RoCE)•  Commerciallyavailable100GNICs

NoCPUinvolvementontheremotenode•  Fastifappcanuseprogrammingmodel

Limitations:• Whatifyouneedremoteapplicationcomputation(RPC)?•  Losslessmodelisperformance-fragile

Smart NICs (Cavium, …)

NICwitharrayoflow-endCPUcores(Cavium,…)IfcomputeontheNIC,maybedon’tneedtogoCPU?

•  ApplicationsinhighspeedtradingWe’vebeenherebefore:“wheelofreinvention”

•  Hardwarerelativelyexpensive•  AppsoftensloweronNICvs.CPU(cf.Floem)

Step 1

BuildafasterkernelTCPinsoftwareNochangeinisolation,resourceallocation,APIQ:WhyisRPCoverLinuxTCPissoslow?

OS - Hardware Interface

• Highly optimized code path • Buffer descriptor queues

•  No interrupt in common case •  Maximize concurrency

Core1 Core2 Core3 Core4

NetworkInterfaceCard

Operating System Tx Rx Tx Rx Tx Rx Tx Rx

OS Transmit Packet Processing

•  TCPlayer:movefromsocketbuffertoIPqueue•  Locksocket•  Congestion/flowcontollimit•  FillinTCPheader,calculatechecksum•  Copydata•  Armre-transmissiontimeout

•  IPlayer:•  firewall,routing,ARP,trafficshaping

• Driver:movefromIPqueuetoNICqueue• Allocateandfreepacketbuffers

Sidebar: Tail Latency

OnLinuxwitha40Gbpslink,400outboundTPCflowssendingRPCs,nocongestion,whatistheminimumrateacrossallflows?

Kernel and Socket Overhead

events = poll(….) For e in events: If e.socket != listen_sock: receive(e.socket, ….) … send(e.socket, ….)

Multiple synchronous kernel transitions: • Parameter checks and copies • Cache pollution, pipeline stalls

Application Kernel

Synchronoussystemcalls

TCP Acceleration as a Service (TaS)

•  TCPasauser-levelOSservice•  SRIO-Vtodedicatedcores•  Scalenumberofcoresup/downtomatchdemand•  Optimizeddataplaneforcommoncaseoperations

• Applicationusesitsowndedicatedcores•  Avoidpollutingapplicationlevelcache•  Fewercores=>betterperformancescaling

•  Totheapplication,per-sockettx/rxqueueswithdoorbells

•  Analogoustohardwaredevicetx/rxqueues

Streamline common-case data path • Removeunneededcomputationfromdatapath

•  Congestioncontrol,timeoutsperRTT(notperpacket)

• Minimizeper-flowTCPstate•  prefetch2cachelinesonpacketarrival

•  Linearizedcode•  Betterbranchprediction•  Super-scalarexecution

•  EnforceIPlevelaccesscontroloncontrolplaneatconnectionsetup

Small RPC Microbenchmark

App:1FP:1

App:4FP:7 App:8

FP:114.9x

•  IX:fastkernelTCPwithsyscallbatching,non-socketAPI•  Linux/TaSlatency:7.3x;IX/TaSlatency:2.9x

TAS is workload proportional

0 10 20 30 40 50 60 70 80 90 1001

Latency[us]

Time[s]

FastpathCores[#

Cores Latency

•  Setup:4clientsstartingevery10seconds,thenstoppingincrementally

Step 2

TCPasaServicecansaturatea40GbpslinkwithsmallRPCs,butwhatabout100Gbpsor400Gbpslinks?

Networklinkspeedsscalingupfasterthancores

WhatNIChardwaredoweneedfastdatacentercommunication?

TCPasaServicedataplanecanbeefficientlybuiltinhardware

FlexNIC Design Principles

• RPCsarethecommoncase•  Kernelbypasstoapplicationlogic

•  Enforceableper-flowresourcesharing•  Dataplaneinhardware,policyinkernel

• Agileprotocoldevelopment•  Protocolagnostic(ex:TimelyandDCTCPandRDMA)•  Offloadbothkernelandapppackethandling

• Cost-efficient•  Minimalinstructionsetforpacketprocessing

FlexNIC Hardware Model Programmable

ParserPacket

Stream ...TCAM

REGs...

EgressQueues

TCP UDP

•  TCAMforarbitrarywildcardmatches

•  SRAMforexact/LPMlookups

•  Statefulmemoryforcounters

•  ALUsformodifyingheadersandregisters

1.p=lookup(eth.dst_mac)

2.pkt.egress_port=p

3.counter[ipv4.src_ip]++

Action

49BarefootNetworksswitchRMT~6Tbps(withparallelstreams)

FlexNIC Hardware Model

•  TransformpacketsforefficientprocessinginSW•  DMAdirectlyintoandoutofapplicationdatastructures•  SendacknowledgementsonNIC•  Queuemanagerimplementsratelimits•  Improvelocalitybysteeringtocoresbasedonappcriteria

RxPipeline

DBPipeline

TxPipeline

DMAPipeline PCIeDMA

FromNetwork

ToNetwork

FromPCIe(Doorbells)

FlexTCP: H/W Accelerated TCP

•  FastpathissimpleenoughforFlexNICmodel• ApplicationsdirectlyaccessNICforRX/TX

•  SimilarinterfacetoTCPasaservice:in-memoryqueues•  Softwareslow-pathmanagesNICstate

•  StreamlinesNICprocessing•  ACKsconsumed/generatedinNIC,reducesPCIetraffic•  Nodescriptorqueuesw/dependentDMAreads

•  Evaluation:softwareFlexNICemulator

FlexTCP Performance

•  Latency:7.8xbettervsLinux

10.7xvsLinux4.1xvsmTCP

7.2xvsLinux2.2xvsmTCP

Streamlining App RPC Processing

• How do we further reduce CPU overhead? •  Integration of app logic with network code

•  Remove socket abstraction, streamline app code

• But fundamental app bottlenecks remain •  Data structure synchronization, cache utilization •  Copies between app data structures and RPC buffers

•  Idea: leverage FlexNIC to streamline app •  FlexNIC is protocol agnostic, can parse on app protocol

Example: Key-Value Store

Hash Table

Core2NIC

Receive-sidescaling:core=hash(connection)%N

Client1K=3,4

Client2K=1,4,7

Client3K=1,7,8

•  Lockcontention•  Poorcacheutilization

Optimizing Reads: Key-based Steering

Core2NIC

Hash Table

Client1K=4,3

Client2K=1,4,7

Client3K=1,7,8

Match:IFudp.port==kvsAction:core=HASH(kvs.key)%2DMAhash,kvsTOCores[core]•  Nolocksneeded

•  Bettercacheutilization40%higherCPUefficiency

Summary

Biography

• College:physics->psychology->philosophy•  TookthreeCSclassesasasenior

• Aftercollege:developedanOSforaz80•  Afterprojectshipped,projectgotcancelled•  Appliedtogradschool:sevenoutofeightturnedmedown

• Gradschool•  Learnedalot•  Dissertationhadzerocommercialimpactfordecades

• Asfaculty•  IlearnalotfromthepeopleIworkwith•  Trytopicktopicsthatmatter,andwhereIgettolearn

What about the CPU overhead?

Example: Key-Value Store

HashTable

Core2NIC

Receive-sidescaling:core=hash(connection)%N

Client1K=3,4

Client2K=1,4,7

Client3K=1,7,8

•  Lockcontention•  Coherencetraffic•  Poorcacheutilization

Optimizing Reads: Key-based Steering

Core2NIC

HashTable

Client1K=4,3

Client2K=1,4,7

Client3K=1,7,8

Match:IFudp.port==kvsAction:core=HASH(kvs.key)%2DMAhash,kvsTOCores[core]•  Nolocksneeded

•  Nocoherencetraffic•  Bettercacheutilization

Where is hardware going?

The Example of Bitcoin

Bitcoinisaprotocolformaintainingabyzantinefaulttolerantpublicledger

•  Cryptographicallysignedledgerofofasequenceoftransfersofdigitalcurrency,butcouldbeanything

Providesanincentivetoparticipateinthefaulttolerantprotocol

•  Proofofwork:Solveacryptographicpuzzle,getareward•  Hardtomonopolize

Extremelylowbandwidth:afewtransactions/second•  EnergycostroughlythesameasIreland

Bitcoin Hardware Progression

CPU->GPU->FPGA->LowtechVLSI->HightechVLSIMarketprice:1petahashofSHA-256=$0.01

Price and Difficulty

FPGAs, GPU, >55nm out of business

High Performance Data Center Operating Systemshomes.cs.washington.edu/~tom/talks/os.pdfRedis HW 13%...

Documents

Transcript of High Performance Data Center Operating Systemshomes.cs.washington.edu/~tom/talks/os.pdfRedis HW 13%...

UNDERSTANDING ARM HW D O - eLinux.org · Kernel debugging ... OpenOCD Project ... Set H/W breakpoint using the JTAG Load the vmlinux via the JTAG ...

Kernel Recipes 2015 - Kernel dump analysis

Kernel and Non Kernel Clauses

OpenMP Affinity - IXPUG · 2019-12-13 · Setting Mask 9 User App Running App OS Kernel Hardware N MPI Tasks M OMP Threads Hardware/OS Setup Environment Variables OpenMP Aﬃnity

Kernel Exploitation - exploit.courses · Kernel Exploitation - Content. compass-security.com 3 Kernel Basics Kernel Exploitation. ... (post breach persistence / hiding) Backdoors

Kernel Architecture : UNIX Kernel - NPTel

AI, app-defined OS, and the ossified kernel

Digital Transformation of Healthcare - Computerworldweb.idg.no/app/web/online/Event/hw/2012/presentasjoner/Naraghi.pdf · Digital Transformation of Healthcare ... Source: IBM “Redefining

Rushyl hw app

BIOGEOCHEMICAL BUDGET THROUGH THE STRAIT …doga.ogs.trieste.it/doga/echo/ecem07/Main/30Nov/Macias_D.pdf · HW+5 HW-5 HW-3 HW-1 HW+1 HW+3 HW+5 HW-5 HW-3 HW-1 HW+1 HW+3 HW+5 Two layer

ENGINEERING HW - Lost River Development - Horace, ND€¦ · x x x x x bcl bcl bcl bcl bcl bcl bcl bcl bcl bcl bcl bcl bcl bcl bcl hw hw hw hw hw hw hw hw hw hw hw hw hw hw hw drainage

Tony 684 User Data App Services Arun 765 Thomas 762 Shell Kernel Shawn 626 App Roberth 779 Enterprise John Ignite Store Barclay 695 Andrew 617 App Model.

IA-64 Linux Kernel Technical Update · – kernel configurable to 4, 8, 16, or 64KB – use getpagesize() to get page size in app (DON’T hardcode any particular value) – why a

Kernel Methods Arie Nakhmani. Outline Kernel Smoothers Kernel Density Estimators Kernel Density Classifiers.

Building Blocks for Embedded Power Management · • PM as building blocks – Strong base: HW features + kernel support – Suspend and Resume – managing idle – DVFS – application

Take Out Yesterday's HW Tuesday's HW - 1.3 HW · Take Out Yesterday's HW 1.4 discussion over angles Doing HW on Math XL? HW - 1.4 HW Page 31 #6, 9, 19, 21, 23 ... Using the Angle

REGISTERED - Design Eq...baratta street steps brick garage hw hw hw hw hw hw hw w w w w carport brick garage hw hw hw hw hw w w w hw hw hw carport brick garage brick garage w 16.66

Kernel Recipes 2016 - The kernel report

2017 Summer Camp day5 part1 - University of Cambridge€¦ · §OSNT is an open source HW/SW platform for network testing OSNT Hardware NetFPGA OS Kernel Kernel Driver OSNT Applications

Enterprise Wide Meta Data Archiveweb.idg.no/app/web/online/Event/hw/2012/presentasjoner/Flamme.pdf · • VMWare Products (Vsphere, LabManager, ….) • Network Environment ... •