FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs
Jie Zhang, Miryeong Kwon, Donghyun Gouk, Sungjoon Koh, Changlim Lee, Mohammad Alian, Myoungjun Chun, Mahmut Kandemir, Nam Sung Kim, Jihong Kim and Myoungsoo Jung

Transcript of the presentation slides (32 pages).

Page 1:

FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs

Jie Zhang, Miryeong Kwon, Donghyun Gouk, Sungjoon Koh, Changlim Lee, Mohammad Alian, Myoungjun Chun, Mahmut Kandemir, Nam Sung Kim, Jihong Kim and Myoungsoo Jung

Page 2:

Executive Summary

FlashShare punches through the performance barriers:
• Reduces average turnaround response times by 22%;
• Reduces 99th-percentile turnaround response times by 31%.

[Figure: in the datacenter, latency-critical and throughput applications sit atop the storage stack (file system, block layer, NVMe driver, flash firmware) and a ULL-SSD with memory-like performance. The stack is a barrier: it is unaware of the ULL-SSD and of latency-critical applications, and interference turns ultra-low latency into longer I/O latency.]

Page 3:

Motivation: applications in datacenter

Datacenters execute a wide range of latency-critical workloads:
• Driven by the market of social media and web services;
• Required to satisfy a certain level of service-level agreement;
• Sensitive to latency (i.e., turnaround response time).

A typical example: Apache.
[Figure: an HTTP request arrives over the TCP/IP service, is queued, and is served by Apache worker threads that fetch data objects and respond, under a monitor process.]
A key metric: user experience.

Page 4:

Motivation: applications in datacenter

• Latency-critical applications exhibit varying loads during a day;
• Datacenters overprovision their server resources to meet the SLA;
• However, this results in low utilization and low energy efficiency.

Figure 1. Example diurnal pattern in queries per second for a Web Search cluster1 (varying loads over the hours of the day).
1Power Management of Online Data-Intensive Services.

Figure 2. CPU utilization analysis of a Google server cluster2 (fraction of time vs. CPU utilization; about 30% average utilization).
2The Datacenter as a Computer.

Page 5:

Motivation: applications in datacenter

A popular solution: co-locating latency-critical and throughput workloads (Micro'11, ISCA'15, EuroSys'14).

Page 6:

Challenge: applications in datacenter

Applications:
• Apache – online latency-critical application;
• PageRank – offline throughput application.

Server configuration:

Experiment: Apache+PageRank vs. Apache only

Performance metrics:
• SSD device latency;
• Response time of the latency-critical application.

Page 7:

Challenge: applications in datacenter

Experiment: Apache+PageRank vs. Apache only.

• The throughput-oriented application drastically increases the I/O access latency of the latency-critical application.
• This latency increase deteriorates the turnaround response time of the latency-critical application.

Fig 1: Apache SSD latency increases due to PageRank.
Fig 2: Apache response time increases due to PageRank.

Page 8:

Challenge: ULL-SSD

There are emerging Ultra-Low-Latency SSD (ULL-SSD) technologies, which can be used for faster I/O services in the datacenter.

            Z-NAND           XL-Flash         Optane            nvNitro
Technique   New NAND flash   New NAND flash   Phase-change RAM  MRAM
Vendor      Samsung          Toshiba          Intel             Everspin
Read        3us              N/A              10us              6us
Write       100us            N/A              10us              6us

Page 9:

Challenge: ULL-SSD

In this work, we use an engineering sample of Z-SSD.

Z-NAND1:
• Technology: SLC-based 3D NAND, 48 stacked word-line layers;
• Capacity: 64Gb/die;
• Page size: 2KB/page.

A Z-NAND-based SSD is marketed as a "Z-SSD".

[1] Cheong, Wooseong, et al. "A flash memory controller for 15μs ultra-low-latency SSD using high-speed 3D NAND flash with 3μs read time." IEEE International Solid-State Circuits Conference (ISSCC), 2018.

Page 10:

Challenge: Datacenter server with ULL-SSD

Applications:
• Apache – online latency-critical application;
• PageRank – offline throughput application.

Device latency analysis: [figure; annotations: 42x, 36us, 28us]

Server configuration:

Unfortunately, the short latency characteristics of ULL-SSD cannot be exposed to users (in particular, for the latency-critical applications).

Page 11:

Challenge: Datacenter server with ULL-SSD

The storage stack is unaware of the characteristics of both the latency-critical workload and the ULL-SSD.

[Figure: the host storage stack (applications, caching layer, filesystem, blkmq, NVMe driver) on top of the ULL-SSD.]

The current design of the blkmq layer, the NVMe driver, and the SSD firmware can hurt the performance of latency-critical applications: the ULL-SSD fails to deliver its short latency because of the storage stack.

Page 12:

Blkmq layer: challenge

[Figure: I/O submission through blkmq. Incoming requests are queued and merged in the software queue, then queued again in the hardware queue before dispatch to the NVMe driver.]

Software queue: holds latency-critical I/O requests for a long time;

Page 13:

Blkmq layer: challenge

[Figure: hardware-queue dispatch is gated by tokens (Token=1/Token=0); requests wait when no token is available, regardless of latency-criticality.]

• Software queue: holds latency-critical I/O requests for a long time;
• Hardware queue: dispatches an I/O request without knowledge of its latency-criticality.

Page 14:

Blkmq layer: optimization

Our solution: bypass.
[Figure: latency-critical requests (LatReq) bypass the blkmq software and hardware queues (no merge, no I/O scheduling), while throughput requests (ThrReq) are still merged in blkmq.]

• Latency-critical I/Os: bypass blkmq for a faster response (little penalty; the lost scheduling is addressed in the NVMe layer);
• Throughput I/Os: merge in blkmq for higher storage bandwidth.
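The bypass described above can be sketched as a toy Python model. All class and method names here are illustrative (this is not the Linux blk-mq API): latency-critical requests skip the software queue's merge/scheduling stage, while throughput requests still merge there.

```python
from collections import deque

class Request:
    def __init__(self, lba, nsec, latency_critical=False):
        self.lba = lba                        # starting logical block address
        self.nsec = nsec                      # number of sectors
        self.latency_critical = latency_critical

class BlkMq:
    def __init__(self):
        self.software_queue = deque()         # merge/scheduling stage
        self.hardware_queue = deque()         # dispatch stage

    def submit(self, req):
        if req.latency_critical:
            self.hardware_queue.append(req)   # bypass: straight to dispatch
            return
        # try to merge with an adjacent queued throughput request
        for q in self.software_queue:
            if q.lba + q.nsec == req.lba:
                q.nsec += req.nsec
                return
        self.software_queue.append(req)

    def dispatch_one(self):
        if self.hardware_queue:
            return self.hardware_queue.popleft()
        if self.software_queue:
            return self.software_queue.popleft()
        return None

mq = BlkMq()
mq.submit(Request(0, 8))                          # throughput
mq.submit(Request(8, 8))                          # merges with the previous one
mq.submit(Request(100, 1, latency_critical=True))
first = mq.dispatch_one()
print(first.lba, first.latency_critical)          # 100 True
```

The latency-critical request leaves first even though it arrived last, while the two throughput requests were coalesced into a single 16-sector I/O.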

Page 15:

NVMe SQ: challenge (bypass is not simple enough)

[Figure: NVMe submission. Cores ring the SQ doorbell; the NVMe controller fetches commands from the SQ head in order, so a newly rung request at the tail waits behind prior fetches (+Tfetch, +2xTfetch).]

NVMe protocol-level queue: a latency-critical I/O request can be blocked by prior I/O requests; time cost = Tfetch-self + 2xTfetch (>200% overhead).

Page 16:

NVMe SQ: optimization

Target: a responsiveness-aware NVMe submission.
Key insight:
• Conventional NVMe controllers allow customizing the standard arbitration strategy for different NVMe protocol-level queue accesses;
• Thus, we can make the NVMe controller decide which NVMe command to fetch by sharing a hint of the I/O urgency.

Page 17:

NVMe SQ: optimization

Our solution:
[Figure: per-core doubled submission queues, a Lat-SQ for latency-critical I/Os and a Thr-SQ for throughput I/Os, each with its own doorbell; the NVMe controller fetches from the Lat-SQ immediately and postpones the Thr-SQ.]

1. Double the SQs (one for latency-critical I/Os, another for throughput I/Os);
2. Double the SQ doorbell register;
3. New arbitration strategy: give the highest priority to the Lat-SQ (immediate fetch).
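The three steps above can be modeled with a short Python sketch. Again, the names are illustrative rather than the NVMe specification's interface: the controller's fetch arbitration always drains the Lat-SQ before touching the Thr-SQ.

```python
from collections import deque

class NvmeController:
    def __init__(self):
        self.lat_sq = deque()   # submission queue for latency-critical I/Os
        self.thr_sq = deque()   # submission queue for throughput I/Os

    def ring_doorbell(self, cmd, latency_critical):
        # two doorbells: each queue gets its own submission path
        (self.lat_sq if latency_critical else self.thr_sq).append(cmd)

    def fetch(self):
        # new arbitration strategy: highest priority to the Lat-SQ
        if self.lat_sq:
            return self.lat_sq.popleft()
        if self.thr_sq:
            return self.thr_sq.popleft()
        return None

ctl = NvmeController()
ctl.ring_doorbell("thr-read-A", latency_critical=False)
ctl.ring_doorbell("thr-read-B", latency_critical=False)
ctl.ring_doorbell("lat-read",   latency_critical=True)
order = [ctl.fetch() for _ in range(3)]
print(order)   # ['lat-read', 'thr-read-A', 'thr-read-B']
```

Even though the latency-critical command was rung last, it is fetched first; throughput commands are merely postponed, not dropped.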

Page 18:

SSD firmware: challenge


• The embedded cache provides the fastest response (DRAM service): a hit costs TCL+TCACHE, while a miss costs TCL+TFTL+TNAND+TCACHE;
• The embedded cache can be polluted by throughput requests;
• It cannot protect latency-critical I/O from eviction.
[Figure: inside the ULL-SSD (NVMe controller, caching layer with a set-associative embedded cache, FTL, NAND flash), an incoming throughput request stream evicts the cached latency-critical entries.]

Page 19:

SSD firmware: optimization


Our design: split the internal cache space to protect latency-critical I/O requests.

[Figure: the embedded cache is partitioned with a protection region for latency-critical entries; throughput requests evict only within their own partition.]
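The cache split can be illustrated with a toy two-partition LRU cache in Python (a sketch of the idea, not the firmware's actual data structures): throughput traffic can only evict throughput entries, so latency-critical lines survive a pollution burst.

```python
from collections import OrderedDict

class SplitCache:
    def __init__(self, lat_ways, thr_ways):
        # one LRU partition per criticality class, keyed by the flag
        self.part = {True: OrderedDict(), False: OrderedDict()}
        self.ways = {True: lat_ways, False: thr_ways}

    def access(self, addr, latency_critical):
        region = self.part[latency_critical]
        if addr in region:
            region.move_to_end(addr)       # LRU hit: refresh recency
            return True
        if len(region) >= self.ways[latency_critical]:
            region.popitem(last=False)     # evict only within this partition
        region[addr] = True
        return False                       # miss: line is now cached

cache = SplitCache(lat_ways=2, thr_ways=2)
cache.access(0x1, latency_critical=True)       # latency-critical line cached
for a in (0x4, 0x5, 0x8, 0xB):                 # throughput burst
    cache.access(a, latency_critical=False)
print(cache.access(0x1, latency_critical=True))  # True
```

In an unsplit cache the four-address throughput burst would have evicted 0x1; with the protection region it is still a hit.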

Page 20:

NVMe CQ: challenge

NVMe completion: MSI overhead for each I/O request;

[Figure: interrupt-based I/O completion. The NVMe controller posts the completion and sends an MSI message; the interrupt controller raises a CPU interrupt (context switch, cost TCS), the interrupt service routine runs (cost TISR), and another context switch (TCS) returns to the blkmq layer. Total cost: 2xTCS + TISR.]

Page 21:

NVMe CQ: optimization

Key insight: state-of-the-art Linux supports a poll mechanism.
[Figure: the blkmq layer polls the CQ message directly, skipping the MSI interrupt, the interrupt service routine, and both context switches, saving 2xTCS + TISR per I/O.]

Page 22:

NVMe CQ: optimization

The poll mechanism can bring benefits to a fast storage device.
[Figure: average latency of interrupt vs. polling on the ULL-SSD for 4KB-32KB reads and writes; polling decreases latency by 7.5% for reads and 13.2% for writes.]

Page 23:

NVMe CQ: optimization

However, poll-based I/O services consume most of the host resources.
[Figure: polling keeps CPU utilization near 100% over time (vs. far lower utilization with interrupts) and raises the memory-bound fraction for 4KB-32KB requests.]

Page 24:

NVMe CQ: optimization

Our solution: selective interrupt service routine (Select-ISR).

[Figure: Select-ISR. Completions of latency-critical requests (LatReq) are reaped by the blkmq layer polling the CQ; for throughput requests (ThrReq), the CPU sleeps and the completion is delivered via the MSI interrupt and the interrupt service routine, with its context switches.]
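Select-ISR's cost trade-off can be written down as a tiny Python model. The unit costs below (TCS for a context switch, TISR for the service routine, TPOLL for a poll pass) are illustrative placeholders, not measured values:

```python
# Illustrative unit costs: context switch, interrupt service routine, one poll pass.
T_CS, T_ISR, T_POLL = 2, 3, 1

def completion_cost(latency_critical):
    if latency_critical:
        return T_POLL             # spin on the CQ entry: no context switches
    return 2 * T_CS + T_ISR       # MSI -> ISR -> back to the blkmq layer

def complete(cmd, latency_critical):
    # selective completion: poll for latency-critical I/Os, interrupt otherwise
    return {"cmd": cmd,
            "mode": "poll" if latency_critical else "interrupt",
            "cost": completion_cost(latency_critical)}

print(complete("lat-read", True))    # polled, cheap
print(complete("thr-read", False))   # interrupt path: 2*T_CS + T_ISR
```

Only latency-critical completions pay the polling CPU cost; throughput completions keep the cheap-on-CPU interrupt path, which matches the slide's goal of avoiding whole-system busy-waiting.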

Page 25:

Design: responsiveness awareness

Key insight: users have better knowledge of I/O responsiveness (i.e., latency-critical vs. throughput).
Our approach:
• Open a set of APIs to users, which pass the workload attribute to the Linux PCB: a new utility, chworkload_attr, invokes a new system call, which modifies the Linux PCB data structure.
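This flow can be sketched in Python. Everything here is a mock-up: the function names, the return conventions, and the PCB dictionary stand in for FlashShare's actual chworkload_attr utility, system call, and task_struct change, whose real interfaces the slides do not show.

```python
LATENCY_CRITICAL, THROUGHPUT = "latency-critical", "throughput"

pcb_table = {}   # pid -> mock process control block (stand-in for task_struct)

def sys_set_workload_attr(pid, attr):
    # stand-in for the new system call: record the attribute in the PCB
    pcb_table.setdefault(pid, {})["workload_attr"] = attr
    return 0

def chworkload_attr(pid, attr):
    # stand-in for the user-level utility: validate, then invoke the "syscall"
    if attr not in (LATENCY_CRITICAL, THROUGHPUT):
        return -1
    return sys_set_workload_attr(pid, attr)

chworkload_attr(1234, LATENCY_CRITICAL)
print(pcb_table[1234]["workload_attr"])   # latency-critical
```

Once the attribute lives in the per-process structure, every layer that handles an I/O issued by that process can consult it.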

Page 26:

Design: responsiveness awareness

Key insight: users have better knowledge of I/O responsiveness (i.e., latency-critical vs. throughput).
Our approach:
• Open a set of APIs to users, which pass the workload attribute to the Linux PCB;
• Deliver the workload attribute to each layer of the storage stack.
[Figure: the attribute travels from the user process's task_struct through the page cache (address_space), the file system's BIO, the block layer's (blk-mq) request, and the NVMe driver's nvme_rw_command (carried in the rsvd2 field of nvme_cmd) to the NVMe controller's tag array and the embedded cache.]
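The attribute delivery can be mimicked with a short Python sketch. The dictionaries stand in for the kernel descriptors named on the slide (BIO, blk-mq request, nvme_rw_command); only the rsvd2 field name is taken from the slide, and the rest is an illustrative assumption:

```python
def make_bio(task, lba):
    # file system: copy the attribute from the PCB into the BIO
    return {"lba": lba, "attr": task["workload_attr"]}

def make_request(bio):
    # block layer (blk-mq): propagate the attribute into the request
    return {"bio": bio, "attr": bio["attr"]}

def make_nvme_cmd(req):
    # NVMe driver: the hint rides in a reserved field (rsvd2) of the command
    return {"opcode": "read", "slba": req["bio"]["lba"],
            "rsvd2": req["attr"]}

task = {"workload_attr": "latency-critical"}
cmd = make_nvme_cmd(make_request(make_bio(task, lba=0x100)))
print(cmd["rsvd2"])   # latency-critical
```

By the time the command reaches the controller, the criticality hint is available to the arbitration logic and the embedded cache without any extra round trip to the host.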

Page 27:

More optimizations

Advanced caching-layer designs:
• Dynamic cache-split scheme: maximizes cache hits under various request patterns;
• Read prefetching: better utilizes SSD internal parallelism;
• Adjustable read prefetching with a ghost cache: adapts to different request patterns.

Hardware accelerator designs:
• Offload simple but time-consuming tasks such as I/O polling and I/O merging;
• Simplify the design of blkmq and the NVMe driver.

Page 28:

Experiment Setup

Test environment: http://simplessd.org

System configurations:
• Vanilla – a vanilla Linux-based computer system running on Z-SSD;
• CacheOpt – compared to Vanilla, optimizes the cache layer of the SSD firmware;
• KernelOpt – optimizes the blkmq layer and NVMe I/O submission;
• SelectISR – compared to KernelOpt, adds the selective-ISR optimization.

Page 29:

Evaluation: latency breakdown

• KernelOpt reduces the time cost of the blkmq layer by 46% by eliminating queuing time;
• As latency-critical I/Os are fetched by the NVMe controller immediately, KernelOpt drastically reduces the waiting time;
• CacheOpt better utilizes the embedded cache layer and reduces SSD access delays by 38%;
• By selectively using the polling mechanism, SelectISR reduces the I/O completion time by 5us.

Page 30:

Evaluation: online I/O access

• CacheOpt reduces the average I/O service latency, but cannot eliminate the long tails;
• KernelOpt removes the long tails, because it avoids long queuing times and prevents throughput I/Os from blocking latency-critical I/Os;
• SelectISR reduces the average latency further, thanks to selectively using the poll mechanism.

Page 31:

Conclusion

Observation: the ultra-low latency of new memory-based SSDs is not exposed to latency-critical applications and thus brings no benefit from the user-experience angle.
Challenge: piecemeal reformations of the current storage stack won't work due to multiple barriers; the storage stack is unaware of the behaviors of both the ULL-SSD and latency-critical applications.
Our solution: FlashShare exposes different levels of I/O responsiveness to the key components of the current storage stack and optimizes the corresponding system layers to make ultra-low latency visible to users (latency-critical applications).
Major results:
• Reduces average turnaround response times by 22%;
• Reduces 99th-percentile turnaround response times by 31%.

Page 32:

Thank you