ReFlex: Remote Flash at the Performance of Local Flash
Ana Klimovic, Heiner Litz, Christos Kozyrakis
Flash Memory Summit 2017, Santa Clara, CA
Stanford MAST
Flash in Datacenters
§ NVMe Flash
 • 1,000x higher throughput than disk (1M IOPS)
 • 100x lower latency than disk (50-70 µs)
§ But Flash is often underutilized due to imbalanced resource requirements
Example Datacenter Flash Use-Case
[Diagram: clients connect over TCP/IP to an application tier of app servers; the app tier issues get(k)/put(k,val) requests to a datastore tier running a key-value store service; each tier spans software and hardware resources: CPU, RAM, Flash, NIC]
Imbalanced Resource Utilization
Sample utilization of Facebook servers hosting a Flash-based key-value store over 6 months
[Figure: server resource utilization over time]
Flash capacity and bandwidth are underutilized for long periods of time
[EuroSys'16] Flash storage disaggregation. Ana Klimovic, Christos Kozyrakis, Eno Thereska, Binu John, Sanjeev Kumar.
Solution: Remote Access to Storage
§ Improve resource utilization by sharing NVMe Flash between remote tenants
§ There are 3 main concerns:
 1. Performance overhead for remote access
 2. Interference on shared Flash
 3. Flexibility of Flash usage
Issue 1: Performance Overhead
[Figure: p95 read latency (µs) vs. IOPS (thousands) for 4kB random reads on one core: local Flash vs. iSCSI vs. libaio+libevent]
• Remote access costs a 4x throughput drop and a 2x latency increase
• Traditional network/storage protocols and Linux I/O libraries (e.g. libaio, libevent) have high overhead
Issue 2: Performance Interference
[Figure: p95 read latency (µs) vs. total IOPS (thousands) for workloads ranging from 50% to 100% reads]
• Writes impact read tail latency
• Latency depends on IOPS load
To share Flash, we need to enforce performance isolation
Issue 3: Flexibility
§ Flexibility is key
§ Want to allocate Flash capacity and bandwidth for applications:
 • On any machine in the datacenter
 • At any scale
 • Accessed using any network-storage protocol
How does ReFlex achieve high performance?
Linux vs. ReFlex
[Diagram: in Linux, the remote storage application sits in user space above the kernel's filesystem/block I/O layers and device driver, which own the network interface and Flash storage; in ReFlex, the application and the ReFlex data plane run in user space with direct access to the NIC and Flash, alongside a separate control plane]
• Remove SW bloat by separating the control and data planes
• Direct access to hardware via DPDK (network) and SPDK (NVMe); one data plane per CPU core
• Polling instead of interrupts (no IRQs on the data path)
• Run-to-completion processing
• Adaptive batching
• Zero-copy, device-to-device data transfers
A sketch of the resulting event loop follows below.
[ASPLOS'17] ReFlex: Remote Flash ≈ Local Flash. Ana Klimovic, Heiner Litz, Christos Kozyrakis.
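These techniques imply a particular event-loop structure. Below is a minimal sketch of one per-core data plane, under stated assumptions: net_poll_requests, nvme_submit, nvme_poll_completions, and net_send_response are illustrative stubs, not the actual ReFlex, DPDK, or SPDK APIs. The loop polls rather than blocking on interrupts, drains an adaptively sized batch, and runs each request to completion.

/* Hypothetical sketch of a per-core polling, run-to-completion data
 * plane loop in the spirit of ReFlex; not the actual implementation. */
#include <stddef.h>
#include <stdint.h>

#define MAX_BATCH 64                 /* upper bound for adaptive batching */

typedef struct {
    uint64_t lba;                    /* logical block address */
    uint32_t num_blocks;
    int      is_write;
} io_req;

/* Illustrative stubs: a real data plane would wrap a poll-mode NIC
 * driver (DPDK) and a user-space NVMe driver (SPDK) here. */
static size_t net_poll_requests(io_req *out, size_t max) { (void)out; (void)max; return 0; }
static void   nvme_submit(const io_req *r) { (void)r; }
static size_t nvme_poll_completions(io_req *out, size_t max) { (void)out; (void)max; return 0; }
static void   net_send_response(const io_req *r) { (void)r; }

static void data_plane_loop(void)
{
    io_req batch[MAX_BATCH];

    for (;;) {                       /* poll continuously; no interrupts */
        /* Adaptive batching: drain whatever arrived since the last
         * iteration, up to MAX_BATCH, rather than a fixed batch size. */
        size_t n = net_poll_requests(batch, MAX_BATCH);
        for (size_t i = 0; i < n; i++)
            nvme_submit(&batch[i]);  /* run to completion: each request is
                                        parsed and submitted before the
                                        core polls for more work */

        /* Completions are polled too; responses would go straight from
         * device buffers to the NIC (zero-copy device to device). */
        size_t c = nvme_poll_completions(batch, MAX_BATCH);
        for (size_t i = 0; i < c; i++)
            net_send_response(&batch[i]);
    }
}

int main(void) { data_plane_loop(); return 0; }

One such loop is pinned to each CPU core, so cores share nothing and scale independently.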
How does ReFlex enable performance isolation?
§ Request-cost-based scheduling
§ Determine the impact of tenant A on the tail latency and IOPS of tenant B
§ Control plane assigns each tenant a quota
§ Data plane enforces quotas through throttling (see the token-bucket sketch below)
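A minimal sketch of quota enforcement via a per-tenant token bucket; token_bucket, refill, and admit are hypothetical names, not ReFlex's actual scheduler interface. The control plane sets the refill rate; the data plane debits tokens per request.

#include <stdbool.h>

/* Hypothetical per-tenant token bucket: the control plane assigns
 * the rate (the quota); the data plane debits it per request. */
typedef struct {
    double tokens;   /* tokens currently available */
    double rate;     /* quota: tokens replenished per second */
    double burst;    /* cap on accumulated tokens */
} token_bucket;

/* Called periodically by the data plane with the elapsed time. */
void refill(token_bucket *b, double elapsed_sec)
{
    b->tokens += b->rate * elapsed_sec;
    if (b->tokens > b->burst)
        b->tokens = b->burst;
}

/* Returns true if the request may be issued now; otherwise the
 * data plane throttles it until enough tokens accumulate. */
bool admit(token_bucket *b, double cost)
{
    if (b->tokens < cost)
        return false;
    b->tokens -= cost;
    return true;
}

The cost argument is where request cost modeling (next slide) comes in: reads and writes debit different numbers of tokens.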
Request Cost Modeling
[Figures: p95 read latency (µs) vs. total IOPS (thousands), and vs. weighted IOPS (×10³ tokens/s), for workloads ranging from 50% to 100% reads]
• For this device: a write costs 10x a read
• Weighting requests by cost compensates for read-write asymmetry (see the sketch below)
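Given the calibration above, the per-request cost for this device might be expressed as follows; this is a sketch, and the 10:1 weight is the measured value for this particular device, not a universal constant.

/* Device-specific request costs in tokens, from offline profiling.
 * On the device profiled above, a write costs ~10x a read. */
#define READ_COST   1.0
#define WRITE_COST 10.0

double request_cost(int is_write)
{
    return is_write ? WRITE_COST : READ_COST;
}

/* Example: a tenant issuing 100K reads/s and 50K writes/s consumes
 * 100K*1 + 50K*10 = 600K tokens/s of weighted device capacity. */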
Request Cost Based Scheduling
Tenant A SLO: 1 ms tail latency, 200K IOPS
• At a 1 ms tail-latency SLO, the device sustains up to 510K IOPS
• Tenant A reserves 200K IOPS, leaving 310K IOPS of slack
Request Cost Based Scheduling
Tenant B SLO: 1.5 ms tail latency, 100K IOPS
• Tenant B's reservation is added to Tenant A's (1 ms tail latency, 200K IOPS)
• Against the 510K IOPS the device sustains at the 1 ms SLO, reserving 200K + 100K IOPS leaves 210K IOPS of slack
Request Cost Based Scheduling
Tenant C SLO: 0.5 ms tail latency, 200K IOPS
• At Tenant C's stricter 0.5 ms tail-latency SLO, the device sustains only 375K IOPS
• Tenants A (200K IOPS) and B (100K IOPS) already reserve 300K IOPS, leaving only 75K of slack
• Tenant C cannot register since the system cannot satisfy its SLO (see the admission-control sketch below)
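The three registration steps above amount to a simple admission test: at the strictest tail-latency SLO across all tenants (including the candidate), the device must sustain the sum of all reserved IOPS. A sketch follows; device_iops_at is a hypothetical stand-in for a lookup into the measured latency-vs-throughput profile, and the numbers in the stub come from the slides above.

#include <stdbool.h>
#include <stddef.h>

typedef struct {
    double latency_slo_ms;   /* tail-latency SLO */
    double iops_slo;         /* reserved (weighted) IOPS */
} tenant_slo;

/* Crude stand-in for the measured device profile shown earlier:
 * ~510K IOPS at a 1 ms p95 SLO, ~375K IOPS at 0.5 ms. */
double device_iops_at(double latency_slo_ms)
{
    return latency_slo_ms >= 1.0 ? 510e3 : 375e3;
}

bool can_admit(const tenant_slo *registered, size_t n, tenant_slo candidate)
{
    /* The strictest latency SLO bounds achievable device throughput. */
    double strictest = candidate.latency_slo_ms;
    double reserved  = candidate.iops_slo;

    for (size_t i = 0; i < n; i++) {
        if (registered[i].latency_slo_ms < strictest)
            strictest = registered[i].latency_slo_ms;
        reserved += registered[i].iops_slo;
    }
    return reserved <= device_iops_at(strictest);
}

With A (1 ms, 200K) and B (1.5 ms, 100K) registered, admitting C (0.5 ms, 200K) would require 500K IOPS at the 0.5 ms SLO, but the device sustains only 375K there, so C is rejected, matching the example above.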
Summary: Life of a Packet in ReFlex
[Diagram: a request is (1) received on the network RX queue, (2) processed by the TCP/IP stack, (3) delivered to the application through libReFlex event conditions, (4) issued back as batched network/storage calls, (5) passed through the request-cost scheduler, (6) submitted on the NVMe SQ; the completion is (7) picked up from the NVMe CQ and (8) sent back through TCP/IP via the network TX queue]
Results: Local ≈ Remote Latency
[Figure: p95 read latency (µs) vs. IOPS (thousands) for Local, ReFlex, and Linux/libaio, with 1 and 2 threads (1T/2T) on the Flash server]
• Using 1 core on the Flash server: ReFlex serves 850K IOPS/core over TCP vs. 75K IOPS/core for Linux
• Latency: Local Flash 78 µs, ReFlex 99 µs, Linux 200 µs
• Using 2 cores, ReFlex saturates the Flash device
Results: Performance Isolation
• Tenants A & B: latency-critical; Tenants C & D: best-effort
• Without the scheduler, the latency and IOPS QoS for A and B are violated
• The scheduler rate-limits the best-effort tenants to enforce SLOs
[Figures: per-tenant IOPS (thousands) and p95 read latency (µs) with the I/O scheduler disabled vs. enabled, for Tenants A (100% read), B (80% read), C (95% read), and D (25% read), shown against the latency SLO and Tenants A and B's IOPS SLOs]
Results: Scalability
§ ReFlex scales to multiple cores (tested with up to 12 cores)
§ ReFlex scales to thousands of tenants and connections
 • With each tenant issuing 100 1kB reads/sec, ReFlex can manage 5,000 tenants on a single core (5,000 × 100 = 500K IOPS, within the 850K IOPS/core shown earlier)
How can I use ReFlex?
§ ReFlex is open source: www.github.com/stanford-mast/reflex
§ Two versions of ReFlex are available:
 1. Kernel implementation:
  – Provides a protected network-storage data plane (user application not trusted)
  – Builds on hardware support for virtualization
  – Requires the Dune kernel module for direct access to hardware and memory management in Linux
 2. User space implementation:
  – Network-storage data plane runs as a user space application (kernel bypass)
  – Portable implementation, tested on Amazon Web Services i3 instances
ReFlex Clients
§ ReFlex exposes a logical block interface
§ ReFlex client implementations available:
 1. Block I/O client → for issuing raw network block I/O requests to ReFlex
 2. Remote block device driver → for legacy Linux applications (see the sketch below)
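With the remote block device driver, a legacy Linux application reads from ReFlex using ordinary block I/O system calls. A minimal sketch; the device path /dev/reflex0 is hypothetical and depends on how the driver is configured.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];

    /* Open the remote Flash block device exposed by the driver
     * (hypothetical path; depends on local configuration). */
    int fd = open("/dev/reflex0", O_RDONLY);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    /* Read one 4kB block at offset 0, exactly as with a local disk;
     * the driver turns this into a network block I/O to ReFlex. */
    ssize_t n = pread(fd, buf, sizeof buf, 0);
    if (n < 0) { perror("pread"); close(fd); return EXIT_FAILURE; }

    printf("read %zd bytes from remote Flash\n", n);
    close(fd);
    return EXIT_SUCCESS;
}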
Control Plane for ReFlex
§ ReFlex provides an efficient data plane for remote access
§ Bring your own control plane or distributed storage system:
 • A global control plane allocates Flash capacity & bandwidth across the cluster
 • The control plane can be a cluster manager or a distributed storage system (e.g. a distributed file system or database)
§ Example: a ReFlex storage tier for the Crail distributed file system
 • Spark applications can use ReFlex as a remote Flash storage tier, managed by the Crail distributed data store
Example Use-Case
§ Big data analytics frameworks such as Spark store a variety of data types in memory, Flash, and disk
§ Crail is a distributed storage system that manages and provides high-performance access to data across multiple storage tiers (www.crail.io)
[Diagram: Spark (SQL, Streaming, MLlib, GraphX) stores shuffle data, input/output files, broadcast data, and RDDs across Crail-managed tiers]
Example Use-Case
§ Spark executors often use local Flash to store temporary data (e.g. shuffle data)
§ Spark applications often saturate CPU resources before they can saturate NVMe Flash IOPS
 → Local Flash bandwidth is underutilized
[Diagram: four Spark executors, each with its own compute, DRAM, and local NVMe Flash]
Example Use-Case
§ ReFlex provides a shared remote Flash tier for big data analytics
 • Improves resource utilization while offering local Flash performance
[Diagram: four Spark executors (compute + DRAM) access two ReFlex NVMe Flash servers over a high-bandwidth Ethernet network]
Summary
§ ReFlex enables Flash disaggregation
§ Performance: remote ≈ local
§ Commodity networking, low CPU overhead
§ QoS on shared Flash with I/O scheduler