ReFlex: Remote Flash at the Performance of Local Flash
Ana Klimovic, Heiner Litz, Christos Kozyrakis
Flash Memory Summit 2017, Santa Clara, CA
Stanford MAST
Flash in Datacenters
§ NVMe Flash
 • 1,000x higher throughput than disk (1M IOPS)
 • 100x lower latency than disk (50-70 µs)
§ But Flash is often underutilized due to imbalanced resource requirements
Example Datacenter Flash Use-Case
[Diagram: clients connect over TCP/IP to an application tier of app servers; the app tier issues get(k)/put(k,val) requests to a datastore tier running a key-value store service; each tier spans software and hardware resources: CPU, RAM, Flash, NIC]
Imbalanced Resource Utilization
Sample utilization of Facebook servers hosting a Flash-based key-value store over 6 months
[Figure: server resource utilization over time]
Flash capacity and bandwidth are underutilized for long periods of time
[EuroSys'16] Flash storage disaggregation. Ana Klimovic, Christos Kozyrakis, Eno Thereska, Binu John, Sanjeev Kumar.
Solution: Remote Access to Storage
§ Improve resource utilization by sharing NVMe Flash between remote tenants
§ There are 3 main concerns:
 1. Performance overhead for remote access
 2. Interference on shared Flash
 3. Flexibility of Flash usage
Issue 1: Performance Overhead
[Figure: p95 read latency (µs) vs. IOPS (thousands) for 4kB random reads on one core: local Flash vs. iSCSI vs. libaio+libevent]
• Remote access costs a 4x throughput drop and a 2x latency increase
• Traditional network/storage protocols and Linux I/O libraries (e.g. libaio, libevent) have high overhead
Issue 2: Performance Interference
[Figure: p95 read latency (µs) vs. total IOPS (thousands) for workloads ranging from 50% to 100% reads]
• Writes impact read tail latency
• Latency depends on IOPS load
To share Flash, we need to enforce performance isolation
Issue 3: Flexibility
§ Flexibility is key
§ Want to allocate Flash capacity and bandwidth for applications:
 • On any machine in the datacenter
 • At any scale
 • Accessed using any network-storage protocol
How does ReFlex achieve high performance?
Linux vs. ReFlex
[Diagram: in Linux, the remote storage application sits in user space above the kernel's filesystem/block I/O layers and device driver, which own the network interface and Flash storage; in ReFlex, the application and the ReFlex data plane run in user space with direct access to the NIC and Flash, alongside a separate control plane]
• Remove SW bloat by separating the control and data planes
• Direct access to hardware via DPDK (network) and SPDK (NVMe); one data plane per CPU core
• Polling instead of interrupts (no IRQs on the data path)
• Run-to-completion processing
• Adaptive batching
• Zero-copy, device-to-device data transfers
A sketch of the resulting event loop follows below.
[ASPLOS'17] ReFlex: Remote Flash ≈ Local Flash. Ana Klimovic, Heiner Litz, Christos Kozyrakis.
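These techniques imply a particular event-loop structure. Below is a minimal sketch of one per-core data plane, under stated assumptions: net_poll_requests, nvme_submit, nvme_poll_completions, and net_send_response are illustrative stubs, not the actual ReFlex, DPDK, or SPDK APIs. The loop polls rather than blocking on interrupts, drains an adaptively sized batch, and runs each request to completion.

/* Hypothetical sketch of a per-core polling, run-to-completion data
 * plane loop in the spirit of ReFlex; not the actual implementation. */
#include <stddef.h>
#include <stdint.h>

#define MAX_BATCH 64                 /* upper bound for adaptive batching */

typedef struct {
    uint64_t lba;                    /* logical block address */
    uint32_t num_blocks;
    int      is_write;
} io_req;

/* Illustrative stubs: a real data plane would wrap a poll-mode NIC
 * driver (DPDK) and a user-space NVMe driver (SPDK) here. */
static size_t net_poll_requests(io_req *out, size_t max) { (void)out; (void)max; return 0; }
static void   nvme_submit(const io_req *r) { (void)r; }
static size_t nvme_poll_completions(io_req *out, size_t max) { (void)out; (void)max; return 0; }
static void   net_send_response(const io_req *r) { (void)r; }

static void data_plane_loop(void)
{
    io_req batch[MAX_BATCH];

    for (;;) {                       /* poll continuously; no interrupts */
        /* Adaptive batching: drain whatever arrived since the last
         * iteration, up to MAX_BATCH, rather than a fixed batch size. */
        size_t n = net_poll_requests(batch, MAX_BATCH);
        for (size_t i = 0; i < n; i++)
            nvme_submit(&batch[i]);  /* run to completion: each request is
                                        parsed and submitted before the
                                        core polls for more work */

        /* Completions are polled too; responses would go straight from
         * device buffers to the NIC (zero-copy device to device). */
        size_t c = nvme_poll_completions(batch, MAX_BATCH);
        for (size_t i = 0; i < c; i++)
            net_send_response(&batch[i]);
    }
}

int main(void) { data_plane_loop(); return 0; }

One such loop is pinned to each CPU core, so cores share nothing and scale independently.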
How does ReFlex enable performance isolation?
§ Request-cost-based scheduling
§ Determine the impact of tenant A on the tail latency and IOPS of tenant B
§ Control plane assigns each tenant a quota
§ Data plane enforces quotas through throttling (see the token-bucket sketch below)
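A minimal sketch of quota enforcement via a per-tenant token bucket; token_bucket, refill, and admit are hypothetical names, not ReFlex's actual scheduler interface. The control plane sets the refill rate; the data plane debits tokens per request.

#include <stdbool.h>

/* Hypothetical per-tenant token bucket: the control plane assigns
 * the rate (the quota); the data plane debits it per request. */
typedef struct {
    double tokens;   /* tokens currently available */
    double rate;     /* quota: tokens replenished per second */
    double burst;    /* cap on accumulated tokens */
} token_bucket;

/* Called periodically by the data plane with the elapsed time. */
void refill(token_bucket *b, double elapsed_sec)
{
    b->tokens += b->rate * elapsed_sec;
    if (b->tokens > b->burst)
        b->tokens = b->burst;
}

/* Returns true if the request may be issued now; otherwise the
 * data plane throttles it until enough tokens accumulate. */
bool admit(token_bucket *b, double cost)
{
    if (b->tokens < cost)
        return false;
    b->tokens -= cost;
    return true;
}

The cost argument is where request cost modeling (next slide) comes in: reads and writes debit different numbers of tokens.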
Request Cost Modeling
[Figures: p95 read latency (µs) vs. total IOPS (thousands), and vs. weighted IOPS (×10³ tokens/s), for workloads ranging from 50% to 100% reads]
• For this device: a write costs 10x a read
• Weighting requests by cost compensates for read-write asymmetry (see the sketch below)
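Given the calibration above, the per-request cost for this device might be expressed as follows; this is a sketch, and the 10:1 weight is the measured value for this particular device, not a universal constant.

/* Device-specific request costs in tokens, from offline profiling.
 * On the device profiled above, a write costs ~10x a read. */
#define READ_COST   1.0
#define WRITE_COST 10.0

double request_cost(int is_write)
{
    return is_write ? WRITE_COST : READ_COST;
}

/* Example: a tenant issuing 100K reads/s and 50K writes/s consumes
 * 100K*1 + 50K*10 = 600K tokens/s of weighted device capacity. */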
Request Cost Based Scheduling
Tenant A SLO: 1 ms tail latency, 200K IOPS
• At a 1 ms tail-latency SLO, the device sustains up to 510K IOPS
• Tenant A reserves 200K IOPS, leaving 310K IOPS of slack
Request Cost Based Scheduling
Tenant B SLO: 1.5 ms tail latency, 100K IOPS
• Tenant B's reservation is added to Tenant A's (1 ms tail latency, 200K IOPS)
• Against the 510K IOPS the device sustains at the 1 ms SLO, reserving 200K + 100K IOPS leaves 210K IOPS of slack
Request Cost Based Scheduling
Tenant C SLO: 0.5 ms tail latency, 200K IOPS
• At Tenant C's stricter 0.5 ms tail-latency SLO, the device sustains only 375K IOPS
• Tenants A (200K IOPS) and B (100K IOPS) already reserve 300K IOPS, leaving only 75K of slack
• Tenant C cannot register since the system cannot satisfy its SLO (see the admission-control sketch below)
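The three registration steps above amount to a simple admission test: at the strictest tail-latency SLO across all tenants (including the candidate), the device must sustain the sum of all reserved IOPS. A sketch follows; device_iops_at is a hypothetical stand-in for a lookup into the measured latency-vs-throughput profile, and the numbers in the stub come from the slides above.

#include <stdbool.h>
#include <stddef.h>

typedef struct {
    double latency_slo_ms;   /* tail-latency SLO */
    double iops_slo;         /* reserved (weighted) IOPS */
} tenant_slo;

/* Crude stand-in for the measured device profile shown earlier:
 * ~510K IOPS at a 1 ms p95 SLO, ~375K IOPS at 0.5 ms. */
double device_iops_at(double latency_slo_ms)
{
    return latency_slo_ms >= 1.0 ? 510e3 : 375e3;
}

bool can_admit(const tenant_slo *registered, size_t n, tenant_slo candidate)
{
    /* The strictest latency SLO bounds achievable device throughput. */
    double strictest = candidate.latency_slo_ms;
    double reserved  = candidate.iops_slo;

    for (size_t i = 0; i < n; i++) {
        if (registered[i].latency_slo_ms < strictest)
            strictest = registered[i].latency_slo_ms;
        reserved += registered[i].iops_slo;
    }
    return reserved <= device_iops_at(strictest);
}

With A (1 ms, 200K) and B (1.5 ms, 100K) registered, admitting C (0.5 ms, 200K) would require 500K IOPS at the 0.5 ms SLO, but the device sustains only 375K there, so C is rejected, matching the example above.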
Summary: Life of a Packet in ReFlex
[Diagram: a request is (1) received on the network RX queue, (2) processed by the TCP/IP stack, (3) delivered to the application through libReFlex event conditions, (4) issued back as batched network/storage calls, (5) passed through the request-cost scheduler, (6) submitted on the NVMe SQ; the completion is (7) picked up from the NVMe CQ and (8) sent back through TCP/IP via the network TX queue]
Results: Local ≈ Remote Latency
[Figure: p95 read latency (µs) vs. IOPS (thousands) for Local, ReFlex, and Linux/libaio, with 1 and 2 threads (1T/2T) on the Flash server]
• Using 1 core on the Flash server: ReFlex serves 850K IOPS/core over TCP vs. 75K IOPS/core for Linux
• Latency: Local Flash 78 µs, ReFlex 99 µs, Linux 200 µs
• Using 2 cores, ReFlex saturates the Flash device
Results: Performance Isolation
• Tenants A & B: latency-critical; Tenants C & D: best-effort
• Without the scheduler, the latency and IOPS QoS for A and B are violated
• The scheduler rate-limits the best-effort tenants to enforce SLOs
[Figures: per-tenant IOPS (thousands) and p95 read latency (µs) with the I/O scheduler disabled vs. enabled, for Tenants A (100% read), B (80% read), C (95% read), and D (25% read), shown against the latency SLO and Tenants A and B's IOPS SLOs]
Results: Scalability
§ ReFlex scales to multiple cores (tested with up to 12 cores)
§ ReFlex scales to thousands of tenants and connections
 • With each tenant issuing 100 1kB reads/sec, ReFlex can manage 5,000 tenants on a single core (5,000 × 100 = 500K IOPS, within the 850K IOPS/core shown earlier)
How can I use ReFlex?
§ ReFlex is open source: www.github.com/stanford-mast/reflex
§ Two versions of ReFlex are available:
 1. Kernel implementation:
  – Provides a protected network-storage data plane (user application not trusted)
  – Builds on hardware support for virtualization
  – Requires the Dune kernel module for direct access to hardware and memory management in Linux
 2. User space implementation:
  – Network-storage data plane runs as a user space application (kernel bypass)
  – Portable implementation, tested on Amazon Web Services i3 instances
ReFlex Clients
§ ReFlex exposes a logical block interface
§ ReFlex client implementations available:
 1. Block I/O client → for issuing raw network block I/O requests to ReFlex
 2. Remote block device driver → for legacy Linux applications (see the sketch below)
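With the remote block device driver, a legacy Linux application reads from ReFlex using ordinary block I/O system calls. A minimal sketch; the device path /dev/reflex0 is hypothetical and depends on how the driver is configured.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];

    /* Open the remote Flash block device exposed by the driver
     * (hypothetical path; depends on local configuration). */
    int fd = open("/dev/reflex0", O_RDONLY);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    /* Read one 4kB block at offset 0, exactly as with a local disk;
     * the driver turns this into a network block I/O to ReFlex. */
    ssize_t n = pread(fd, buf, sizeof buf, 0);
    if (n < 0) { perror("pread"); close(fd); return EXIT_FAILURE; }

    printf("read %zd bytes from remote Flash\n", n);
    close(fd);
    return EXIT_SUCCESS;
}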
Control Plane for ReFlex
§ ReFlex provides an efficient data plane for remote access
§ Bring your own control plane or distributed storage system:
 • A global control plane allocates Flash capacity & bandwidth across the cluster
 • The control plane can be a cluster manager or a distributed storage system (e.g. a distributed file system or database)
§ Example: a ReFlex storage tier for the Crail distributed file system
 • Spark applications can use ReFlex as a remote Flash storage tier, managed by the Crail distributed data store
Example Use-Case
§ Big data analytics frameworks such as Spark store a variety of data types in memory, Flash, and disk
§ Crail is a distributed storage system that manages and provides high-performance access to data across multiple storage tiers (www.crail.io)
[Diagram: Spark (SQL, Streaming, MLlib, GraphX) stores shuffle data, input/output files, broadcast data, and RDDs across Crail-managed tiers]
Example Use-Case
§ Spark executors often use local Flash to store temporary data (e.g. shuffle data)
§ Spark applications often saturate CPU resources before they can saturate NVMe Flash IOPS
 → Local Flash bandwidth is underutilized
[Diagram: four Spark executors, each with its own compute, DRAM, and local NVMe Flash]
Example Use-Case
§ ReFlex provides a shared remote Flash tier for big data analytics
 • Improves resource utilization while offering local Flash performance
[Diagram: four Spark executors (compute + DRAM) access two ReFlex NVMe Flash servers over a high-bandwidth Ethernet network]
Summary
§ ReFlex enables Flash disaggregation
§ Performance: remote ≈ local
§ Commodity networking, low CPU overhead
§ QoS on shared Flash with I/O scheduler