FVM: FPGA-assisted Virtual Device Emulation
for Fast, Scalable, and Flexible Storage Virtualization
Dongup Kwon1,2, Junehyuk Boo1, Dongryeong Kim1, and Jangwoo Kim1,2
1 Department of Electrical and Computer Engineering, Seoul National University; 2 Memory Solutions Lab, Samsung Semiconductor Inc.
Background: NVM Express (NVMe) Storage
• Provide high I/O performance through PCIe
− Utilize multiple I/O submission/completion queue (SQ/CQ) pairs
− Enable highly parallel I/O processing on multiple CPU cores
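A minimal, self-contained sketch of the per-core queue-pair idea described above. The structures and names (NvmeCmd, QueuePair, sq_doorbell) are illustrative placeholders rather than actual NVMe driver code, and the doorbell is modeled as a plain variable instead of an MMIO register on the SSD.

#include <array>
#include <cstdint>

constexpr int kQueueDepth = 32;

struct NvmeCmd { uint8_t bytes[64]; };          // 64B submission queue entry
struct NvmeCpl { uint8_t bytes[16]; };          // 16B completion queue entry

// One SQ/CQ pair; in a real driver the doorbells are MMIO registers on the SSD.
struct QueuePair {
    std::array<NvmeCmd, kQueueDepth> sq{};
    std::array<NvmeCpl, kQueueDepth> cq{};
    uint32_t sq_tail = 0;
    uint32_t cq_head = 0;
    volatile uint32_t sq_doorbell = 0;          // stand-in for the SQ tail doorbell
};

// Each core submits to its own queue pair, so cores never contend on a lock.
void submit_on_core(QueuePair& qp, const NvmeCmd& cmd) {
    qp.sq[qp.sq_tail] = cmd;                    // 1) place the command in the SQ
    qp.sq_tail = (qp.sq_tail + 1) % kQueueDepth;
    qp.sq_doorbell = qp.sq_tail;                // 2) ring the doorbell with the new tail
}

int main() {
    std::array<QueuePair, 4> per_core_qp{};     // e.g., 4 cores -> 4 I/O queue pairs
    NvmeCmd read_cmd{};                         // opcode/LBA/PRP fields omitted here
    submit_on_core(per_core_qp[0], read_cmd);   // core 0 never touches core 1's queue
    return 0;
}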
[Figure: host memory holds one admin SQ/CQ pair and per-core I/O SQ/CQ pairs (core 1 … core n), all served by the NVMe storage device]
NVMe storage is widely used in modern datacenters to accelerate I/O
Background: HW-assisted NVMe Virtualization
• Utilize single-root I/O virtualization (SR-IOV)
− Create multiple physical/virtual functions (PFs/VFs) internally
− Assign each VF to a VM exclusively and allow direct access to HW
− Assignable resources: virtual queues (VQs), virtual interrupts (VIs)
[Figure: an SR-IOV NVMe device exposes a PF to host software for management and one VF per VM; each VF bundles virtual queues (VQs), virtual interrupts (VIs), and a namespace]
SR-IOV can provide near-native storage performance to multiple VMs
Background: Limitations of SR-IOV
• Limited VM-management features and use cases
− No interposition layer between VMs and storage
− Inflexible storage resource allocation
• Limited compatibility
− Vendor-specific and hard-wired implementations
Category / Feature
− Storage configuration: Consolidation, Aggregation, Caching
− Resource management: Replication, Throttling
− Administration: Migration, Metering
SR-IOV loses the flexibility to implement critical VM-management features
Outline
• Background
• Motivation
− SW-based host sidecore / on-device sidecore approaches
• FVM: FPGA-assisted Storage Virtualization
• Evaluation
• Conclusion
Alternative #1: SW-based Host Sidecore Approach
• Dedicate CPU cores to emulate virtual devices
• Accelerate storage virtualization layers
- Avoid expensive traps to a hypervisor and cache pollution
[Figure: a VM's NVMe device driver shares NVMe SQ/CQ, doorbell, and DMA-buffer memory with host software, where dedicated x86 sidecores run a user-level NVMe driver, I/O emulation, and the Linux block I/O layer]
Host sidecore approaches accelerate virtualization by dedicating CPU cores
Limitations of Host Sidecores
• Expensive and non-scalable virtualization
- Polling guest I/O activities + indirect interrupt injection
- Demand 40%-60% more CPU resources than native I/O operations
- Limited VM performance or scalability due to lack of CPU resources
[Figure: native I/O vs. virtualized I/O on the same SSDs; with sidecores, several x86 cores are diverted from VM applications to I/O emulation]
Host sidecore approaches must pay this expensive virtualization tax
Alternative #2: On-device Sidecore Approach
• Offload a virtualization layer to SoC cores
− Emulate guest I/O via SoC cores in other peripheral devices
• Save the host resources required for storage virtualization
[Figure: guest I/O is emulated by SoC cores and runtime SW on a SmartNIC, which accesses the VM's NVMe SQ/CQ, doorbell, and DMA buffer over PCIe]
On-device sidecores can reduce the virtualization tax
Limitations of On-device Sidecore
• Weak computing power of SoC cores
− Cannot support a large number of VMs or virtual/physical devices
[Figure: a single SmartNIC's SoC cores can emulate one VM and one NVMe SSD, but cannot keep up with a large number of VMs and SSDs]
On-device sidecores suffer from limited performance and scalability
Design Goals
[Figure: design-goal comparison of SR-IOV, CPU sidecore, SoC sidecore, and FVM across (1) performance, (2) host efficiency, (3) scalability, (4) flexibility, and (5) compatibility]
Outline
• Background
• Motivation
• FVM: FPGA-assisted Storage Virtualization
• Evaluation
• Conclusion
Key Ideas and Benefits
[Figure: FVM architecture. A VM's NVMe device driver accesses SR-IOV'd SQ/CQ pairs and doorbells exposed by the FVM engine (an FPGA board), which performs I/O emulation, VM management, and gPA → hPA translation; the host-side FVM engine driver pins hugepages and maintains the gPA → hPA table, and the engine logic is written with high-level synthesis (HLS)]
• FPGA-assisted virtualization
− HW-based
− Host-decoupled
− Scalable
− Flexible
− Programmable
FVM enables fast, scalable, and flexible storage virtualization
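The gPA → hPA table in the figure above is what lets the engine hand guest buffers directly to SSDs. A minimal sketch of that lookup, assuming hugepage-granular mappings; the table layout and names (GpaMap, translate) are illustrative, not FVM's actual data structures.

#include <cstdint>
#include <optional>
#include <unordered_map>

constexpr uint64_t kHugePage = 2ull << 20;       // assume 2MB hugepage-backed guest memory

// Guest-physical hugepage frame -> host-physical hugepage frame.
using GpaMap = std::unordered_map<uint64_t, uint64_t>;

// Rewrite a guest-physical address (e.g., a PRP entry in an NVMe command)
// into the host-physical address the SSD can actually DMA to.
std::optional<uint64_t> translate(const GpaMap& table, uint64_t gpa) {
    auto it = table.find(gpa / kHugePage);
    if (it == table.end()) return std::nullopt;  // unmapped: reject the command
    return it->second * kHugePage + gpa % kHugePage;
}

int main() {
    GpaMap table = {{0, 0x40}, {1, 0x41}};       // two pinned hugepages (illustrative)
    auto hpa = translate(table, 0x00201000);     // gPA that falls into guest hugepage 1
    return hpa ? 0 : 1;
}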
Key Idea #1: HW-level Virtualization Layer
• Utilize a decoupled FPGA for device emulation
• Allow direct access to the FVM engine from a VM environment
− Save host resources and enable fast virtualized I/O paths
[Figure: each VM accesses its own VF and I/O-emulation instance on the FVM engine directly, while host software keeps the PF for management; the engine drives many SSDs]
FVM emulates virtual storage devices without software arbitration
Key Idea #2: Scalable Virtualization Layer
• Create many front-end / back-end resources
− Front-end: FVM core – Poll and emulate guest I/O operations
− Back-end: NVMe interface – Manage and control SSDs through PCIe
− Can scale with a large number of VMs and SSDs
[Figure: inside the FVM engine, front-end resources (SR-IOV'd NVMe doorbells and FVM cores, one per VM queue) connect through a crossbar to back-end resources (NVMe interfaces and SQs, one per SSD), so the design scales to many VMs and many SSDs]
FVM can scale up the virtualization resources with a target storage system
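A tiny sketch of how front-end and back-end resources can scale independently and be joined by a crossbar. The static routing table below (one entry per VM queue) is an illustrative assumption, not FVM's actual crossbar logic.

#include <cstdint>
#include <vector>

// Front-end: one FVM core instance per guest I/O queue (VM id, queue id).
struct FvmCoreId  { uint16_t vm; uint16_t queue; };
// Back-end: one NVMe interface instance per physical SSD.
struct NvmeIntfId { uint16_t ssd; };

// The crossbar is just a routing table here: adding VMs adds rows,
// adding SSDs adds columns, and the two sides grow independently.
struct Crossbar {
    std::vector<std::vector<NvmeIntfId>> route;   // route[vm][queue] -> SSD interface

    NvmeIntfId lookup(FvmCoreId src) const {
        return route[src.vm][src.queue];
    }
};

int main() {
    // 2 VMs x 2 queues mapped onto 3 SSDs (illustrative placement).
    Crossbar xbar{{{ {0}, {1} }, { {1}, {2} }}};
    NvmeIntfId dst = xbar.lookup({1, 0});          // VM 1, queue 0 -> SSD 1
    return dst.ssd == 1 ? 0 : 1;
}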
Key Idea #3: Direct Device-Control Mechanism
• Implement NVMe interfaces on FVM engine
• Issue and handle NVMe commands / completions
− Interact with NVMe storage devices at the hardware level
[Figure: the FVM engine's NVMe interface controls NVMe storage over PCIe P2P through a PCIe switch (steps ①–⑥): it DMA-writes commands into SQs, writes the SQ doorbell, lets the SSD DMA-read/write data against host DRAM or engine buffers, and consumes the DMA-written completion before updating the CQ doorbell]
FVM manages physical NVMe devices through PCIe P2P
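What makes this P2P path work is that the queue and data-buffer addresses handed to the SSD must be PCIe bus addresses of FVM-engine memory rather than host DRAM. A minimal sketch of that idea; the fixed BAR base and helper name below are assumptions for illustration, not FVM's actual address map.

#include <cstdint>

// Illustrative BAR window: engine-local buffer offsets are exposed to the
// PCIe fabric at some bus address, so the SSD can DMA into the FPGA directly.
constexpr uint64_t kEngineBarBusAddr = 0xf000000000ull;  // assumed bus address
constexpr uint64_t kEngineBarSize    = 1ull << 30;       // assumed 1GB window

// Convert an engine-local buffer offset into the PCIe bus address that is
// written into an NVMe command's PRP entry (or into an SQ/CQ base address).
uint64_t to_p2p_bus_addr(uint64_t engine_offset) {
    // A real design would also account for ACS/IOMMU routing through the switch.
    return (engine_offset < kEngineBarSize) ? kEngineBarBusAddr + engine_offset : 0;
}

int main() {
    uint64_t prp1 = to_p2p_bus_addr(0x100000);   // 4KB-aligned engine buffer at offset 1MB
    return prp1 != 0 ? 0 : 1;
}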
Key Idea #4: HLS-based Design Flow
• Support C/C++ high-level languages
• Allow users to extend virtualization features easily
• Example features
− Consolidation
− Caching
− Replication
− Throttling
− Direct (D2D) copy
Category / Feature (FVM lines of code)
− Storage configuration: Consolidation (40), Caching (220)
− Resource management: Replication (15), Throttling (70)
− Administration: Direct copy (570)
FVM supports easy VM management and feature programmability
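A sketch of what a feature written in this flow can look like: a streaming stage that sits between an FVM core and the NVMe interface and decides, per command, whether to forward it. The hls::stream type and PIPELINE pragma are standard Xilinx HLS constructs; the compact command descriptor and function name are illustrative assumptions, not FVM's real interfaces.

#include <ap_int.h>
#include <hls_stream.h>

// Compact per-command descriptor passed between pipeline stages (illustrative).
struct CmdDesc {
    ap_uint<16> vm_id;
    ap_uint<32> num_blocks;
    ap_uint<64> slba;
};

// One pipelined stage: forward commands only while a per-call budget lasts.
// Real features (caching, replication, throttling, copy) follow the same shape.
void feature_stage(hls::stream<CmdDesc>& in,
                   hls::stream<CmdDesc>& out,
                   ap_uint<32> budget_blocks) {
#pragma HLS PIPELINE II=1
    if (!in.empty()) {
        CmdDesc cmd = in.read();
        if (cmd.num_blocks <= budget_blocks) {
            out.write(cmd);             // pass the command downstream
        }
        // else: hold or drop it; a real throttler would bank tokens instead
    }
}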
Outline
• Background
• Motivation
• FVM: FPGA-assisted Storage Virtualization
− Key ideas / end-to-end I/O paths
• Evaluation
• Conclusion
End-to-End Submission Path (1/2)
[Figure: steps ①–⑧ of the submission path — the VM's NVMe device driver posts NVMe commands to its SQ and writes the SR-IOV'd doorbell on the FVM engine; an FVM core polls the doorbell, pulls the commands and data via the DMA buffer, and forwards them through the crossbar to the NVMe interface and the SSD's SQ/CQ]
End-to-End Submission Path (2/2)
[Figure: the same submission-path diagram, continued through the NVMe interface, the back-end SQ/CQ, and the SSD]
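Putting the submission-path steps into one loop-shaped sketch: the FVM core spins on the guest's doorbell value, pulls new SQ entries, translates their guest addresses, and hands them to the back-end. The helpers (read_guest_doorbell, fetch_sqe, gpa_to_hpa, forward_to_backend) are trivial stand-ins for illustration, not FVM's code.

#include <cstdint>

struct NvmeCmd { uint64_t prp1 = 0; uint64_t prp2 = 0; /* opcode, LBA, ... */ };

// Trivial stand-ins for the real engine blocks (assumed interfaces).
static uint32_t guest_doorbell = 0;                              // SR-IOV'd doorbell register
uint32_t read_guest_doorbell(int /*vm*/) { return guest_doorbell; }
NvmeCmd  fetch_sqe(int /*vm*/, uint32_t /*idx*/) { return {}; }  // DMA-read one SQ entry
uint64_t gpa_to_hpa(int /*vm*/, uint64_t gpa) { return gpa; }    // gPA -> hPA table lookup
void     forward_to_backend(int /*vm*/, const NvmeCmd&) {}       // crossbar -> NVMe interface -> SSD

// One FVM core's submission-path step for a single guest queue:
// poll the doorbell, pull new SQ entries, translate addresses, forward them.
void fvm_core_submission_step(int vm, uint32_t& head) {
    uint32_t tail = read_guest_doorbell(vm);       // guest wrote a new SQ tail
    while (head != tail) {
        NvmeCmd cmd = fetch_sqe(vm, head);         // pull the 64B command from guest memory
        cmd.prp1 = gpa_to_hpa(vm, cmd.prp1);       // rewrite data pointers so the SSD
        cmd.prp2 = gpa_to_hpa(vm, cmd.prp2);       //   can DMA the guest buffer directly
        forward_to_backend(vm, cmd);               // back-end writes the SSD's SQ and doorbell
        ++head;
    }
}

int main() {
    uint32_t head = 0;
    guest_doorbell = 2;                            // pretend the guest queued two commands
    fvm_core_submission_step(0, head);
    return head == 2 ? 0 : 1;
}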
End-to-End Completion Path (1/2)
[Figure: steps ①–⑨ of the completion path — the SSD DMA-writes data and completion entries, which the NVMe interface and FVM core relay through the crossbar back toward the VM's NVMe SQ/CQ as completions and interrupts]
End-to-End Completion Path (2/2)
[Figure: the same completion-path diagram, continued through the FVM core, the VM's NVMe SQ/CQ, and the interrupt delivered to the guest NVMe device driver]
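The completion path in the same sketch style: the engine consumes the SSD's completion, mirrors it into the guest's CQ, and delivers a virtual interrupt. The helper names and the interrupt call are illustrative assumptions, not FVM's interfaces.

#include <cstdint>

struct NvmeCpl { uint16_t status = 0; uint16_t sq_head = 0; uint16_t cid = 0; };

// Trivial stand-ins for engine blocks (assumed interfaces, not FVM's code).
bool poll_backend_cq(int /*ssd*/, NvmeCpl& out) { out = {}; return true; }  // new completion?
void write_guest_cqe(int /*vm*/, const NvmeCpl&) {}   // DMA-write the entry into the guest's CQ
void ring_ssd_cq_doorbell(int /*ssd*/) {}             // tell the SSD the entry was consumed
void inject_virtual_interrupt(int /*vm*/) {}          // e.g., an MSI-X write toward the VM

// One completion-path step for a (guest queue, SSD) pair.
void fvm_core_completion_step(int vm, int ssd) {
    NvmeCpl cpl;
    if (poll_backend_cq(ssd, cpl)) {       // the SSD DMA-wrote data and a completion
        write_guest_cqe(vm, cpl);          // mirror it into the VM's CQ in host memory
        ring_ssd_cq_doorbell(ssd);         // release the back-end CQ slot
        inject_virtual_interrupt(vm);      // the guest driver then reaps its CQ
    }
}

int main() {
    fvm_core_completion_step(0, 0);
    return 0;
}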
VM-management Feature: Throttling
[Figure: a token generator and a per-VM token bucket sit with the FVM cores, gating how many guest commands are forwarded through the crossbar to the NVMe interface and the SSD]
// Periodically refill the bucket from the token generator
if refill_token then
    token_bucket += token
...
// Forward a guest command only when enough tokens remain
if size < token_bucket then
    submit_cmd(cmd)
    token_bucket -= size
...
VM-management Feature: D2D copy
[Figure: a command generator on the FVM cores splits a guest copy request into read/write NVMe command pairs that move data between devices through 4KB intermediate buffers on the engine, without involving the host]
...
// Turn one guest copy request into chunked NVMe commands
cmd = generate_cmd(src, dst)
buf = alloc_buffer()                          // intermediate buffer on the engine
cmd_list = split_cmd(cmd, buf.size, buf.addr) // split into buffer-sized pieces
...
Implementation
• Prototype
− 2x 12-core Xeon 5118 / 256GB
− 5x 480GB NVMe SSDs
− Xilinx Alveo U280 Card
• Based on open-source SW frameworks
− Linux kernel v5.3
− KVM/QEMU v3.0
− SPDK vhost-nvme v20.01
[Photo: prototype server with 2x Intel Xeon Gold 5118, 1x Xilinx Alveo U280, and 5x Intel Optane 900p 480GB SSDs]
Outline
• Background
• Motivation
• FVM: FPGA-assisted Storage Virtualization
• Evaluation
• Conclusion
Evaluation Methodology
• FVM vs host sidecore, passthrough (ideal perf.)
• Random I/O performance from VMs
− FIO random-read/write/rw (4 threads, 32 queue depth)
− I/O throughput, host CPU usage measurement
• RocksDB performance with multiple threads from VMs
− (A) 50% read, (B) 95% read, (C) read-only, (D) read-latest, (E) short-range, (F) read-modify-write workloads
− RocksDB operation throughput measurement
• Scalability test with multiple VMs and SSDs
Random I/O Performance
• Random I/O with limited CPU usage (4 cores)
− Host sidecore: incur CPU contention between VMs and sidecores
− Passthrough/FVM: decouple virtualization from host resources
[Chart: normalized random-I/O throughput of native, vhost-nvme sidecore, passthrough, and FVM for rand-read, rand-write, and rand-rw; the top bars reach 2.7 GB/s, 2.8 GB/s, and 2.7 GB/s]
FVM achieves 1.37x – 1.42x higher I/O throughput than sidecore approaches
RocksDB Operation Throughput
• RocksDB workloads with 4-8 CPU cores
− Host sidecore: limit VM performance due to lack of host resources
− FVM: save host resources and offer more compute power to VMs
[Chart: RocksDB speedup over the sidecore baseline for workloads A–F, with 4 or 8 total CPU cores and 1–4 of them dedicated as vhost sidecores (12.5%–50% sidecore utilization); FVM improves throughput by up to ~70%]
FVM achieves higher throughput by offering more CPUs to VMs; with FVM, host CPUs are better spent on VM workloads and user applications
Scalability Test
• An increasing # of VMs (1 SSD/4 cores/VM)
− Host sidecore: pay the virtualization tax with an increasing # of VMs
− FVM: scale up the virtualization resources with a target # of VMs/SSDs
[Chart: aggregate throughput (GB/s) of native, vhost-nvme sidecore, passthrough, and FVM as the number of VMs and SSDs grows from 1 to 4; FVM reaches 9.5 GB/s]
FVM provides scalable performance by adding more FVM cores or FPGAs
Example Feature: D2D Copy
• Direct device-to-device copy through P2P
− Vs. SW-based indirect data copy
− Host resource saving: CPU, host memory / root complex BW
[Charts: D2D copy vs. SW-based indirect copy in normalized CPU usage, PCIe root-complex bandwidth (MB/s), and host memory bandwidth (MB/s); the indirect copy consumes host CPU, root-complex, and memory bandwidth that the P2P copy avoids]
Discussion & Conclusion
• Cost analysis
− CPU saving: 20 cores in a 64-core machine ≈ $2000 - $6400
− Small FPGA resource usage for FVM engine
• FVM: FPGA-assisted storage virtualization
− HW-based and scalable virtualization layer
− Direct device-control mechanism
− HLS-based design flow
• Implementation with off-the-shelf FPGA/SSD devices
− 1.37x – 1.42x higher I/O performance than sidecore approaches
− 9.5 GB/s aggregate throughput with 4 NVMe SSDs
Thank You!
FVM: FPGA-assisted Virtual Device Emulation for Fast, Scalable, and Flexible Storage Virtualization
Dongup Kwon, [email protected]