12th PARAVIRTUAL RDMA DEVICE - OpenFabrics Alliance · 12th ANNUAL WORKSHOP 2016 PARAVIRTUAL RDMA...
12th ANNUAL WORKSHOP 2016
PARAVIRTUAL RDMA DEVICE
Aditya Sarwade, Adit Ranadive, Jorgen Hansen, Bhavesh Davda, George Zhang, Shelley Gong
[ April 5th, 2016 ] VMware, Inc.
OpenFabrics Alliance Workshop 2016
MOTIVATION
[Figure: a socket application and an RDMA application in user space. The socket path runs through the kernel (Sockets API, Sockets, TCP, IPv4/IPv6, Device Driver, Network Device) with buffers and buffer headers along the way. The RDMA path uses the RDMA Verbs API and kernel bypass to reach the Host Channel Adapter. Transports shown: InfiniBand (InfiniBand switch), iWARP (Internet Wide Area RDMA Protocol) and RoCE (RDMA over Converged Ethernet) (Ethernet switch).]
RDMA Enables
§ OS bypass
§ Zero-copy
§ Low Latency (<1µs)
§ High Bandwidth
Why not PCI Passthrough?
§ No live migration support
§ Transport dependent
§ Needs an HCA
§ Cannot share non-SRIOV HCA
INTRODUCTION
• Paravirtual RDMA (PVRDMA) is a new PCIe virtual NIC
• Supports standard Verbs API
• Uses HCA for performance, but works without it
• Multiple virtual devices can share an HCA without SR-IOV
• Supports vMotion (live migration)!
ARCHITECTURE
• Exposes a dual function PCIe device to the guest
  • VMXNET3
  • RDMA (RoCE)
    • RDMA component reuses Ethernet properties from the paired NIC
• Plugs into the OFED stack in the VM
  • Provides verbs-level emulation
    • Guest kernel driver
    • User level library
• Operates over ESX RDMA stack (VMkernel)
  • GIDs generated by guest kernel registered with HCA
[Figure: Guest OS 1 stack, top to bottom: RDMA App with buffers, libvrdma and libibverbs, PVRDMA Driver, PVRDMA NIC, backed by PVRDMA Device Emulation.]
ARCHITECTURE (CONT.)
• Virtualize some hardware resources (like QPs and MRs)
  • Required for vMotion
  • Create corresponding physical resources on the HCA
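[Figure: the guest posts WQEs to a virtual QP (vQP, with SQ and RQ) and polls CQEs from a virtual CQ (vCQ). The emulation layer sits between these and the physical QP (SQ, RQ) and CQ on the HCA, which connects to the network. On post, virtual MRs/QPs are translated to physical; on poll, physical QPs are translated to virtual.]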
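The post and poll translations can be illustrated with a small model. This is a minimal sketch, not the actual PVRDMA emulation code: the class `EmulationLayer`, its method names, and the dictionary layout of WQEs and CQEs are all hypothetical, chosen only to show the direction of each mapping.

```python
from dataclasses import dataclass, field


@dataclass
class EmulationLayer:
    """Hypothetical model of virtual <-> physical resource ID translation."""
    qpn_v2p: dict = field(default_factory=dict)  # virtual QPN -> physical QPN
    qpn_p2v: dict = field(default_factory=dict)  # physical QPN -> virtual QPN
    key_v2p: dict = field(default_factory=dict)  # virtual MR key -> physical MR key

    def bind_qp(self, vqpn, pqpn):
        """Associate a guest-visible QPN with the QP created on the HCA."""
        old = self.qpn_v2p.get(vqpn)
        if old is not None:
            del self.qpn_p2v[old]  # drop the stale reverse mapping
        self.qpn_v2p[vqpn] = pqpn
        self.qpn_p2v[pqpn] = vqpn

    def bind_mr(self, vkey, pkey):
        """Associate a guest-visible MR key with the key issued by the HCA."""
        self.key_v2p[vkey] = pkey

    def post_send(self, vqpn, wqe):
        """Post path: rewrite virtual IDs in the WQE to physical ones."""
        phys_wqe = dict(wqe, lkey=self.key_v2p[wqe["lkey"]])
        return self.qpn_v2p[vqpn], phys_wqe

    def poll_cq(self, phys_cqe):
        """Poll path: rewrite the physical QPN in the CQE back to the virtual one."""
        return dict(phys_cqe, qpn=self.qpn_p2v[phys_cqe["qpn"]])
```

Because the guest only ever sees the virtual IDs, rebinding a virtual QPN to a different physical QPN (for example via a second `bind_qp` call) leaves the guest-visible state unchanged, which is the property the slide ties to vMotion.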
ARCHITECTURE (CONT.)
• Guest MR registered directly with the HCA
  • Guest PA converted to machine addresses
  • Zero-copy
[Figure: memory registration flow. Guest userspace: application buffer (guest VA, length). Guest kernel: guest PA list. Device emulation: MA (host PA) list. HCA: guest VA -> MA list.]
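The address conversion chain (guest VA to guest PA list to MA list) can be sketched as follows. This is an illustration under stated assumptions rather than the device's implementation: a 4 KiB page size is assumed, the page tables are plain dictionaries keyed by page number, and the names `register_mr` and `translate` are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def guest_va_to_gpa_pages(va, length, guest_page_table):
    """Guest kernel step: list the guest physical pages backing [va, va + length)."""
    first = va // PAGE_SIZE
    last = (va + length - 1) // PAGE_SIZE
    return [guest_page_table[vpn] * PAGE_SIZE for vpn in range(first, last + 1)]


def gpa_pages_to_ma(gpa_pages, host_map):
    """Device emulation step: convert guest physical pages to machine addresses."""
    return [host_map[gpa // PAGE_SIZE] * PAGE_SIZE for gpa in gpa_pages]


def register_mr(va, length, guest_page_table, host_map):
    """Build the 'guest VA -> MA list' record that would be handed to the HCA."""
    gpa_pages = guest_va_to_gpa_pages(va, length, guest_page_table)
    return {"va": va, "length": length,
            "ma_pages": gpa_pages_to_ma(gpa_pages, host_map)}


def translate(mr, va):
    """Resolve a guest VA inside the region to a machine address (no copy involved)."""
    if not mr["va"] <= va < mr["va"] + mr["length"]:
        raise ValueError("address outside the registered region")
    first_page_start = mr["va"] - mr["va"] % PAGE_SIZE
    index, offset = divmod(va - first_page_start, PAGE_SIZE)
    return mr["ma_pages"][index] + offset
```

Since the HCA ends up holding machine addresses for the guest's own pages, data moves directly to and from the application buffer, which is what makes the path zero-copy.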
CONTROL AND DATA PATH
[Figure: two guests (Guest OS 1 and Guest OS 2), each with an RDMA App and buffers, libvrdma and libibverbs, a PVRDMA Driver and a PVRDMA NIC. Each host runs PVRDMA Device Emulation, the ESXi RDMA Stack and the HCA Device Driver on top of an HCA. The HCAs are connected over RoCE (RDMA over Converged Ethernet). Legend: Control Path, Data Path.]
RDMA TRANSPORT SELECTION
• PVRDMA Transport Selection
  • Memcpy – RDMA between peers on same host
  • TCP – RDMA between peers without HCAs (slow path)
  • RDMA – Fast Path RDMA between peers with HCAs
• PVRDMA vMotion
  • Leverage transport selection to support vMotion of RDMA VMs
[Figure: ESX Host 1, ESX Host 2 and ESX Host 3 attached to a vSphere Distributed Switch. Adapter labels: HCA, HCA, NIC. Transport labels: RDMA, TCP, Memcpy.]
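The three-way transport choice can be expressed as a short decision function. This is a minimal sketch: the function name, its parameters, and the `Transport` enum are hypothetical, and it assumes that TCP is used whenever at least one peer lacks an HCA, which the slide does not spell out.

```python
from enum import Enum


class Transport(Enum):
    MEMCPY = "memcpy"  # peers on the same host
    TCP = "tcp"        # slow path, no HCA available
    RDMA = "rdma"      # fast path over HCAs


def select_transport(local_host, peer_host, local_has_hca, peer_has_hca):
    """Pick a transport for a connection between two PVRDMA endpoints."""
    if local_host == peer_host:
        return Transport.MEMCPY
    if local_has_hca and peer_has_hca:
        return Transport.RDMA
    # Assumption: fall back to TCP if either side has no HCA.
    return Transport.TCP
```

Re-running the same decision after a VM moves is one way to read "leverage transport selection to support vMotion": a connection that used memcpy before migration may need RDMA or TCP afterwards.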
vMOTION
§ Challenge:-
• Lots of RDMA state within hardware
• Physical resource IDs (like QPNs/MR keys) may change after migration
• Peers will not be aware of the new IDs
• Currently, no support to create resources with specified IDs
vMOTION
§ Current (partial) solution:-
• Emulation layer can get virtual to physical translations from peer
• Notify peer about vMotion and pause QP/CQ processing
• After vMotion resume QPs with the new translations
• Invisible to guest
• Can only work when both endpoints are VMs
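The pause, retranslate, and resume sequence can be modeled from the point of view of the peer's emulation layer. This is only a sketch of the idea: the class `PeerEmulation`, its methods, and the in-memory `sent` list standing in for the wire are hypothetical and say nothing about how the real notification protocol is implemented.

```python
class PeerEmulation:
    """Hypothetical peer-side emulation layer tracking a remote VM's QPs."""

    def __init__(self):
        self.remote_v2p = {}  # remote virtual QPN -> remote physical QPN
        self.paused = set()   # virtual QPNs whose processing is paused
        self.pending = []     # (virtual QPN, payload) held while paused
        self.sent = []        # (physical QPN, payload) handed to the transport

    def learn(self, vqpn, pqpn):
        """Record a virtual-to-physical translation obtained from the peer."""
        self.remote_v2p[vqpn] = pqpn

    def send(self, vqpn, payload):
        """Send to a remote virtual QPN; queue instead if that QP is paused."""
        if vqpn in self.paused:
            self.pending.append((vqpn, payload))
            return False
        self.sent.append((self.remote_v2p[vqpn], payload))
        return True

    def on_vmotion_start(self, vqpns):
        """The remote VM announced a vMotion: pause the affected QPs."""
        self.paused.update(vqpns)

    def on_vmotion_done(self, new_translations):
        """Install the new physical IDs, resume the QPs, flush queued work."""
        self.remote_v2p.update(new_translations)
        self.paused -= set(new_translations)
        pending, self.pending = self.pending, []
        for vqpn, payload in pending:
            self.send(vqpn, payload)
```

The guest on either side keeps addressing the same virtual QPN throughout, which is why the scheme is invisible to the guest. It also shows why both endpoints must be VMs: a native peer has no emulation layer in which to hold the translation table.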
vMOTION (FUTURE WORK)
• Support vMotion when one of endpoints is native (non-VM)
  • Need hardware support
  • Recreate specific QPNs and MR keys
  • Ability to pause and resume QP state on the hardware
    • Save/Restore intermediate QP states
• Provide isolated resource space to each PVRDMA device
  • Guarantee that specified resources can be recreated
  • Avoid collisions with existing resources
• Expose hardware resources directly to guest
  • Lower virtualization overhead
PERFORMANCE
§ Testbed
• 2 x Dell T320 Hosts E5-2440 @ 2.40GHz, 24 GiB, Mellanox ConnectX-3
• VMs: Ubuntu 12.04, 3.5.0.45, x86_64, 2 vCPUs, 2 GiB
• OFED Send Latency Test
  • Half RTT for 10K iterations
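For reference, half RTT is the measured round-trip time of a ping-pong exchange divided by two, summarized over all iterations. The helper below is a hypothetical sketch of that arithmetic only; it is not the OFED test itself, and the sample values in the usage note are made up for illustration, not measurements from this testbed.

```python
from statistics import mean, median


def half_rtt_summary(rtt_samples_us):
    """Summarize one-way latency estimates (half RTT) in microseconds."""
    half = [rtt / 2.0 for rtt in rtt_samples_us]
    return {
        "min": min(half),
        "median": median(half),
        "mean": mean(half),
        "max": max(half),
    }
```

For example, `half_rtt_summary([4.0, 6.0, 8.0])` yields a minimum of 2.0, a median and mean of 3.0, and a maximum of 4.0 microseconds.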
CURRENT LIMITATIONS
• Communication between VM and native endpoints not supported
• Need a way to create resources with specified IDs
  • May need additional hardware support from vendors
  • Formalize vMotion support on hardware
• Currently only supports RoCEv1 in the guest
  • Can still operate over underlying RoCEv2-only HCA
  • No InfiniBand/iWARP support (future work)
• No remote READ/WRITE support on DMA MRs
  • No SRQ/Atomics support yet
    • SRQs not currently supported on host ESX
• Only supports Linux guests currently
  • No failover support for PVRDMA