InfiniBand/RDMA for Storage - SRP vs. iSER
InfiniBand/RDMA for Storage – SRP vs. iSER
Sebastian Riemer
Linux Kernel Developer – Storage
23.05.2013
Structure
● RDMA Basics
● RDMA Hardware
  ● InfiniBand, iWARP, RoCE
● RDMA Software + Network Protocols
● SRP vs. iSER
RDMA for Storage 2/28 23.05.2013
RDMA Basics
Remote Direct Memory Access (RDMA)
Latency
e.g. 4k sync. reads, status/information requests, ...
RDMA MTU
● RDMA MTU: 256, 512, 1024, 2048, 4096 Bytes
● MTU size affects throughput and transfer latency
● Max. MTU is settable
● Active MTU is determined
● InfiniBand: RDMA MTU is native
● iWARP/RoCE: RDMA MTU must fit into Ethernet MTU: 1500 → 1024 Bytes
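The "1500 → 1024 Bytes" rule can be sketched as a small shell function that picks the largest RDMA MTU still fitting into a given Ethernet MTU. The helper name `rdma_mtu_for` and the optional header-overhead argument are assumptions made for this illustration; they are not part of OFED or of the original slides.

```shell
#!/bin/sh
# Sketch: choose the largest RDMA MTU that still fits into an Ethernet MTU.
# rdma_mtu_for is a hypothetical helper, not an OFED tool.
# $1: Ethernet MTU in bytes
# $2: bytes reserved for transport headers (assumption, defaults to 0)
rdma_mtu_for() {
    eth_mtu=$1
    overhead=${2:-0}
    best=0
    for mtu in 256 512 1024 2048 4096; do
        if [ $((mtu + overhead)) -le "$eth_mtu" ]; then
            best=$mtu
        fi
    done
    echo "$best"
}

rdma_mtu_for 1500   # standard Ethernet frames
rdma_mtu_for 9000   # jumbo frames
```

With these inputs the function prints 1024 and 4096. A result of 0 means that not even the smallest RDMA MTU fits.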
RDMA Hardware
InfiniBand (IB)
● Switched fabric interconnect
● Arbitrary topologies: Fat Tree, Mesh, Lash, ...
● Point-to-point bidirectional serial links
● Used in HPC and Enterprise Data Centers
● QDR 10 Gbit/s, FDR 14 Gbit/s per lane
● Lanes: 4
● Low end-to-end latency < 2 µs (1 GbE: 35 µs)
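As a rough worked example of the per-lane figures, the following sketch multiplies the lane rate by the lane count and applies the line encoding. The encodings (8b/10b for QDR, 64b/66b for FDR) are not stated on the slide and are added here as background; `link_rate_mbit` is a hypothetical helper using integer arithmetic in Mbit/s.

```shell
#!/bin/sh
# Sketch: usable link rate = lane rate * lanes * encoding efficiency.
# link_rate_mbit is a hypothetical helper.
# $1: signaling rate per lane in Mbit/s
# $2: number of lanes
# $3, $4: encoding ratio as data bits / line bits
link_rate_mbit() {
    echo $(( $1 * $2 * $3 / $4 ))
}

link_rate_mbit 10000 4 8 10    # QDR, 4 lanes, 8b/10b encoding
link_rate_mbit 14000 4 64 66   # FDR, 4 lanes, 64b/66b encoding
```

This prints 32000 and 54303, i.e. roughly 32 Gbit/s of data on a 40 Gbit/s QDR link and about 54 Gbit/s on a 56 Gbit/s FDR link.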
InfiniBand (IB)
● Subnet Manager (SM)
● LID (16 bit) and GID (128 bit) addressing
● GID = 64 bit subnet prefix + 64 bit GUID
● Max. 128 partitions (like VLANs)
● QoS, reliability and scalability
● Credit-based flow control → no packet loss
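The GID composition can be illustrated with a few lines of shell. `make_gid` is a hypothetical helper; the port GUID is the one used in the SRP connection example later in this deck, and fe80000000000000 is the default (link-local) subnet prefix.

```shell
#!/bin/sh
# Sketch: GID = 64 bit subnet prefix + 64 bit GUID.
# make_gid is a hypothetical helper; both arguments are 16 hex digits
# without separators.
make_gid() {
    prefix=$1
    guid=$2
    if [ ${#prefix} -ne 16 ] || [ ${#guid} -ne 16 ]; then
        echo "prefix and GUID must be 16 hex digits each" >&2
        return 1
    fi
    case "$prefix$guid" in
        *[!0-9a-fA-F]*)
            echo "non-hex character in input" >&2
            return 1
            ;;
    esac
    echo "$prefix$guid"
}

# Default subnet prefix + port GUID of the target port
make_gid fe80000000000000 0002c903004ed0b3
```

This prints fe800000000000000002c903004ed0b3, the same 128 bit value that is passed as `dgid` when establishing an SRP connection.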
InfiniBand Congestion
● Congestion Control (CC) not ready yet
● CC = tell SM to tell others to reduce their speed
● Reduce MTU, set QoS, set IO limits, multipath
[Figure labels: BLOCKED, NO CREDITS (tell SM); master SM, slave SM]
Host Channel Adapters (HCA)
● IB counterpart of NICs
● Communicate via a Queue Pair (QP) consisting of Send Queue (SQ) and Receive Queue (RQ)
● Reliable/Unreliable, Connected/Datagram
● Support for atomic operations
● Error counters in HW
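On Linux, the hardware error counters are usually exposed per port below /sys/class/infiniband. The following sketch lists all counters with a non-zero value; `list_nonzero_counters` is a hypothetical helper, and the sysfs root is a parameter so that the function can also be tried against a copy of the directory tree.

```shell
#!/bin/sh
# Sketch: print all non-zero per-port HCA counters found in sysfs.
# list_nonzero_counters is a hypothetical helper.
# $1: sysfs root (assumed default: /sys/class/infiniband)
list_nonzero_counters() {
    root=${1:-/sys/class/infiniband}
    for f in "$root"/*/ports/*/counters/*; do
        [ -f "$f" ] || continue                # no HCA present or not a file
        value=$(cat "$f" 2>/dev/null) || continue
        case "$value" in
            ''|*[!0-9]*) continue ;;           # skip empty or non-numeric content
        esac
        if [ "$value" -ne 0 ]; then
            echo "${f#"$root"/}: $value"
        fi
    done
}

list_nonzero_counters
```

On a host without RDMA hardware the call prints nothing. Each reported line has the form `<device>/ports/<port>/counters/<name>: <value>`.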
Host Channel Adapters (HCA)
Mellanox QDR ConnectX-2 VPI, driver: mlx4_ib
QLogic/Intel QDR 7300 Series, driver: qib
better for the DC/cloud
Internet Wide Area RDMA Protocol (iWARP)
● RDMA Network Interface Card (RNIC)
● Connection-oriented (TCP), only RDMA technology routable through the Internet
● Reliable Connected (RC) only
● Latency, bandwidth: >= 3 µs, usually 10 Gbit/s
● Vendors: Chelsio (driver cxgb3/4), Intel NetEffect (driver nes)
RDMA over Converged Ethernet (RoCE)
● Limited to a single Ethernet broadcast domain
● InfiniBand frame encapsulation (IBoE)
● GID is composed of MAC address + reserved
● Better suited in case of congestion
● Scaling issues in big data center setups
● Latency, bandwidth: < 2 µs, 10/40 Gbit/s
● Vendors: Mellanox (driver mlx4_en), Emulex (driver ocrdma)
RDMA Software + Network Protocols
OpenFabrics Enterprise Distribution (OFED)
● Approx. 30 SW packages
● Upstream version: 3.5
● IB Verbs: Hardware/OS abstraction layer
● One IB verbs user-space driver per RDMA HW
● IB Subnet Management (e.g. opensm)
● Communication Management (CM)
● Performance and diagnosis tools + utilities
RDMA Network Protocols
● IP over InfiniBand (IPoIB)
● iSCSI Extensions for RDMA (iSER)
● SCSI RDMA Protocol (SRP)
● Network File Systems (NFS-RDMA)
● Distributed File Systems (GlusterFS, Lustre)
SRP vs. iSER
iSCSI Extensions for RDMA (iSER)
Target
● User: STGT
● Kernel: Solaris COMSTAR, (LIO isert, kernel 3.10)

● Mellanox pushes iSER and STGT
● No advanced features with STGT like live resizing
● ProfitBricks chose Solaris for ZFS and iSER
● LIO isert is too new
iSCSI Extensions for RDMA (iSER)
open-iscsi Initiator
● User: iscsid
● Kernel: ib_iser, libiscsi, scsi_transport_iscsi, (ib_ipoib)

● Complexity
● Multiple maintainers
● Major IPoIB bugs
● IP-based DDoS reconnect
● Mellanox is mainly improving performance
● Too unstable for IB
SCSI RDMA Protocol (SRP)
Target
● Kernel: SCST ib_srpt, Solaris COMSTAR, (LIO ib_srpt)

● Very committed SCST maintainers Bart and Vlad (Bart Van Assche, Vladislav Bolkhovitin)
● ProfitBricks chose SCST due to ZFS and iSER issues
● LIO SRP unstable/unusable
SCSI RDMA Protocol (SRP)
Initiator
● User: (srp-tools)
● Kernel: ib_srp, scsi_transport_srp

● Simplicity: RDMA-only, kernel-only possible
● Inactive maintainer
● No fast IO failing, no continuous reconnect
● Losing SCSI disks
● Bart + Mellanox are active
● Bart's work doesn't fit us
ProfitBricks Choices
● Simplicity = Stability → SRP without srp-tools
● Help improve SCST
● Improved SRP initiator ourselves
  ● Just fast IO failing + automatic reconnect
  ● Never lose SCSI devices automatically
● Published SRP initiator fixes
● Implement RDMA into QEMU for performance
SRP Fixes
● From Bart: https://github.com/bvanassche/ib_srp-backport
● From ProfitBricks: https://github.com/sriemer/ib_srp
● Bart also has performance patches + backport
● Bart uses the srp-tools + losing SCSI devices
● Gradually finding compromises
Establish an SRP connection
● THCA_GUID="0002c903004ed0b2"
● TGID_P1="fe800000000000000002c903004ed0b3"
● PKEY="ffff"
● IHCA="mlx4_0"
● IHCA_P1="1"
● SRP="id_ext=${THCA_GUID},ioc_guid=${THCA_GUID},dgid=${TGID_P1},pkey=${PKEY},service_id=${THCA_GUID}"
● echo "${SRP}" > /sys/class/infiniband_srp/srp-${IHCA}-${IHCA_P1}/add_target
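The same steps can be wrapped into a function that checks the parameter lengths before the target string is assembled. This is only a sketch: `srp_target_string` is a hypothetical helper, and the final line prints the command as a dry run instead of writing to sysfs, so it can be tried without an HCA.

```shell
#!/bin/sh
# Sketch: assemble the add_target string for the ib_srp initiator.
# srp_target_string is a hypothetical helper.
# $1: target HCA GUID (16 hex digits)
# $2: GID of the target port (32 hex digits)
# $3: partition key (4 hex digits, defaults to ffff)
srp_target_string() {
    guid=$1
    dgid=$2
    pkey=${3:-ffff}
    if [ ${#guid} -ne 16 ] || [ ${#dgid} -ne 32 ] || [ ${#pkey} -ne 4 ]; then
        echo "unexpected parameter length" >&2
        return 1
    fi
    echo "id_ext=${guid},ioc_guid=${guid},dgid=${dgid},pkey=${pkey},service_id=${guid}"
}

# Dry run: print the command instead of executing it.
SRP=$(srp_target_string 0002c903004ed0b2 fe800000000000000002c903004ed0b3) &&
    echo "echo \"${SRP}\" > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target"
```

Running the printed command as root on the initiator host would request the connection, provided the ib_srp module is loaded and the initiator HCA and port names match.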
InfiniBand/RDMA Links/Information
● InfiniBand Trade Association (IB specification, doc, www.infinibandta.com)
● OpenFabrics Alliance (OFA, OFED providers, www.openfabrics.org)
● Mellanox Technologies (www.mellanox.com)
● [email protected] mailing list
● LinkedIn group „InfiniBand Technologists“
Questions?
● Questions???
● [email protected]
● www.profitbricks.com
Bonus: How to do replication right?
[Figure: three replication setups, storage attached via SRP/iSER/iSCSI]
● Primary + Secondary, IP, Cluster Manager: WRONG! Store & forward writes! Slow!
● Primary + Primary, IP, Cluster Manager: WRONG! Complex, error-prone!
● LUN + LUN, e.g. SW RAID-1: RIGHT! Simple and fast!
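As one possible reading of the "SW RAID-1" variant, the two LUNs imported via SRP/iSER/iSCSI could be mirrored with Linux MD on the initiator host. The sketch below only prints the mdadm command; `raid1_create_cmd` is a hypothetical helper, and the device names are placeholders that do not come from the slides.

```shell
#!/bin/sh
# Sketch: mirror two imported LUNs with Linux software RAID-1.
# raid1_create_cmd is a hypothetical helper; it only prints the command.
# $1: MD device to create, $2 and $3: the two imported LUNs
raid1_create_cmd() {
    if [ $# -ne 3 ]; then
        echo "usage: raid1_create_cmd MD_DEV LUN_A LUN_B" >&2
        return 1
    fi
    echo "mdadm --create $1 --level=1 --raid-devices=2 $2 $3"
}

# /dev/md0, /dev/sdb and /dev/sdc are placeholder device names.
raid1_create_cmd /dev/md0 /dev/sdb /dev/sdc
```

In this layout every write goes directly to both storage servers, so no store-and-forward hop and no cluster manager between the storage servers is involved.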