Fabric Consolidation with InfiniBand Dror Goldenberg, Mellanox Technologies


Fabric Consolidation with InfiniBand © 2009 Storage Networking Industry Association. All Rights Reserved.

SNIA Legal Notice

The material contained in this tutorial is copyrighted by the SNIA. Member companies and individual members may use this material in presentations and literature under the following conditions:

- Any slide or slides used must be reproduced in their entirety without modification.
- The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations.

This presentation is a project of the SNIA Education Committee.

Neither the author nor the presenter is an attorney and nothing in this presentation is intended to be, or should be construed as legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney.

The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information.

NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.


Abstract

In the era of exploding datacenter requirements for networking and storage services, and with the increasing power, space, and budget concerns over the infrastructure, fabric consolidation becomes inevitable. InfiniBand was designed from day one for fabric consolidation. With 120Gb/s links and with ultra low-latency characteristics, InfiniBand provides a well provisioned foundation for consolidation of networking and storage. Additional features such as QoS, partitioning, virtual lanes, lossless fabric, and congestion management facilitate true consolidation of fabrics along with connectivity of InfiniBand islands to Ethernet and Fibre Channel clouds through gateways. This session highlights the features for fabric consolidation and the various protocols that run over InfiniBand with emphasis on storage protocols.

Learning objectives

- Understand the InfiniBand architecture and feature set.
- Understand the benefits of InfiniBand for fabric consolidation.
- Understand the standard InfiniBand storage protocols.

Agenda

- Motivation and General Overview
- Protocol Stack Layers
- Storage Protocols over InfiniBand
- Benefits

Motivation for Fabric Consolidation

- Slower I/O drives more NICs, HBAs, wires
- Different service needs drive different fabrics
- No flexibility
- More ports, fabrics to manage
- More power
- More space
- … Higher TCO

[Figure: a host running storage, network, and management applications, each attached through its own FC adapters and GbE NICs]

Fabric Consolidation

- High bandwidth pipe for capacity provisioning
- Dedicated I/O channels enable convergence
  - For Networking, Storage, Management traffic
  - QoS – across different traffic types
  - Partitions and isolation provided
- Flexibility
  - Soft servers and fabric repurposing

[Figure: a host running networking, management, and storage applications over a single IB HCA ("One Wire")]

Consolidation is Real!

- Los Alamos National Lab Coyote Cluster
  - 1,408 nodes – “all IB” cluster
  - 11.26 Tera-Flops (theoretical peak)
- Tier one Data Base Vendor
  - Highly Efficient Data Center in a Box
  - IPC and Storage over InfiniBand
- Major e-Commerce Hosting Company
  - Fabric consolidation with IB to FC and EN Gateways
  - Significant saving on infrastructure

Why InfiniBand?

- Superior performance
  - 40Gb/s host/target ports
  - 120Gb/s switch to switch
  - Sub 1µs end to end latencies
  - Aggressive roadmap
- Unified fabric for the Data Center
  - Storage, networking & clustering over a single wire
  - Scalable to 1000s of nodes
- Cost Effective
  - Compelling price/performance advantage over alternative technologies
- Low power Consumption – Green IT
  - Less than 0.15W per Gb/s
- Mission Critical
  - Highly reliable fabric
  - Multi-pathing
  - Automatic failover
  - Highest level of data integrity

Fabric Technologies Comparison

Features | Fibre Channel 8G FC | Ethernet 10GigE | InfiniBand 4X QDR
Line Rate (GBaud) | 8.5 | 10.3125 | 40
Unidirectional Throughput (MBytes/s) | 800 | 1,250 | 4,000*
Fabric Consolidation | Practically no | FCoE coming soon … | Yes
Copper Distance | 15m | 10GBASE-CX4 15m; 10GBASE-T 100m | Passive 7m; Active 15m
Optical Distance† | 100m | 10GBASE-SR 300m | 100-300m

* Theoretical, 3.25 GB/s measured due to server I/O limitations
† Data center oriented media

Physical Layer

- Width (1X, 4X, 8X, 12X) including auto-negotiation
- Speed (SDR/DDR/QDR/EDR) including auto-negotiation
  - 4X QDR HCAs and switches are currently shipping
- Power management
- Connector
  - Board: MicroGiGaCN*
  - Pluggable: QSFP
- 8b/10b encoding
  - Maintain DC balance
  - Limited run length of 0’s or 1’s
- Control symbols (Kxx.x)
- Lane de-skew
- Auto negotiation / training
- Clock tolerance
- Framing

Link Speed (10^9 bit/sec)

Link Width | SDR (2.5GHz) | DDR (5GHz) | QDR (10GHz) | EDR (20GHz)
1X | 2.5 | 5 | 10 | 20
4X | 10 | 20 | 40 | 80
8X | 20 | 40 | 80 | 160
12X | 30 | 60 | 120 | 240

* MicroGiGaCN is a trademark of Fujitsu Components Limited
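The link speed table follows from multiplying the per-lane signalling rate by the link width, and the theoretical throughput follows from the 8b/10b overhead. A minimal sketch of that arithmetic, assuming the lane rates from the table and 8b/10b on every speed as the slide presents it; the function and constant names are illustrative and not part of any InfiniBand API:

```python
# Per-lane signalling rates in Gb/s, as listed in the link speed table.
LANE_RATE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0, "EDR": 20.0}
LINK_WIDTHS = (1, 4, 8, 12)

# 8b/10b encoding carries 8 data bits in every 10 transmitted bits.
ENCODING_EFFICIENCY = 8 / 10


def signalling_rate_gbps(width: int, speed: str) -> float:
    """Raw link rate: number of lanes times the per-lane signalling rate."""
    if width not in LINK_WIDTHS:
        raise ValueError(f"unsupported width: {width}X")
    return width * LANE_RATE_GBPS[speed]


def data_rate_mbytes(width: int, speed: str) -> float:
    """Theoretical unidirectional data rate in MB/s after 8b/10b overhead."""
    data_gbps = signalling_rate_gbps(width, speed) * ENCODING_EFFICIENCY
    return data_gbps * 1000 / 8


if __name__ == "__main__":
    for width in LINK_WIDTHS:
        row = [signalling_rate_gbps(width, speed) for speed in LANE_RATE_GBPS]
        print(f"{width}X", row)
    print("4X QDR theoretical MB/s:", data_rate_mbytes(4, "QDR"))
```

Under these assumptions a 4X QDR link signals at 40 Gb/s and carries 32 Gb/s of data, which is where the theoretical 4,000 MBytes/s figure in the comparison table comes from.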

Physical Layer – Cont’d

Fiber Optics*:

Width | Speed | Connector | Reach | Type / Power‡ | Fiber Media
4X | SDR/DDR/QDR | Micro-GiGaCN / QSFP | 300m/150m/100m | Media Converter, 0.8-1.5W | 12 strand MPO fiber
4X | DDR/QDR | Micro-GiGaCN / QSFP† | 300m/100m | Optical Cable, 1-3W | 12 strand attached

Copper Cables*:

Width | Speed | Connector | Reach | Type / Power‡
4X | SDR/DDR/QDR | Micro-GiGaCN / QSFP | 20m/10m/7m | Passive
4X | DDR | Micro-GiGaCN | 15-25m | Active, 0.5-1.5W
4X | QDR | QSFP† | 12-15m | Active, 1-2W
12X | SDR/DDR | 24pin Micro-GiGaCN | 20m/10m | Passive

* Currently deployed
† Sampling
‡ Per End

[Figure: connector and cable photos: 4X MicroGiGaCN, 4X MicroGiGaCN Media Converter, 12X MicroGiGaCN, 4X MicroGiGaCN Optical Cable, 4X QSFP, 4X QSFP Media Converter]

Link Layer

- Addressing and Switching
  - Local Identifier (LID) addressing
  - Unicast LID - 48K addresses
  - Multicast LID – up to 16K addresses
  - Efficient linear lookup
  - Cut through switching (ultra low latency)
  - Multi-pathing support through LMC
- Independent Virtual Lanes
  - Flow control (lossless fabric)
  - Service level VL arbitration for QoS
- Congestion control
  - Forward / Backward Explicit Congestion Notification (FECN/BECN)
- Data Integrity
  - Invariant CRC
  - Variant CRC

[Figure: Independent Virtual Lanes (VLs) feeding a VL arbiter (VL ARB)]

[Figure: H/L Weighted Round Robin (WRR) VL Arbitration: a high-priority WRR table and a low-priority WRR table feed a priority select stage that picks the packets to be transmitted]

[Figure: Efficient FECN/BECN Based Congestion Control: a switch marks FECN when a threshold is crossed, BECN is returned to the sending HCA, and the HCA applies per QP/VL injection rate control]
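The address counts on this slide correspond to a split of the 16-bit LID space into a unicast range and a multicast range, and LMC lets one port answer to 2^LMC consecutive unicast LIDs so that a different path can be chosen per LID. A minimal sketch, assuming the range boundaries 0x0001-0xBFFF (unicast) and 0xC000-0xFFFE (multicast) and a 3-bit LMC as defined in the InfiniBand specification; the helper names are hypothetical:

```python
UNICAST_LIDS = range(0x0001, 0xBFFF + 1)    # roughly 48K addresses
MULTICAST_LIDS = range(0xC000, 0xFFFE + 1)  # roughly 16K addresses


def classify_lid(lid: int) -> str:
    """Classify a 16-bit LID as unicast, multicast, or reserved."""
    if lid in UNICAST_LIDS:
        return "unicast"
    if lid in MULTICAST_LIDS:
        return "multicast"
    return "reserved"


def port_lids(base_lid: int, lmc: int) -> list[int]:
    """LIDs a port responds to when the low `lmc` bits are masked."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC is a 3-bit value (0..7)")
    if base_lid & ((1 << lmc) - 1):
        raise ValueError("base LID must be aligned to 2**lmc")
    lids = [base_lid + offset for offset in range(1 << lmc)]
    if lids[0] not in UNICAST_LIDS or lids[-1] not in UNICAST_LIDS:
        raise ValueError("LID range falls outside the unicast space")
    return lids


if __name__ == "__main__":
    print(len(UNICAST_LIDS), len(MULTICAST_LIDS))
    print([hex(lid) for lid in port_lids(0x0010, 2)])
```

With LMC = 2 the port in this example owns four LIDs, so a subnet manager could program up to four distinct routes toward it.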

Fabric Consolidation – VLs and Scheduling Example

- VLs and scheduling can be dynamically configured and adjusted to match application performance requirements

[Figure: one physical InfiniBand fabric shown as three logical virtual lanes]

- Low Latency VL
  - For Clustering
- Mainstream Storage VL
  - Day - at least 40% BW
  - Night – at least 20% BW
- Backup VL
  - Day – at least 20% BW
  - Night – at least 60% BW
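The day and night bandwidth guarantees in this example can be approximated with weighted round robin arbitration. The sketch below is a simplification under stated assumptions: it uses a single arbitration table with packet-count weights instead of the high and low priority tables, and the weight values and VL names are made up to match the percentages on the slide.

```python
from collections import Counter, deque


def wrr_arbitrate(queues, weights, slots):
    """Pick packets for `slots` transmit opportunities by weighted round robin.

    queues:  dict mapping VL name -> deque of pending packets
    weights: dict mapping VL name -> packets that VL may send per round
    """
    sent = []
    while len(sent) < slots and any(queues.values()):
        for vl, weight in weights.items():
            for _ in range(weight):
                if len(sent) == slots or not queues[vl]:
                    break
                sent.append(queues[vl].popleft())
    return sent


# Hypothetical weights out of 10 per round, chosen to match the slide.
DAY = {"cluster": 4, "storage": 4, "backup": 2}
NIGHT = {"cluster": 2, "storage": 2, "backup": 6}


def shares(weights, slots=100):
    """Share of transmit slots per VL when every VL is permanently backlogged."""
    queues = {vl: deque([vl] * slots) for vl in weights}
    return Counter(wrr_arbitrate(queues, weights, slots))


if __name__ == "__main__":
    print("day:", dict(shares(DAY)))
    print("night:", dict(shares(NIGHT)))
```

Switching from the day table to the night table only changes the weights, which mirrors the point that scheduling can be adjusted dynamically without touching the physical fabric.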

Network Layer

- Global Identifier (GID) addressing
  - IPv6 addressing scheme
  - GID = {64 bit GID prefix, 64 bit GUID}
    - GUID = Global Unique Identifier (64 bit EUI-64)
    - GUID 0 – assigned by the manufacturer
    - GUID 1..(N-1) – assigned by the Subnet Manager
- Optional for local subnet access
- Used for multicast distribution within end nodes
- Enables routing between IB subnets
  - Definition underway in IBTA
  - Leverages IPv6 routing algorithms

[Figure: Subnet A and Subnet B connected by an IB Router]
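Because a GID is laid out like an IPv6 address, it can be composed and taken apart with ordinary 128-bit arithmetic. A minimal sketch using Python's standard `ipaddress` module, assuming the default link-local style subnet prefix 0xFE80000000000000; the GUID value and the function names are hypothetical:

```python
import ipaddress

# Assumed default GID prefix (link-local style); a subnet manager may
# assign a different 64-bit prefix per subnet.
DEFAULT_GID_PREFIX = 0xFE80_0000_0000_0000
MASK_64 = (1 << 64) - 1


def make_gid(guid: int, prefix: int = DEFAULT_GID_PREFIX) -> ipaddress.IPv6Address:
    """GID = {64 bit GID prefix, 64 bit GUID}."""
    if not 0 <= guid <= MASK_64 or not 0 <= prefix <= MASK_64:
        raise ValueError("prefix and GUID must each fit in 64 bits")
    return ipaddress.IPv6Address((prefix << 64) | guid)


def split_gid(gid: ipaddress.IPv6Address) -> tuple[int, int]:
    """Return (prefix, guid) from a 128-bit GID."""
    value = int(gid)
    return value >> 64, value & MASK_64


if __name__ == "__main__":
    example_guid = 0x0002_C903_0000_1234  # made-up GUID for illustration
    gid = make_gid(example_guid)
    print(gid, [hex(part) for part in split_gid(gid)])
```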

Transport – Host Channel Adapter (HCA) Model

- Asynchronous interface
  - Consumer posts work requests
  - HCA processes
  - Consumer polls completions
- I/O channel exposed to the application
- Transport services
  - Reliable / Unreliable
  - Connected / Datagram
  - Send/Receive, RDMA, Atomic operations
- Offloading
  - Transport executed by HCA
  - Kernel bypass
  - RDMA

[Figure: HCA model: a consumer posts WQEs to the Send Queue and Receive Queue of each QP and polls CQEs from a Completion Queue; the Transport and RDMA Offload Engine connects the QPs to the ports, each port carrying multiple VLs]

[Figure: two host layouts: CPUs and memory attached through a chipset, or through a bridge, with the HCA on PCIe connecting to InfiniBand]
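The post, process, poll cycle of the asynchronous interface can be illustrated with a toy model. This is not the verbs API: the classes and functions below are invented for illustration, and the "HCA" is just a function that drains the send queue and writes completions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkRequest:
    wr_id: int
    opcode: str   # e.g. "SEND" or "RDMA_WRITE"
    length: int


@dataclass
class Completion:
    wr_id: int
    status: str


class ToyQueuePair:
    """A send queue that reports into a shared completion queue."""

    def __init__(self, completion_queue: deque):
        self.send_queue = deque()
        self.completion_queue = completion_queue

    def post_send(self, wr: WorkRequest) -> None:
        # Posting only enqueues a work queue element (WQE); it never blocks.
        self.send_queue.append(wr)


def hca_process(qp: ToyQueuePair) -> None:
    """Stand-in for the HCA: execute queued WQEs and emit CQEs in order."""
    while qp.send_queue:
        wr = qp.send_queue.popleft()
        qp.completion_queue.append(Completion(wr.wr_id, "success"))


def poll_cq(completion_queue: deque, max_entries: int) -> list:
    """Non-blocking poll: return up to `max_entries` completions."""
    polled = []
    while completion_queue and len(polled) < max_entries:
        polled.append(completion_queue.popleft())
    return polled


if __name__ == "__main__":
    cq = deque()
    qp = ToyQueuePair(cq)
    for i in range(3):
        qp.post_send(WorkRequest(wr_id=i, opcode="SEND", length=4096))
    print(poll_cq(cq, 8))   # nothing has completed yet
    hca_process(qp)         # the "hardware" runs independently of the consumer
    print(poll_cq(cq, 8))
```

The consumer never waits inside `post_send`; it can keep computing and collect completions later, which is the property the slide calls the asynchronous interface.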

Partitions

- Logically divide the fabric into isolated domains
- Partial and full membership per partition
- Partition filtering at switches
- Similar to
  - 802.1Q VLANs
  - FC Virtual Fabrics (VFs)

[Figure: Host A, Host B, and I/O units A-D on one InfiniBand fabric, divided into Partition 1 (inter-host), Partition 2 (private to host B), Partition 3 (private to host A), and Partition 4 (shared)]
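Partial and full membership can be sketched as a check on 16-bit partition keys. The sketch assumes the InfiniBand convention that the top bit of a P_Key marks full membership and the low 15 bits identify the partition, and that two partial members of the same partition may not talk to each other; the function names are hypothetical.

```python
FULL_MEMBER_BIT = 0x8000
PARTITION_MASK = 0x7FFF


def partition_of(pkey: int) -> int:
    """Low 15 bits identify the partition."""
    return pkey & PARTITION_MASK


def is_full_member(pkey: int) -> bool:
    """Top bit set means full membership, clear means partial membership."""
    return bool(pkey & FULL_MEMBER_BIT)


def can_communicate(pkey_a: int, pkey_b: int) -> bool:
    """Same partition, and at least one side is a full member."""
    same_partition = partition_of(pkey_a) == partition_of(pkey_b)
    return same_partition and (is_full_member(pkey_a) or is_full_member(pkey_b))


if __name__ == "__main__":
    storage_target = 0x8004   # full member of partition 4 (made-up value)
    host = 0x0004             # partial member of partition 4
    other_host = 0x0004       # another partial member
    print(can_communicate(host, storage_target))   # host can reach the target
    print(can_communicate(host, other_host))       # hosts stay isolated
```

This is how a shared partition can expose an I/O unit to several hosts while keeping those hosts isolated from one another.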

InfiniBand Data Integrity

- Hop by hop
  - VCRC – 16 bit CRC
  - CRC16 0x100B
- End to end
  - ICRC – 32 bit CRC
  - CRC32 0x04C11DB7
  - Same CRC as Ethernet
- Application level
  - T10/DIF Logical Block Guard
    - Per block CRC
    - 16 bit CRC 0x8BB7

[Figure: data path from InfiniBand block storage through switches, and through a gateway to FC block storage on a Fibre Channel SAN: VCRC is checked on every hop, ICRC covers the InfiniBand fabric end to end, and T10/DIF covers the full path to the storage]
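The polynomials listed on this slide can be exercised with a few lines of code. The sketch below implements a plain bitwise CRC-16 with the T10/DIF polynomial 0x8BB7 (zero initial value, no bit reflection, which is an assumption about the guard tag parameters) and uses `zlib.crc32` for the Ethernet CRC-32. It does not reproduce the exact ICRC or VCRC field coverage, which the InfiniBand specification defines separately.

```python
import zlib

T10_DIF_POLY = 0x8BB7


def crc16(data: bytes, poly: int = T10_DIF_POLY, init: int = 0) -> int:
    """Bitwise MSB-first CRC-16 with no reflection and no final XOR."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def protect_block(block: bytes) -> bytes:
    """Append a 2-byte guard tag computed over the block."""
    return block + crc16(block).to_bytes(2, "big")


def verify_block(protected: bytes) -> bool:
    """A block with a correct guard tag leaves a zero remainder."""
    return crc16(protected) == 0


if __name__ == "__main__":
    block = bytes(range(256)) * 2          # a made-up 512-byte block
    protected = protect_block(block)
    print(verify_block(protected))
    corrupted = bytes([protected[0] ^ 0x01]) + protected[1:]
    print(verify_block(corrupted))
    # CRC-32 with polynomial 0x04C11DB7 as used by Ethernet (zlib's variant).
    print(hex(zlib.crc32(b"123456789")))
```

Because the guard tag travels with each block, a storage target can verify it independently of the per-hop and end-to-end fabric CRCs.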

High Availability and Redundancy

- Multi-port HCAs
  - Covers link failure
- Redundant fabric topologies
  - Covers link failure
- Link layer multi-pathing (LMC)
- Automatic Path Migration (APM)
- ULP High Availability
  - Application level multi-pathing (SRP/iSER)
  - Teaming/Bonding (IPoIB)
  - Covers HCA failure and link failure

Upper Layer Protocols

- ULPs connect InfiniBand to common interfaces
- Supported on mainstream operating systems
- Clustering
  - MPI (Message Passing Interface)
  - RDS (Reliable Datagram Socket)
- Network
  - IPoIB (IP over InfiniBand)
  - SDP (Sockets Direct Protocol)
- Storage
  - SRP (SCSI RDMA Protocol)
  - iSER (iSCSI Extensions for RDMA)
  - NFSoRDMA (NFS over RDMA)

[Figure: software stack. Applications: IB apps, clustering apps, socket based apps, storage apps. Operating system: sockets and user storage interfaces (file/block). InfiniBand infrastructure: TCP/IP over IPoIB, SDP, and RDS under the socket interface; SRP and iSER for block storage; NFS over RDMA for file storage; MPI for HPC clustering using kernel bypass; all on top of InfiniBand Core Services, the Device Driver, and the Hardware]

InfiniBand Block Storage Protocols

- SRP - SCSI RDMA Protocol
  - Defined by T10
- iSER – iSCSI Extensions for RDMA
  - Defined by IETF IP Storage WG
  - InfiniBand specifics defined by IBTA (e.g. CM)
  - Leverages iSCSI management infrastructure
- Protocol offload
  - Use IB Reliable Connected
  - RDMA for zero copy data transfer

[Figure: SAM-3 layering compared across transports. Every stack has a SCSI Application Layer on top. SCSI Transport Protocol Layer: FC-4 Mapping (FCP-3) for Fibre Channel, SRP for InfiniBand, iSCSI with iSER for InfiniBand / iWARP. Interconnect Layer: FC-3 (FC-FS, FC-LS), FC-2 (FC-FS), FC-1 (FC-FS), FC-0 (FC-PI) for Fibre Channel; InfiniBand for SRP; InfiniBand / iWARP for iSER]

SRP - Data Transfer Operations

- Send/Receive
  - Commands
  - Responses
  - Task management
- RDMA – Zero Copy Path
  - Data-In
  - Data-Out
  - Target issues the RDMA operations
- iSER uses the same principles
  - Immediate/Unsolicited data allowed through Send/Receive
- iSER and SRP are part of mainline Linux kernel

[Figure: initiator/target message sequence diagrams for IO Read and IO Write]
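The two sequence diagrams can be written down as ordered message lists. The sketch below follows the message names used in this deck's block storage data transfer summary (SRP_CMD, SRP_RSP, RDMA Write for Data-In, RDMA Read and RDMA Read Response for Data-Out); the function itself is hypothetical and only models ordering, not real wire traffic.

```python
def srp_exchange(direction: str) -> list[tuple[str, str]]:
    """Return (sender, message) pairs for one SRP I/O, in order."""
    if direction == "read":
        # Data-In: the target pushes data into initiator memory.
        data_phase = [("target", "RDMA Write (Data-In)")]
    elif direction == "write":
        # Data-Out: the target pulls data from initiator memory.
        data_phase = [
            ("target", "RDMA Read (Data-Out request)"),
            ("initiator", "RDMA Read Response (data)"),
        ]
    else:
        raise ValueError("direction must be 'read' or 'write'")
    return [
        ("initiator", "SRP_CMD (SEND)"),
        *data_phase,
        ("target", "SRP_RSP (SEND)"),
    ]


if __name__ == "__main__":
    for direction in ("read", "write"):
        print(f"IO {direction}:")
        for sender, message in srp_exchange(direction):
            print(f"  {sender:9s} -> {message}")
```

In both directions the target starts the RDMA operation, so the initiator's CPU is only involved in sending the command and receiving the response.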

Discovery Mechanism

- SRP
  - Persistent Information {Node_GUID:IOC_GUID}
  - Subnet Administrator (Identify all ports with CapabilityMask.IsDM)
  - Identifiers
    - Per LUN WWN (through INQUIRY VPD)
    - SRP Target Port ID {IdentifierExt[63:0], IOC GUID[63:0]}
    - Service Name – SRP.T10.{PortID ASCII}
    - Service ID – Locally assigned by the IOC/IOU
- iSER – uses iSCSI’s (RFC 3721)
  - Static Configuration {IP, port, target name}
  - Send Targets {IP, port}
  - SLP
  - iSNS
  - Target naming (RFC 3721/3980)
    - iSCSI Qualified Names (iqn.), IEEE EUI64 (eui.), T11 Network Address Authority (naa.)

[Figure: InfiniBand I/O Model: an I/O Unit containing I/O Controllers]

NFS over RDMA

- Defined by IETF
  - ONC-RPC extensions for RDMA
  - NFS mapping
- RPC Call/Reply
  - Send/Receive – if small
  - Via RDMA Read chunk list - if big
- Data transfer
  - RDMA Read/Write – described by chunk list in XDR message
  - Send – inline in XDR message
- Uses InfiniBand Reliable Connected QP
- Uses IP extensions to CM
  - Connection based on {IP, port}
- Zero copy data transfers
- NFSoRDMA is part of mainline Linux kernel

[Figure: client/server message sequence diagrams for NFS READ and NFS WRITE]
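The "if small" versus "if big" choice for RPC calls amounts to a size threshold. A minimal sketch of that decision, assuming a hypothetical inline threshold of 1024 bytes (real implementations negotiate or configure this value) and invented function and label names:

```python
# Assumed inline threshold in bytes; illustrative only.
INLINE_THRESHOLD = 1024


def plan_rpc_call(message_len: int, inline_threshold: int = INLINE_THRESHOLD) -> str:
    """Choose how an RPC call message travels from client to server."""
    if message_len < 0:
        raise ValueError("message length cannot be negative")
    if message_len <= inline_threshold:
        # Small message: carried inline in a Send.
        return "send_inline"
    # Large message: the client advertises a chunk list and the server
    # fetches the body with RDMA Read.
    return "rdma_read_chunk_list"


if __name__ == "__main__":
    for size in (200, 1024, 65536):
        print(size, plan_rpc_call(size))
```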

Storage Gateways

- Benefits
  - Connectivity of InfiniBand islands to SAN
  - I/O scales independently of computing
  - Design based on average load across multiple servers
- Current Gateways
  - SRP to FC
  - iSER to FC
  - Stateful architecture
- Gateways Futures
  - FCoIB to FC
  - FCoE sibling
  - Stateless architecture
  - Scalable, high performance

[Figure: servers on InfiniBand reach Fibre Channel through a scalable gateway]

[Figure: Stateless Packet Relay. FCoIB frame: IB Header, FCoIB HDR, FC HDR, FC Payload, FC CRC, FCoIB Trailer, IB CRCs. FC frame: FC HDR, FC Payload, FC CRC]

What Drives Efficient Clustering?

- Ultra Low Latency
  - Kernel bypass
  - Flexible polling vs. interrupt model
  - Cut through switching
- High Bandwidth/Message Rate
  - HCAs match server I/O available bandwidth
  - Efficient offloaded implementation
  - Congestion management
  - Adaptive routing
- Scalability
  - Fat tree with equal bisectional bandwidth
  - Linear routing for up to 48K LIDs
- Overlapped Communication and Computation
  - Asynchronous interface
  - RDMA – saves data copy, runs in parallel

What Drives Efficient Storage Access?

- Full I/O offload
- Zero copy
- Interrupt avoidance (moderated per I/O interrupt)
- Offloaded segmentation and reassembly
- Transport reliability
- Lossless fabric – credit based flow control
- Fabric Consolidation
  - Partitions - isolation
  - VL Arbitration – QoS
- Host virtualization friendliness
- High throughput
- Performance counters

Performance Metrics (4x QDR)

- InfiniBand Verbs (Native)
  - Latency
    - RDMA Write 0.84us
    - RDMA Read 1.78us (roundtrip)
  - Bandwidth
    - 3.2GB/s (unidirectional)
    - 6.5GB/s (bidirectional)
- Clustering (MPI)
  - Latency 0.97us
  - Message rate 50M msg/sec
- Block Storage (SRP)
  - Bandwidth 1MB I/O (RAM drive)
    - I/O Read/Write 3.3 GB/s
  - Bandwidth 1MB I/O (23 drives)
    - I/O Read 2.1GB/s
    - I/O Write 1.8GB/s
- File Storage (NFSoRDMA)
  - Bandwidth 64KB record on 1GB file
    - Read 2.9GB/s
    - Write 0.59GB/s

InfiniBand Storage Opportunities & Benefits

- Clustering port can connect to storage
- High Bandwidth Fabric
- Fabric consolidation (QoS, partitioning)
- Efficiency – full offload and zero copy
- Gateways
  - One wire out of the server
  - Shared remote FC ports – scalability
  - Independent growth for I/O and computing
- Network cache
- Clustered/Parallel storage, Backend fabric benefits:
  - Combined with clustering infrastructure
  - Efficient object/block transfer
  - Atomic operations
  - Ultra low latency
  - High bandwidth

[Figure: InfiniBand Storage Deployment Alternatives: servers on InfiniBand reaching Fibre Channel through a gateway; direct attach native IB block storage (SRP/iSER); a native IB file server (NFS RDMA); a parallel / clustered file-system or parallel NFS server with an InfiniBand backend to OSD/block storage targets and native IB JBODs]

Summary

- InfiniBand I/O is a great fit for the datacenter
  - Layered implementation
  - Price/Performance, power
  - Enables efficient SAN, Network, IPC and Management traffic
- InfiniBand brings true fabric consolidation
  - Gateways provide scalable connectivity to existing fabrics
  - Fabric is fully featured for consolidation (partitions, QoS, over provisioning, etc.)
- Existing storage opportunities with InfiniBand
  - E.g. connectivity to HPC clusters, where IB is the dominant fabric

Other SNIA Tutorials

Check out SNIA Tutorials

- Fibre Channel over Ethernet (FCoE)
- Ethernet Enhancements for Storage: Deploying FCoE

Q&A / Feedback

Please send any questions or comments on this presentation to SNIA: [email protected]

Many thanks to the following individuals for their contributions to this tutorial.

- SNIA Education Committee
- Bill Lee
- Gilad Shainer
- Howard Goldstein
- Joe White
- Ron Emerick
- Skip Jones
- Walter Dey


Backup


InfiniBand Resources

InfiniBand software is developed under OpenFabrics Open source Alliance http://www.openfabrics.org/index.html

InfiniBand standard is developed by the InfiniBand® Trade Association http://www.infinibandta.org/home

Reference

- InfiniBand Architecture Specification Volumes 1-2 Release 1.2.1
  - www.infinibandta.org
- IP over InfiniBand
  - RFCs 4391, 4392, 4390, 4755 (www.ietf.org)
- NFS Direct Data Placement
  - http://www.ietf.org/html.charters/nfsv4-charter.html
- iSCSI Extensions for RDMA (iSER) Specification
  - http://www.ietf.org/html.charters/ips-charter.html
- SCSI RDMA Protocol (SRP), DIF
  - www.t10.org

InfiniBand Wire Speed Roadmap

[Figure: InfiniBand wire speed roadmap chart]

Interconnect Trends – Top500

- InfiniBand powers the first PetaFlop supercomputer
  - 12,960 CPU Cores
  - 3,456 Blades

[Figure: Average Cluster Efficiency bar chart comparing InfiniBand and GigE, with data labels 75% and 51%; the Roadrunner system is called out]

Source: http://www.top500.org/
The TOP500 project was started in 1993 to provide a reliable basis for tracking and detecting trends in high-performance computing.

Interconnect: A Competitive Advantage

- Enterprise Data Centers
  - Clustered Database
  - eCommerce and Retail
  - Financial
  - Supply Chain Management
  - Web Services
- High-Performance Computing
  - Biosciences and Geosciences
  - Computer Automated Engineering
  - Digital Content Creation
  - Electronic Design Automation
  - Government and Defense
- Embedded
  - Communications
  - Computing and Storage Aggregation
  - Industrial
  - Medical
  - Military

[Figure: end-users served by servers and blades, embedded systems, switches, and storage]

Applicable Markets for InfiniBand

- Data Centers
  - Clustered database, data warehousing, shorter backups, fabric consolidation, power savings, virtualization, SOA, XTP
- Financial
  - Real-time risk assessment, grid computing and fabric consolidation
- Electronic Design Automation (EDA) and Computer Automated Design (CAD)
  - File system I/O is the bottleneck to shorter job run times
- High Performance Computing
  - High throughput I/O to handle expanding datasets
- Graphics and Video Editing
  - HD file sizes exploding, shorter backups, real-time production

The Need for Better I/O

- Datacenter trends
  - Multi-core CPUs
  - Bladed architecture
  - Fabric consolidation
  - Server virtualization & consolidation
  - Increasing storage demand
- Better I/O is required
  - High capacity
  - Efficient
    - Low latency
    - CPU Offload
  - Scalable
  - Virtualization friendly
  - High availability
  - Performance
  - Low power
  - TCO reduction

[Figure: compute nodes with multiple CPUs and CPU cores, running several OS instances and applications that share each node's I/O]

The InfiniBand Architecture

- Industry standard defined by the InfiniBand Trade Association
- Defines System Area Network architecture
  - Comprehensive specification: from physical to applications
- Architecture supports
  - Host Channel Adapters (HCA)
  - Target Channel Adapters (TCA)
  - Switches
  - Routers
- Facilitated HW design for
  - Low latency / high bandwidth
  - Transport offload

Specification timeline: Rev 1.0 (2000), Rev 1.0a (2001), Rev 1.1 (2002), Rev 1.2 (2004), Rev 1.2.1 (2007)

[Figure: an InfiniBand subnet of switches connecting processor nodes (through HCAs), a storage subsystem and RAID (through TCAs), consoles, a Subnet Manager, and gateways to Ethernet and Fibre Channel]

InfiniBand Topologies

- Example topologies commonly used
- Architecture does not limit topology
- Modular switches are based on fat tree architecture

[Figure: example topologies: Back to Back, 2 Level Fat Tree, 3D Torus, Dual Star, Hybrid]

InfiniBand Protocol Layers

[Figure: two InfiniBand nodes, each with Application, ULP, Transport Layer, Network Layer, Link Layer, and Physical Layer; between them an InfiniBand Switch (packet relay over Link and PHY) and an InfiniBand Router (packet relay over PHY)]

InfiniBand Packet Format

InfiniBand Data Packet

Field | LRH | GRH (Optional) | BTH | Ext HDRs | Payload | ICRC | VCRC
Size | 8B | 40B | 12B | var | 0..4096B | 4B | 2B

- LRH fields: VL, LVer, SL, rsvd, LNH, DLID, rsvd, Len, SLID
- GRH (Optional) fields: IPVer, TClass, Flow Label, Payload Len, Next Header, Hop Lim, SGID[127:0], DGID[127:0]
- BTH fields: Opcode, S, M, Pad, TVer, Partition Key, rsvd, Destination QP, A, rsvd, PSN

Extended headers:

- Reliable Datagram ETH (4B)
- Datagram ETH (8B)
- RDMA ETH (16B)
- Atomic ETH (28B)
- ACK ETH (4B)
- Atomic ACK ETH (8B)
- Immediate Data ETH (4B)
- Invalidate ETH (4B)
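The field sizes above make it easy to estimate per-packet overhead. A minimal sketch, assuming the sizes listed on this slide; the function names are illustrative, and the calculation ignores link-level framing symbols and flow control packets:

```python
# Sizes in bytes, taken from the packet layout on this slide.
LRH_BYTES = 8
GRH_BYTES = 40     # optional, present when a global route header is used
BTH_BYTES = 12
ICRC_BYTES = 4
VCRC_BYTES = 2
MAX_PAYLOAD = 4096


def packet_bytes(payload: int, grh: bool = False, ext_headers: int = 0) -> int:
    """Total packet size for a given payload and header mix."""
    if not 0 <= payload <= MAX_PAYLOAD:
        raise ValueError("payload must be between 0 and 4096 bytes")
    total = LRH_BYTES + BTH_BYTES + ext_headers + payload + ICRC_BYTES + VCRC_BYTES
    if grh:
        total += GRH_BYTES
    return total


def efficiency(payload: int, grh: bool = False, ext_headers: int = 0) -> float:
    """Fraction of the packet that is payload."""
    return payload / packet_bytes(payload, grh=grh, ext_headers=ext_headers)


if __name__ == "__main__":
    for size in (256, 2048, 4096):
        local = efficiency(size)
        routed = efficiency(size, grh=True)
        print(f"{size:5d}B payload: {local:.3f} local, {routed:.3f} with GRH")
    # First packet of an RDMA Write carries a 16-byte RDMA ETH.
    print(packet_bytes(4096, ext_headers=16))
```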

Management Model

- Subnet Manager (SM)
  - Configures/Administers fabric topology
  - Implemented at an end-node or a switch
  - Active/Passive model when more than one SM is present
  - Talks with SM Agents in nodes/switches
- Subnet Administration
  - Provides path records
  - QoS management
- Communication Management
  - Connection establishment processing

[Figure: management interfaces. Subnet Management Interface on QP0 (uses VL15): Subnet Manager and Subnet Mgt Agent. General Service Interface on QP1: Subnet Administration, Baseboard Mgt Agent, Communication Mgt Agent, Performance Mgt Agent, Device Mgt Agent, Vendor-Specific Agent, Application-Specific Agent, SNMP Tunneling Agent]

Block Storage Data Transfer Summary

Operation | SRP | iSER | iSCSI | FCP
Request | SRP_CMD (SEND) | SCSI-Command (SEND) | SCSI-Command | FCP_CMND
Response | SRP_RSP (SEND) | SCSI-Response (SEND) | SCSI-Response (or piggybacked on Data-In PDU) | FCP_RSP
Data-In Delivery | RDMA Write | RDMA Write | Data-In | FCP_DATA
Data-Out Delivery | RDMA Read, RDMA Read Resp. | RDMA Read, RDMA Read Resp. | R2T, Data-Out | FCP_XFER_RDY, FCP_DATA
Unsolicited Data-Out Delivery | - | Part of SCSI-Command (SEND), Data-Out (SEND) | Part of SCSI-Command, Data-Out | FCP_DATA
Task Management | SRP_TSK_MGMT (SEND) | Task Management Function Request/Response (SEND) | Task Management Function Request/Response | FCP_CMND

Glossary

APM - Automatic Path Migration
BECN - Backward Explicit Congestion Notification
BTH - Base Transport Header
CFM - Configuration Manager
CQ - Completion Queue
CQE - Completion Queue Element
CRC - Cyclic Redundancy Check
DDR - Double Data Rate
DIF - Data Integrity Field
FC - Fibre Channel
FECN - Forward Explicit Congestion Notification
GbE - Gigabit Ethernet
GID - Global IDentifier
GRH - Global Routing Header
GUID - Globally Unique IDentifier
HCA - Host Channel Adapter
IB - InfiniBand
IBTA - InfiniBand Trade Association
ICRC - Invariant CRC
IPoIB - Internet Protocol Over InfiniBand
IPv6 - Internet Protocol Version 6
iSER - iSCSI Extensions for RDMA
LID - Local IDentifier
LMC - Link Mask Control
LRH - Local Routing Header
LUN - Logical Unit Number
MPI - Message Passing Interface
MR - Memory Region
NFSoRDMA - NFS over RDMA
OSD - Object based Storage Device
OS - Operating System
PCIe - PCI Express
PD - Protection Domain
QDR - Quadruple Data Rate
QoS - Quality of Service
QP - Queue Pair
RDMA - Remote DMA
RDS - Reliable Datagram Socket
RPC - Remote Procedure Call
SAN - Storage Area Network
SDP - Sockets Direct Protocol
SDR - Single Data Rate
SL - Service Level
SM - Subnet Manager
SRP - SCSI RDMA Protocol
TCA - Target Channel Adapter
ULP - Upper Layer Protocol
VCRC - Variant CRC
VL - Virtual Lane
WQE - Work Queue Element
WRR - Weighted Round Robin