
Single-Stage Fabric & Lossless Ethernet for Network Consolidation

Nick Ilyadis

VP and CTO

Infrastructure and Networking Group


About Broadcom

• Global leader in semiconductors for wired and wireless communications

• Founded 1991

• Fortune 500 company – Ranked most innovative semiconductor company

• 2010 Gartner Top 10 Semiconductor Companies (Revenue)

– 2010 net revenue of $6.82 billion

• One of the largest volume fabless semiconductor suppliers

• Broad IP portfolio with over 15,000 U.S. and foreign patents and applications

– Strongest patent portfolio among fabless semiconductor companies (IEEE)

• Approximately 9,052 employees worldwide

Driving Convergence Through Unique & Innovative Product Offerings


Broadcom Data Center Solutions

Broadcom delivers complete data center network solutions for virtualization and convergence:

• Highly Scalable Architectures
• Comprehensive Virtualization
• Converged I/O: FCoE & iSCSI
• Lossless DCB Networking
• End-End Low Latency & QoS
• High Density PHY/SerDes
• Low Power - Energy Efficient
• Reliable & Secure Transport

[Figure: Broadcom-powered data center. Servers attach to access switches (top of rack, blade), aggregation switches (end of row), and core switches, which carry the data network and the storage network.]


Broadcom Observed Network Challenges Building Large Distributed Computing Systems

• Scalability
– Traditional network architectures won’t scale efficiently
– Minimal technology layers are desired (management)
– Lifecycle is defined by own needs (typically network capacity) vs. a traditional Intel CPU cycle
– Maintaining security and isolation at scale

• Performance
– Congestion
– Throughput
– Low latency
– High availability / reliability

• Cost
– OPEX and CAPEX: extreme efficiency and maximum utilization
– Leverage open industry-standard management interfaces
– Leverage silicon integration trends in networking
– Convergence into a single networking technology

• Power
– Every watt is looked after; at scale it adds up fast
– Energy usage tied to utilization vs. always on


Data Center Network Evolution

Designed for north-south traffic and N-tiered data center
• Physical Static Servers
• Tiers of Servers
• Single Tenant Data Center

Modular, horizontal scaling, full cross-sectional bandwidth with no locality for apps or storage
• Virtual Machines and Mobility
• Any Application on Any Server
• Multi-tenant Data Centers in Clouds

[Figure: three-level topology with nodes C0-0 to C0-31, B0-0 to B0-63, and A-0 to A-2047, interconnected by 32x10GE link bundles.]


Flat vs. Single Stage Fabrics

• Flat fabrics utilize independent switching elements
– Each element is individually managed
– Cross sectional bandwidth is achieved through multi-pathing to next stage

• Single stage fabrics act as a single forwarding domain
– Each element is a component of the whole fabric – single management domain
– Cross sectional bandwidth is achieved through multi-pathing to next stage


One-Tier Networking Architecture Concept

Entire Network Behaves Like A Single L2 Non-blocking Switch (1-hop)

• One large, modular, non-blocking switch
• Non-blocking and fair any-to-any connectivity
• Modular reliability and scalability
• Linear cost & power scaling (pay as you grow)
• Single packet processor and TM domain


Flat vs. Single-Stage Fabric for Data Center

[Figure: 10GE servers (rack servers and blade servers hosting VMs) attached to an Ethernet fabric of top-of-rack switches and core chassis. In one panel every top-of-rack switch and core chassis is labeled "Independent Forwarding"; in the other panel the whole fabric is labeled "Single Forwarding Plane".]


Myth Buster

• Flat or Single-Stage fabrics can have multiple “levels” of switching elements

• The multiple levels allow higher radix of access ports or port span

• The important attribute is that high cross-sectional bandwidth is maintained

• So we end up with “Multi-stage” single stage or flat fabrics

• Confused yet?


Packet vs. Cell Fabrics

• Packet Fabrics
– Ingress/Egress ports are Packet-based
– Fabric interconnect is Ethernet, or Ethernet with Fabric Header
– Multi-path through SPB, TRILL or Fabric Header
– Links are load-balanced through some traffic spreading algorithm

• Cell Fabrics
– Ingress/Egress ports are Packet-based
– Fabric interconnect is Cell-based
– Traffic is spread across links in a round-robin fashion or through dynamic load balancing
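
The difference between the two spreading styles can be illustrated with a small sketch. It is not taken from the slides: the CRC-based flow hash, the 64-byte cell size, and the function names are assumptions chosen only for illustration. It compares the per-link byte load when every packet of a flow follows one hashed link (packet fabric) with the load when packets are cut into cells that are sprayed round-robin (cell fabric).

```python
import zlib

def hash_spread(packets, num_links):
    """Packet-fabric style: all packets of a flow follow one hashed link."""
    load = [0] * num_links
    for flow_id, length in packets:
        link = zlib.crc32(flow_id.encode()) % num_links  # assumed flow hash
        load[link] += length
    return load

def cell_round_robin(packets, num_links, cell_size=64):
    """Cell-fabric style: each packet is cut into cells sprayed round-robin."""
    load = [0] * num_links
    next_link = 0
    for _flow_id, length in packets:
        remaining = length
        while remaining > 0:
            chunk = min(cell_size, remaining)  # last cell may be short
            load[next_link] += chunk
            next_link = (next_link + 1) % num_links
            remaining -= chunk
    return load

# Two flows only: one of large packets, one of small packets.
packets = [("flow-a", 1500)] * 40 + [("flow-b", 64)] * 40
hashed = hash_spread(packets, num_links=4)
sprayed = cell_round_robin(packets, num_links=4)
print("hash per-link bytes:       ", hashed)
print("round-robin per-link bytes:", sprayed)
```

With only two flows, the hash can use at most two of the four links, while the cell spray touches all of them. The price of spraying is that cells of one packet arrive over different links and must be reassembled at egress.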


Flat Fabric Interconnects

• These are packet-based

• TRILL = TRansparent Interconnection of Lots of Links
– An RBridge encapsulates Layer 2 traffic in a Layer 3 packet
– RBridge acts like a Layer 2 bridge on the ingress and egress
– Inside the fabric, traffic is forwarded based on the Layer 3 header
– Layer 3 multi-pathing is used to spread traffic

• PBB = Provider Backbone Bridging
– MAC-in-MAC encapsulation allows Layer 2 traffic to be transported on a Layer 2 multipath interconnect
– PBB traffic load balancing is pre-provisioned


Single Stage Fabric Interconnects

• Packet-Based with Fabric Header

– Ethernet packet is pre-pended with Fabric Header

– Packets get inspected only once as they ingress edge switch

– Egress processing and path selection is included in fabric header

– Fabric header is used for all internal fabric forwarding

– Load balancing is performed via intra-fabric congestion notification

– On egress fabric header is stripped and egress processing is performed

– Broadcom HiGig is an example of this type of interconnect

[Figure: frame layout HiGig Header | Payload Packet | FCS. The FCS covers the Ethernet frame plus the HiGig header.]
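
A minimal sketch of the encapsulation idea follows. The actual HiGig header layout is not given in the slides, so the 8-byte header below (destination module, destination port, traffic class, path id, reserved) is purely hypothetical, and `zlib.crc32` stands in for the FCS computation. The point it illustrates is only that the header is prepended at ingress, the FCS is computed over header plus frame, and the header is stripped at egress.

```python
import struct
import zlib

# Hypothetical fabric header, NOT the real HiGig format:
# dst_module (2B), dst_port (2B), traffic_class (1B), path_id (1B), reserved (2B)
FABRIC_HDR = struct.Struct("!HHBBH")

def encapsulate(frame: bytes, dst_module: int, dst_port: int,
                traffic_class: int, path_id: int) -> bytes:
    """Ingress edge: prepend the fabric header, then append an FCS that
    covers both the header and the Ethernet frame."""
    header = FABRIC_HDR.pack(dst_module, dst_port, traffic_class, path_id, 0)
    body = header + frame
    return body + struct.pack("!I", zlib.crc32(body))

def decapsulate(fabric_frame: bytes):
    """Egress edge: verify the FCS, strip the header, return (fields, frame)."""
    body, fcs_bytes = fabric_frame[:-4], fabric_frame[-4:]
    (fcs,) = struct.unpack("!I", fcs_bytes)
    if zlib.crc32(body) != fcs:
        raise ValueError("FCS mismatch")
    fields = FABRIC_HDR.unpack(body[:FABRIC_HDR.size])
    return fields, body[FABRIC_HDR.size:]

wire = encapsulate(b"\x00" * 60, dst_module=3, dst_port=17,
                   traffic_class=5, path_id=2)
fields, frame = decapsulate(wire)
```

Because the forwarding decision travels in the header, intermediate fabric elements in this sketch would only need to read `dst_module` and `path_id`, which mirrors the "inspected only once" bullet above.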


Single Stage Fabric Interconnects

• Cell-Based

– Each cell is a portion of the original packet

– Cell is pre-pended with Fabric Header

– Packets get inspected only once as they ingress edge switch

– Egress processing and path selection is included in First Cell header

– Cell header is used for all internal fabric forwarding

– Load balancing is performed via intra-fabric congestion notification

– On egress, cells are reassembled and egress processing is performed

– Broadcom Dune Fabric is cell-based

[Figure: the overall payload packet is split into cells: Cell Hdr-1 | Payload Packet-C1 | FCS, Cell Hdr-2 | Payload Packet-C2 | FCS, ..., Cell Hdr-n | Payload Packet-Cn | FCS.]
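
The segmentation and reassembly steps can be sketched as below. The cell header fields (`packet_id`, `seq`, `last`) and the 64-byte cell payload are assumptions for illustration; the slides do not describe the Dune cell format, and the per-cell FCS is omitted here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    packet_id: int   # hypothetical header field: which packet the cell belongs to
    seq: int         # hypothetical header field: position within the packet
    last: bool       # hypothetical header field: marks the final cell
    payload: bytes

def segment(packet_id: int, packet: bytes, cell_payload: int = 64):
    """Ingress: cut a packet into cells, each carrying a small header."""
    chunks = [packet[i:i + cell_payload]
              for i in range(0, len(packet), cell_payload)] or [b""]
    return [Cell(packet_id, seq, seq == len(chunks) - 1, chunk)
            for seq, chunk in enumerate(chunks)]

def reassemble(cells) -> bytes:
    """Egress: put cells back in sequence order and rebuild the packet."""
    ordered = sorted(cells, key=lambda c: c.seq)
    complete = (ordered
                and ordered[-1].last
                and [c.seq for c in ordered] == list(range(len(ordered))))
    if not complete:
        raise ValueError("missing or duplicated cells")
    return b"".join(c.payload for c in ordered)

packet = bytes(range(200))
cells = segment(7, packet)
rebuilt = reassemble(reversed(cells))   # arrival order does not matter
```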


Broadcom Switching Products for DC Fabrics

Massive Scaling with Non-Blocking Performance & Low Latency
• Highest 10/40/100GbE port density per chip
• Standalone and chassis solutions

Service-Oriented, Deterministic Performance
• Service level-aware packet buffer and load handling
• Fabric-wide elimination of congestion and HOL blocking

Flat 1-Tier Topologies for Scaling Simplicity
• Multipathing, CLOS-based any-to-any connectivity
• Manageable as a single logical switching device

[Figure: rack servers, top-of-rack switch, aggregation switch.]

Complete End-to-End Data Center Switch Portfolio with StrataXGS® & DUNE


Desirable Fabric Features

• Scalable total capacity
– 100G to 200Tbps+
– In-service & non-traffic affecting upgrade capable

• Scalable port rate
– 10G, 40G, 100G
– Must be clear channel (i.e., no hashing schemes allowed)
– Future consideration 400G & 1Tbps

• Strictly Non-Blocking & Low-Delay
– Work conserving operation under any traffic pattern
– Full bisectional bandwidth
  • Bus, Tree, Hypercube, Torus, Butterfly interconnects don’t fit
  • CLOS, Fat-Trees, full-mesh interconnects may fit

• Resiliency
– Fast recovery from link and/or device failure(s)
– Ideally automatic HW based failover

• Backward & Forward Compatibility
– At least one generation of investment protection

[Figure: folded CLOS and fat tree topologies built from 2x2 and 4x4 elements.]
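
As a rough worked example of the full-bisection requirement, the sketch below sizes a two-tier folded CLOS built from identical switches. The construction (half of each leaf's ports facing hosts, half facing spines, one uplink from every leaf to every spine) is the textbook arrangement and an assumption here; the slides do not prescribe it, and the function and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FoldedClos:
    leaves: int
    spines: int
    host_ports: int
    bisection_gbps: float

def size_two_tier(radix: int, port_gbps: float = 10.0) -> FoldedClos:
    """Size a non-oversubscribed two-tier folded CLOS from radix-port switches."""
    if radix < 2 or radix % 2:
        raise ValueError("radix must be a positive even number")
    down = up = radix // 2          # equal down/up capacity per leaf
    spines = up                     # one uplink from each leaf to each spine
    leaves = radix                  # each spine port feeds a different leaf
    host_ports = leaves * down
    # Full bisection: half of the hosts can send to the other half at line rate.
    bisection = host_ports * port_gbps / 2
    return FoldedClos(leaves, spines, host_ports, bisection)

fabric = size_two_tier(radix=64)
print(fabric)
```

For 64-port elements this gives 64 leaves, 32 spines, and 2048 host ports, which happens to match the node ranges in the Data Center Network Evolution figure (B0-0 to B0-63, C0-0 to C0-31, A-0 to A-2047).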


Fabric Features Continued

• Fabric is used to carry asynchronous traffic

• Asynchronous traffic → possible congestion → buffering

• Buffering → Traffic Management

• Traffic Management is a System-Level problem

– Distributed across 100’s of devices

• Traffic Management means

– Guarantee minimum rates per interface – Committed Information Rate (CIR)

– Distribution of leftover bandwidth per interface – Excess Information Rate (EIR)

– Deep buffering – at ultra high rate, sometimes must be in DRAM

– Smart buffering and end to end flow control

– Flow control – to enable lossless operation

• Link-Level – 802.3x

• Link-priority-level – 802.1Qbb

• Host-level – 802.1Qau

• Flow-level – 802.1Qau

• Traffic Management is an integral part of the fabric solution
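
The CIR/EIR split above can be illustrated with a small allocator. This is a sketch under assumptions, not a description of any Broadcom scheduler: each interface first receives its demand up to its CIR, and the leftover bandwidth is then shared in proportion to a hypothetical per-interface EIR weight, redistributing whatever a satisfied interface does not need.

```python
def allocate(link_gbps, demand, cir, eir_weight):
    """Grant CIR first, then share leftover bandwidth by EIR weight."""
    grant = {name: min(demand[name], cir[name]) for name in demand}
    if sum(grant.values()) > link_gbps:
        raise ValueError("committed rates exceed link capacity")
    leftover = link_gbps - sum(grant.values())
    active = {n for n in demand if demand[n] > grant[n] and eir_weight[n] > 0}
    while leftover > 1e-9 and active:
        total_weight = sum(eir_weight[n] for n in active)
        spent = 0.0
        for name in list(active):
            share = leftover * eir_weight[name] / total_weight
            take = min(share, demand[name] - grant[name])
            grant[name] += take
            spent += take
            if demand[name] - grant[name] <= 1e-9:
                active.discard(name)   # satisfied; its unused share is recycled
        leftover -= spent
    return grant

grant = allocate(
    link_gbps=10.0,
    demand={"A": 8.0, "B": 2.5, "C": 6.0},
    cir={"A": 2.0, "B": 2.0, "C": 2.0},
    eir_weight={"A": 1.0, "B": 1.0, "C": 2.0},
)
print(grant)
```

In this example B needs only 0.5 Gbps beyond its CIR, so the part of its excess share that it cannot use flows back to A and C in a second round.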


Broadcom HiGig Packet Fabric

• Multiple switches act as ONE logical switch
– Ease of management as one entity
– Logical extension of ports

• Intelligent congestion avoidance
– Intelligent path selection
– Destination module VoQ

• Trunking / teaming of ports
– Multiple ports on different systems function as one logical port

• Increased throughput per port
– Over-clocking of HiGig ports up to 20G+ per port

• Faster response times to flow control
– Pre-emptive Service Aware Flow Control (SAFC)
– Less buffering required for lossless behavior

[Figure: VM 1 to VM n behind a vSwitch attach to ToR and EoR switches that form one HiGig domain, acting as one logical switch.]

HiGig Widely Deployed in the Industry with Over 100 Million Ports Shipped


SAND for Scalable Switching

• Small systems (fabric-less, 240 Gbps)
• Medium systems (single stage, 10 Tbps)
• Large systems (multistage, 100 Tbps)

[Figure: SAND system configurations at the three scales. Labeled components: line cards 1 to n (320G each) with P230 or P330 devices, packet processors (PP), 10G PHYs, XAUI links, packet SDRAM, and link QDR memory; switching cards 1 to 16 with FE600 devices; fabric planes 1 to k with FE13 and FE2 devices.]


SAND Modular Chassis Configuration

[Figure: modular chassis populated with line cards.]


Dynamic Routing Fabric Technology

[Figure: two CLOS networks with elements numbered 1 to 9, one labeled "Dynamic Routing (load-based load balancing)" and the other "Static Routing (hash-based load balancing)".]

• What is Clos?
– A multistage interconnect
– At every stage, each element is connected to all elements of the next stage

• Characteristics
– Multiple routes from input to middle stage
– Single route to destination from middle stage to output
– Recursively scalable
– Re-arrangeably non-blocking

• All CLOS are not created equal
– Static Routing vs. Dynamic Routing
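
The path structure described above can be made concrete with a short sketch. It models an idealized three-stage Clos in which every ingress element links to every middle element and every middle element has one link to each egress element; the function names and the least-loaded selection rule are assumptions used to illustrate load-based (dynamic) routing, not the SAND algorithm.

```python
def clos_paths(num_middle: int, ingress: int, egress: int):
    """All routes between one ingress and one egress element.
    The only choice is the middle element; from there the route is fixed."""
    return [(ingress, middle, egress) for middle in range(num_middle)]

def pick_least_loaded(middle_load):
    """Dynamic routing: choose the middle element carrying the least load."""
    return min(range(len(middle_load)), key=middle_load.__getitem__)

def spray_cells(num_cells: int, num_middle: int):
    """Send equal-sized cells one by one, always via the least-loaded middle."""
    load = [0] * num_middle
    for _ in range(num_cells):
        load[pick_least_loaded(load)] += 1
    return load

paths = clos_paths(num_middle=4, ingress=0, egress=2)
load = spray_cells(num_cells=90, num_middle=3)
print(paths)
print(load)
```

With equal-sized cells the load-based choice keeps the middle stage evenly filled regardless of which flows the cells belong to, whereas a static per-flow hash can only be as balanced as the flow mix allows.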


CLOS Fabric Technology Advantage

• CLOS network becomes strictly non-blocking
• Optimal utilization of fabric capacity regardless of traffic pattern
• Fabric scalability
• Pipe/Port rate scalability
• Elegant fault tolerance scheme
• Key challenge is cell reordering
– Scalable
– Reliable
– Low Cost

SAND CLOS Implementation

[Figure: Petra devices #1 to #n connect to n x n fabric planes #1 to #k built from FE600 devices; a multistage variant uses K x K groups #1 to #n and m x m planes #1 to #t. K fabric links per Petra; n <= 96; MS fabric: m > 96.]
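
Since cells sprayed over different fabric links can arrive out of order, the egress side needs some form of resequencing. The sketch below shows one simple way to do it, assuming each cell carries a per-source sequence number; the class and its interface are hypothetical and say nothing about how SAND devices actually solve the problem.

```python
class Resequencer:
    """Hold out-of-order cells and release them strictly in sequence."""

    def __init__(self):
        self.expected = 0     # next sequence number to release
        self.pending = {}     # seq -> cell, for cells that arrived early

    def push(self, seq: int, cell):
        """Accept one cell; return the (possibly empty) list of cells
        that can now be delivered in order."""
        self.pending[seq] = cell
        released = []
        while self.expected in self.pending:
            released.append(self.pending.pop(self.expected))
            self.expected += 1
        return released

rs = Resequencer()
out = []
for seq, cell in [(1, "b"), (0, "a"), (3, "d"), (2, "c")]:
    out.append(rs.push(seq, cell))
print(out)
```

A real implementation would also bound the buffer and time out on lost cells, which is where the scalability, reliability, and cost concerns listed above come in.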


USE CASES WITH SANs


Start from an Oversubscribed Usage

[Figure: servers connect to access switches, which connect to an aggregation switch over horizontal cables (uplinks).]


Pay As You Go to Non-Blocking

Distributed Aggregation


Absorb a Converged Network Over Distributed Aggregation

• Identical gear for network/storage spines; could decide at deployment time which is which

• Leaves come in compute/storage flavors, with identical topology attributes and hops

[Figure: storage leaf, converged storage spine.]


If You Have an FC SAN to Grandfather

• Two options (only #2 is shown)

– Branch out at every access switch towards existing SAN

– Leave topology alone and branch out from SAN gateway leaf(s)

• Can still provision dedicated storage spines

[Figure: FC or FCoE spigots to SAN.]


Congestion Notification Concept

• Congestion Point (CP)
– Port/Priority Queue supporting CN
– Detect congestion, and generate CNM back to traffic source

• Reaction Point (RP)
– Traffic Source (e.g. Server, NIC) supporting CN
– Terminate CNM, extract flow-ID and reduce its rate according to congestion level (Feedback)
– Includes also CP functionality

• Congestion Notification Message (CNM)
– Congestion indication packet sent from CP to RP

[Figure: reaction points A and B send data toward C (Data A=>C, Data B=>C). Congestion builds at the CP queue on the path to C, and CNMs are sent back to A and to B.]

[Figure: Reaction Point functionality. An incoming CNM drives "Extract Flow & Update Rate", which adjusts the per-flow rate limiters (Flow 1, Flow 2, ..., Flow N) feeding Data Out.]


Congestion Notification

• Objectives
– Avoid packet loss and reduce network latency

• Principles
– Detect congestion points in the network (per port and priority)
  • Monitor instantaneous CP queue size, and change in CP queue size
– Signal congestion back to traffic sources (RP) using CNM packets
  • Rate of CNM packets is limited: define maximum sampling rate
– Traffic sources reduce rate of flows targeted to the same congestion point
  • Rate is increased slowly in the absence of CNM packets

• Defined for Layer 2 networks
– CNM packets are sent back to traffic source, using MAC-SA from sampled packet

• Applied per priority
– CNM are sent only on traffic in a CN priority

• Can be used in combination with PFC
– PFC prevents drops during the onset of congestion
– CN slows sources of congestion to limit the duration of congestion spreading
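
The reaction-point behavior described above (cut the rate when a CNM arrives, then recover slowly while no CNMs are seen) can be sketched as follows. This is a simplified illustration, not the full standardized rate-control state machine: the gain of 1/128 per feedback unit (so that the largest 6-bit feedback value roughly halves the rate), the halfway recovery step, and all names are assumptions.

```python
class ReactionPoint:
    """Per-flow rate limiter reacting to congestion notification messages."""

    def __init__(self, line_rate_gbps: float, gain: float = 1 / 128,
                 min_rate_gbps: float = 0.01):
        self.line_rate = line_rate_gbps
        self.gain = gain                  # assumed decrease gain per feedback unit
        self.min_rate = min_rate_gbps
        self.current = line_rate_gbps     # rate currently enforced
        self.target = line_rate_gbps      # rate to recover toward

    def on_cnm(self, quantized_fb: int) -> None:
        """Reduce the rate in proportion to the quantized feedback (0..63)."""
        if not 0 <= quantized_fb <= 63:
            raise ValueError("quantized feedback is a 6-bit value")
        self.target = self.current        # remember the rate before the cut
        reduced = self.current * (1 - self.gain * quantized_fb)
        self.current = max(self.min_rate, reduced)

    def on_quiet_period(self) -> None:
        """No CNM seen for a while: move halfway back toward the target."""
        self.current = min(self.line_rate, (self.current + self.target) / 2)

rp = ReactionPoint(line_rate_gbps=10.0)
rp.on_cnm(63)                 # severe congestion
after_cut = rp.current
rp.on_quiet_period()
after_recovery = rp.current
```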


Congestion Point Functionality

• Qeq (Qsp) – Queue Set Point: Target number of octets in the CP queue

• Qoffset – Queue length offset from Queue Set Point

• Fb – Feedback: The congestion level (more negative => higher congestion)

• Delta – Queue length change in last sampling period

• SampleBase – Min number of octets to enqueue to CP-Q between two generated CNM packets

• Enqueued – Number of Octets remaining to be enqueued to CP-Q before next CNM can be generated

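[Flowchart: CP processing from Packet In to Packet Out. The CP queue is drawn with the Qeq and Queue Full levels marked. Boxes, in the order they were extracted:
– Qoffset = Qeq - Qlen
– Qdelta = Qlen - Qold
– Fb = Qoffset - W*Qdelta
– Enqueued -= packet-len
– Decision: Enqueued == 0?
– Qold = Qlen
– Enqueued = SampleBase(Fb), with a +/- 15% random factor added
– Decision: Fb < 0? If yes, send CNM]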


Congestion Notification Message Format

• CNM packet is sent back to source of sampled packet
– CNM MAC-DA <= sampled packet MAC-SA

• RP extracts Flow-ID (in CN-TAG)
– Which flow/s require rate adjustment

• RP extracts QntzFb
– Defines the congestion severity

Sampled packet:
– DA (6B)
– SA (6B)
– VLAN tag (optional, 4B)
– CN-TAG with flow-ID (4B)
– Payload

Generated CNM packet:
– DA (6B): copied
– SA (6B): configured
– VLAN tag (optional, 4B): generated
– CN-TAG with flow-ID (4B): copied
– CNM Eth-type (2B): configured
– CNM PDU (24B): generated
– Sampled packet payload (optional, up to 64 bytes): copied

CNM PDU (24B):
– Version (4 bits): configured
– Reserved (6 bits): configured
– QntzFb (6 bits): generated
– CPID (8B): generated
– Qoffset (2B): generated
– Qdelta (2B): generated
– Sampled packet priority (2B): copied
– Sampled packet MAC-DA (6B): copied
– Encapsulated packet length (2B): configured
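
The 24-byte CNM PDU can be packed and parsed with a few lines of code. This sketch follows the field order and sizes listed for the PDU (2 bytes of version/reserved/QntzFb, 8-byte CPID, 2-byte Qoffset, 2-byte Qdelta, 2-byte priority, 6-byte MAC-DA, 2-byte length); treating Qoffset and Qdelta as signed 16-bit values and the function names are assumptions.

```python
import struct

# first 16 bits | CPID | Qoffset | Qdelta | priority | MAC-DA | length
CNM_PDU = struct.Struct("!H8shhH6sH")

def pack_cnm_pdu(version: int, quantized_fb: int, cpid: bytes,
                 q_offset: int, q_delta: int, priority: int,
                 mac_da: bytes, pkt_len: int) -> bytes:
    if len(cpid) != 8 or len(mac_da) != 6:
        raise ValueError("CPID must be 8 bytes and MAC-DA 6 bytes")
    if not 0 <= quantized_fb <= 0x3F or not 0 <= version <= 0xF:
        raise ValueError("version is 4 bits and QntzFb is 6 bits")
    # Version in the top 4 bits, 6 reserved zero bits, QntzFb in the low 6 bits.
    first16 = (version << 12) | quantized_fb
    return CNM_PDU.pack(first16, cpid, q_offset, q_delta,
                        priority, mac_da, pkt_len)

def unpack_cnm_pdu(pdu: bytes) -> dict:
    first16, cpid, q_offset, q_delta, priority, mac_da, pkt_len = \
        CNM_PDU.unpack(pdu)
    return {
        "version": first16 >> 12,
        "quantized_fb": first16 & 0x3F,
        "cpid": cpid,
        "q_offset": q_offset,
        "q_delta": q_delta,
        "priority": priority,
        "mac_da": mac_da,
        "pkt_len": pkt_len,
    }

pdu = pack_cnm_pdu(version=0, quantized_fb=42, cpid=bytes(range(8)),
                   q_offset=-500, q_delta=1500, priority=3,
                   mac_da=bytes.fromhex("0242ac110002"), pkt_len=64)
fields = unpack_cnm_pdu(pdu)
```

The field sizes sum to 2 + 8 + 2 + 2 + 2 + 6 + 2 = 24 bytes, which matches the CNM PDU length shown for the generated packet.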


Conclusion

• Flat or single-stage fabrics are the enablers for next generation data centers

• Compute and storage can co-exist on these fabrics

• Single-stage fabrics are optimal for supporting FCoE, as congestion can be managed most effectively
