Single-Stage Fabric &
Lossless Ethernet for
Network Consolidation
Nick Ilyadis
VP and CTO
Infrastructure and Networking Group
About Broadcom
• Global leader in semiconductors for wired and wireless communications
• Founded 1991
• Fortune 500 company – ranked most innovative semiconductor company
• 2010 Gartner Top 10 Semiconductor Companies (Revenue)
– 2010 net revenue of $6.82 billion
• One of the largest volume fabless semiconductor suppliers
• Broad IP portfolio with over 15,000 U.S. and foreign patents and applications
– Strongest patent portfolio among fabless semiconductor companies (IEEE)
• Approximately 9,052 employees worldwide
Driving Convergence Through Unique & Innovative Product Offerings
Broadcom Data Center Solutions
• Broadcom delivers complete data center network solutions for virtualization and convergence:
  – Highly Scalable Architectures
  – Comprehensive Virtualization
  – Converged I/O: FCoE & iSCSI
  – Lossless DCB Networking
  – End-End Low Latency & QoS
  – High Density PHY/SerDes
  – Low Power - Energy Efficient
  – Reliable & Secure Transport
[Figure: a Broadcom-powered data center – core switches, aggregation switches (end of row), access switches (top of rack and blade), and servers attached to both the data network and the storage network]
Broadcom Observed Network Challenges Building Large Distributed Computing Systems
• Scalability – traditional network architectures won't scale efficiently
  – Minimal technology layers are desired – management
  – Lifecycle is defined by own needs (typically network capacity) vs. a traditional Intel CPU cycle
  – Maintaining security and isolation at scale
• Performance – congestion
  – Throughput
  – Low latency
  – High availability / reliability
• Cost – OPEX and CAPEX – extreme efficiency and maximum utilization
  – Leverage open industry-standard management interfaces
  – Leverage silicon integration trends in networking
  – Convergence onto a single networking technology
• Power – every watt is scrutinized – at scale it adds up fast
  – Energy usage tied to utilization vs. always-on
Data Center Network Evolution
From: designed for north-south traffic and an N-tiered data center
– Physical, static servers
– Tiers of servers
– Single-tenant data center
To: modular, horizontal scaling with full cross-sectional bandwidth and no locality for apps or storage
– Virtual machines and mobility
– Any application on any server
– Multi-tenant data centers in clouds
[Figure: the evolution from tiered topologies to a flat fabric of 32x10GE elements – access nodes A-0…A-2047 connected through B and C stage switches]
Flat vs. Single Stage Fabrics
• Flat fabrics utilize independent switching elements
  – Each element is individually managed
  – Cross-sectional bandwidth is achieved through multi-pathing to the next stage
• Single-stage fabrics act as a single forwarding domain
  – Each element is a component of the whole fabric – a single management domain
  – Cross-sectional bandwidth is achieved through multi-pathing to the next stage
One-Tier Networking Architecture Concept
Entire network behaves like a single L2 non-blocking switch (1-hop): one large, modular, non-blocking switch.
• Non-blocking and fair any-to-any connectivity
• Modular reliability and scalability
• Linear cost & power scaling (pay as you grow)
• Single packet processor and traffic management (TM) domain
Flat vs. Single-Stage Fabric for Data Center
[Figure: two designs for 10GE servers side by side – on the left, rack servers, blade servers, top-of-rack switches, and core chassis, each an independent forwarding domain; on the right, an Ethernet fabric presenting a single forwarding plane to all VMs and servers]
Myth Buster
• Flat or Single-Stage fabrics can have multiple “levels” of switching elements
• The multiple levels allow higher radix of access ports or port span
• The important attribute is that high cross-sectional bandwidth is maintained
• So we end up with “Multi-stage” single stage or flat fabrics
• Confused yet?
Packet vs. Cell Fabrics
• Packet fabrics
  – Ingress/egress ports are packet-based
  – The fabric interconnect is Ethernet, or Ethernet with a fabric header
  – Multi-path through SPB, TRILL, or a fabric header
  – Links are load-balanced through some traffic-spreading algorithm
• Cell fabrics (see the sketch after this list)
  – Ingress/egress ports are packet-based
  – The fabric interconnect is cell-based
  – Traffic is spread across links round-robin or through dynamic load balancing
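To make the contrast concrete, a minimal sketch (hypothetical link count and flow tuple, not any particular Broadcom implementation): a packet fabric pins each flow to one link via a header hash, while a cell fabric sprays consecutive cells round-robin.

```python
import hashlib
from itertools import cycle

NUM_FABRIC_LINKS = 4  # hypothetical uplink count

def packet_fabric_link(flow_tuple):
    """Packet fabric: hash the flow identity so every packet of a flow
    takes the same link (order preserved, but large flows can collide)."""
    digest = hashlib.sha256(repr(flow_tuple).encode()).digest()
    return digest[0] % NUM_FABRIC_LINKS

# Cell fabric: spray cells over all links regardless of flow.
_rr = cycle(range(NUM_FABRIC_LINKS))
def cell_fabric_link(_cell):
    return next(_rr)

flow = ("10.0.0.1", "10.0.0.2", 6, 49152, 3260)  # made-up 5-tuple
print([packet_fabric_link(flow) for _ in range(4)])  # same link 4x
print([cell_fabric_link(c) for c in range(4)])       # links 0,1,2,3
```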
Flat Fabric Interconnects
• These are packet-based
• TRILL = TRansparent Interconnection of Lots of Links
  – Encapsulates Layer 2 traffic in a Layer 3 packet in an RBridge
  – The RBridge acts like a Layer 2 bridge on ingress and egress
  – Internal to the fabric, traffic is forwarded based on the Layer 3 header
  – Layer 3 multi-pathing is used to spread traffic
• PBB – Provider Backbone Bridging
  – MAC-in-MAC encapsulation allows Layer 2 traffic to be transported over a Layer 2 multipath interconnect
  – PBB traffic load balancing is pre-provisioned
Single Stage Fabric Interconnects
• Packet-based with a fabric header
  – The Ethernet packet is pre-pended with a fabric header
  – Packets are inspected only once, as they ingress the edge switch
  – Egress processing and path selection are encoded in the fabric header
  – The fabric header is used for all internal fabric forwarding
  – Load balancing is performed via intra-fabric congestion notification
  – On egress, the fabric header is stripped and egress processing is performed
  – Broadcom HiGig is an example of this type of interconnect
Frame layout: [HiGig Header | Payload Packet | FCS] – the FCS covers the Ethernet frame plus the HiGig header
Single Stage Fabric Interconnects
• Cell-based
  – Each cell is a portion of the original packet
  – Each cell is pre-pended with a fabric header
  – Packets are inspected only once, as they ingress the edge switch
  – Egress processing and path selection are encoded in the first cell's header
  – The cell header is used for all internal fabric forwarding
  – Load balancing is performed via intra-fabric congestion notification
  – On egress, cells are reassembled and egress processing is performed
  – The Broadcom Dune fabric is cell-based
Cell layout: the overall payload packet is carried as [Cell Hdr-1 | Payload Packet-C1 | FCS], [Cell Hdr-2 | Payload Packet-C2 | FCS], …, [Cell Hdr-n | Payload Packet-Cn | FCS] (a segmentation/reassembly sketch follows)
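A minimal segmentation/reassembly sketch under assumed parameters (64-byte cells, made-up header fields; the actual Dune cell format is proprietary and not described here):

```python
CELL_PAYLOAD = 64  # assumed cell payload size, bytes

def segment(packet: bytes, dest_port: int):
    """Split a packet into cells; the first cell's header carries the
    egress/path info, later cells only need reassembly bookkeeping."""
    chunks = [packet[i:i + CELL_PAYLOAD]
              for i in range(0, len(packet), CELL_PAYLOAD)]
    cells = []
    for seq, chunk in enumerate(chunks):
        header = {"seq": seq, "last": seq == len(chunks) - 1}
        if seq == 0:
            header["dest_port"] = dest_port  # egress/path selection
        cells.append((header, chunk))
    return cells

def reassemble(cells):
    """Egress side: order by sequence number and rebuild the packet."""
    cells = sorted(cells, key=lambda c: c[0]["seq"])
    assert cells[-1][0]["last"], "missing tail cell"
    return b"".join(chunk for _, chunk in cells)

pkt = bytes(range(200))
cells = segment(pkt, dest_port=7)
assert reassemble(cells) == pkt
```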
Broadcom Switching Products for DC Fabrics
• Massive scaling with non-blocking performance & low latency
  – Highest 10/40/100GbE port density per chip
  – Standalone and chassis solutions
• Service-oriented, deterministic performance
  – Service-level-aware packet buffering and load handling
  – Fabric-wide elimination of congestion and HOL blocking
• Flat 1-tier topologies for scaling simplicity
  – Multipathing, Clos-based any-to-any connectivity
  – Manageable as a single logical switching device
[Figure: aggregation switches, top-of-rack switches, and rack servers]
Complete End-to-End Data Center Switch Portfolio with StrataXGS® & DUNE
Desirable Fabric Features
• Scalable total capacity – 100G to 200+ Tbps
  – In-service, non-traffic-affecting upgrade capability
• Scalable port rate – 10G, 40G, 100G
  – Must be clear channel (i.e., no hashing schemes allowed)
  – Future consideration: 400G & 1 Tbps
• Strictly non-blocking & low delay
  – Work-conserving operation under any traffic pattern
  – Full bisection bandwidth
    • Bus, tree, hypercube, torus, and butterfly interconnects don't fit
    • Clos, fat-tree, and full-mesh interconnects may fit
• Resiliency – fast recovery from link and/or device failures
  – Ideally automatic, HW-based failover
• Backward & forward compatibility – at least one generation of investment protection
[Figure: a folded Clos built from 2x2 and 4x4 elements, and a fat-tree]
Fabric Features Continued
• The fabric is used to carry asynchronous traffic
• Asynchronous traffic means possible congestion, congestion means buffering, and buffering means traffic management
• Traffic management is a system-level problem
  – Distributed across 100s of devices
• Traffic management means (a CIR/EIR sketch follows this list):
  – Guaranteed minimum rate per interface – Committed Information Rate (CIR)
  – Distribution of leftover bandwidth per interface – Excess Information Rate (EIR)
  – Deep buffering – at ultra-high rates, sometimes it must be in DRAM
  – Smart buffering and end-to-end flow control
  – Flow control – to enable lossless operation
    • Link-level – 802.3x
    • Link-priority-level – 802.1Qbb (PFC)
    • Host-level – 802.1ag
    • Flow-level – 802.1Qau (CN)
• Traffic management is an integral part of the fabric solution
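To make CIR/EIR concrete, a toy allocator (illustrative only; real traffic managers enforce this per queue in hardware): every interface first receives its committed rate, and leftover link bandwidth is shared in proportion to EIR weights.

```python
def allocate(link_rate, interfaces):
    """interfaces: list of (name, cir, eir_weight) tuples.
    Returns {name: rate} with CIR guaranteed, leftover split by weight."""
    committed = sum(cir for _, cir, _ in interfaces)
    assert committed <= link_rate, "CIRs oversubscribe the link"
    leftover = link_rate - committed
    total_w = sum(w for _, _, w in interfaces) or 1
    return {name: cir + leftover * w / total_w
            for name, cir, w in interfaces}

# 100G link: storage gets 40G committed, leftover shared 1:3
print(allocate(100, [("storage", 40, 1), ("compute", 20, 3)]))
# {'storage': 50.0, 'compute': 50.0}
```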
Broadcom HiGig Packet Fabric
• Multiple switches act as ONE logical switch
  – Ease of management as one entity
  – Logical extension of ports
• Intelligent congestion avoidance
  – Intelligent path selection
  – Destination-module VoQs
• Trunking/teaming of ports
  – Multiple ports on different systems function as one logical port
• Increased throughput per port
  – Over-clocking of HiGig ports up to 20G+ per port
• Faster response times to flow control
  – Pre-emptive Service-Aware Flow Control (SAFC)
  – Less buffering required for lossless behavior
[Figure: EoR and ToR switches plus a vSwitch hosting VM 1…VM n, all inside one HiGig domain – one logical switch]
HiGig is widely deployed in the industry, with over 100 million ports shipped
SAND for Scalable Switching
SAND scales from small systems (fabric-less, 240 Gbps) through medium systems (single-stage, 10 Tbps) to large systems (multistage, 100 Tbps).
[Figure: small systems interconnect P230/P330 packet processors (with packet SDRAM and QDR buffers) directly over XAUI; medium systems build 320G line cards of P230/P330 devices connected through FE600-based switching cards; large systems extend to a multistage FE600 fabric]
SAND Modular Chassis Configuration
[Figure: a modular chassis – 320G line cards of P230/P330 packet processors and 10G PHYs connected across fabric planes 1…k, each plane built from FE13 and FE2 fabric elements]
Dynamic Routing Fabric Technology
[Figure: two multistage fabrics compared side by side – static routing (hash-based load balancing) concentrates traffic onto a few internal links, while dynamic routing (load-based load balancing) spreads traffic evenly across all links]
CLOS Fabric Technology Advantage
• What is Clos? – a multistage interconnect
  – At every stage, each element is connected to all elements of the next stage
• Characteristics
  – Multiple routes from the input to the middle stage
  – A single route from the middle stage to the output
  – Recursively scalable
  – Re-arrangeably non-blocking
• All Clos networks are not created equal – static routing vs. dynamic routing (see the note after this list)
• With dynamic routing, the Clos network becomes strictly non-blocking
• Optimal utilization of fabric capacity regardless of traffic pattern
• Fabric scalability
• Pipe/port rate scalability
• Elegant fault-tolerance scheme
• The key challenge is cell reordering – it must be scalable, reliable, and low-cost
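For reference, the standard Clos theory behind these bullets: a three-stage Clos with n ports per ingress switch and m middle-stage switches is rearrangeably non-blocking when m >= n and strictly non-blocking when m >= 2n - 1; the slide's point is that dynamic routing lets a rearrangeable fabric behave as strictly non-blocking without the extra middle-stage cost. A tiny classifier:

```python
def clos_properties(n: int, m: int) -> str:
    """Classify a 3-stage Clos fabric by middle-stage count m
    relative to n, the ports per ingress switch (Clos, 1953)."""
    if m >= 2 * n - 1:
        return "strictly non-blocking"
    if m >= n:
        return "rearrangeably non-blocking"
    return "blocking (oversubscribed)"

for n, m in [(4, 4), (4, 7), (4, 3)]:
    print(f"n={n}, m={m}: {clos_properties(n, m)}")
```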
SAND CLOS Implementation
[Figure: a folded Clos – Petra packet processors #1…#n each attach K fabric links to k parallel n x n planes of FE600 fabric elements; n <= 96 for single-stage, and a multistage (MS) fabric of m x m planes for m > 96]
USE CASES WITH SANs
Start from an Oversubscribed Usage
[Figure: servers attach to access switches, which uplink to aggregation switches over horizontal cables]
Pay As You Go to Non-Blocking
Distributed Aggregation
• Identical gear for network/storage spines – which is which can be decided at deployment time
• Leafs come in compute/storage flavors, with identical topology attributes and hops
Absorb a Converged Network Over Distributed Aggregation
[Figure: storage leafs feed converged storage spines within the distributed aggregation topology]
If You Have an FC SAN to Grandfather
• Two options (only #2 is shown)
  – Branch out at every access switch towards the existing SAN
  – Leave the topology alone and branch out from SAN gateway leaf(s)
• Dedicated storage spines can still be provisioned
[Figure: FC or FCoE spigots to the SAN]
Congestion Notification Concept
• Congestion Point (CP) – a port/priority queue supporting CN
  – Detects congestion and generates CNMs back to the traffic source
• Reaction Point (RP) – a traffic source (e.g., server NIC) supporting CN
  – Terminates the CNM, extracts the flow-ID, and reduces its rate according to the congestion level (feedback)
  – Also includes CP functionality
• Congestion Notification Message (CNM) – a congestion-indication packet sent from the CP to the RP
[Figure: reaction points A and B send Data A=>C and Data B=>C through a congested CP queue toward reaction point C; the CP sends CNMs back to A and B. Reaction-point functionality: flows 1…N feed the data-out path; on receiving a CNM, the RP extracts the flow and updates its rate]
Congestion Notification
• Objectives – avoid packet loss and reduce network latency
• Principles
  – Detect congestion points in the network (per port and priority)
    • Monitor the instantaneous CP queue size and the change in CP queue size
  – Signal congestion back to the traffic sources (RPs) using CNM packets
    • The rate of CNM packets is limited – a maximum sampling rate is defined
  – Traffic sources reduce the rate of flows targeted at the same congestion point (a toy rate loop follows this list)
    • The rate is increased slowly in the absence of CNM packets
• Defined for Layer 2 networks – CNM packets are sent back to the traffic source using the MAC-SA from the sampled packet
• Applied per priority – CNMs are sent only for traffic in a CN priority
• Can be used in combination with PFC
  – PFC prevents drops during the onset of congestion
  – CN slows the sources of congestion to limit the duration of congestion spreading
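A toy reaction-point rate loop, loosely modeled on the 802.1Qau (QCN) reaction point – the gain constant and the halfway "fast recovery" rule are illustrative assumptions, not the standard's full self-increase state machine:

```python
GD = 1 / 128.0  # assumed feedback gain

class ReactionPoint:
    """Multiplicative decrease on CNM feedback, slow recovery otherwise."""
    def __init__(self, line_rate_gbps):
        self.rate = line_rate_gbps
        self.target = line_rate_gbps  # rate before the last decrease

    def on_cnm(self, qntz_fb):
        """qntz_fb: quantized congestion feedback, 0..63 (more = worse)."""
        self.target = self.rate
        self.rate *= (1 - GD * qntz_fb)  # cuts at most ~50% when fb = 63

    def on_timer(self):
        """No CNMs lately: recover halfway toward the pre-decrease rate."""
        self.rate = (self.rate + self.target) / 2

rp = ReactionPoint(10.0)
rp.on_cnm(32);  print(round(rp.rate, 2))  # 7.5
rp.on_timer();  print(round(rp.rate, 2))  # 8.75
```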
Congestion Point Functionality
• Qeq (Qsp) – Queue Set Point: the target number of octets in the CP queue
• Qoffset – queue-length offset from the Queue Set Point
• Fb – feedback: the congestion level (more negative => higher congestion)
• Qdelta – queue-length change over the last sampling period
• SampleBase – minimum number of octets to enqueue to the CP queue between two generated CNM packets
• Enqueued – number of octets remaining to be enqueued to the CP queue before the next CNM can be generated
[Flowchart: CP sampling, run for each arriving packet]
1. Packet in; if the CP queue is full, the packet is dropped, otherwise it is enqueued.
2. Enqueued -= packet_len
3. When Enqueued reaches 0, take a sample:
   – Qoffset = Qeq - Qlen
   – Qdelta = Qlen - Qlen_old
   – Fb = Qoffset - W * Qdelta
   – Qold = Qlen
   – Enqueued = SampleBase(Fb), with a +/-15% random factor added
   – If Fb < 0, send a CNM
4. Packet out. (A runnable version of this loop follows.)
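The same sampling logic as a runnable sketch; W, Qeq, and the SampleBase value are assumed illustrative constants (the slide gives no configuration values), dequeues are omitted, and the SampleBase(Fb) scaling is simplified to a constant:

```python
import random

W = 2                     # assumed derivative weight
QEQ = 26 * 1024           # assumed queue set point, octets
SAMPLE_BASE = 150 * 1024  # assumed base sampling interval, octets

class CongestionPoint:
    def __init__(self):
        self.qlen = 0                # current CP queue length, octets
        self.qlen_old = 0            # queue length at the previous sample
        self.enqueued = SAMPLE_BASE  # octets until the next sample

    def enqueue(self, pkt_len, send_cnm):
        self.qlen += pkt_len         # (dequeues omitted in this sketch)
        self.enqueued -= pkt_len
        if self.enqueued > 0:
            return
        qoffset = QEQ - self.qlen
        qdelta = self.qlen - self.qlen_old
        fb = qoffset - W * qdelta    # more negative => more congested
        self.qlen_old = self.qlen
        # rearm with +/-15% jitter; 802.1Qau also scales this with Fb
        self.enqueued = int(SAMPLE_BASE * random.uniform(0.85, 1.15))
        if fb < 0:
            send_cnm(fb)             # the CNM carries quantized feedback

cp = CongestionPoint()
for _ in range(200):
    cp.enqueue(1500, lambda fb: print("CNM, Fb =", fb))
```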
Congestion Notification Message Format
• The CNM packet is sent back to the source of the sampled packet
  – CNM MAC-DA <= sampled packet MAC-SA
• The RP extracts the Flow-ID (carried in the CN-TAG) – identifies which flow(s) require rate adjustment
• The RP extracts QntzFb – defines the congestion severity
Sampled packet format:
– DA (6B)
– SA (6B)
– VLAN tag (optional, 4B)
– CN-TAG with flow-ID (4B)
– Payload
Generated CNM packet format:
– DA (6B) – copied from the sampled packet's SA
– SA (6B) – configured
– VLAN tag (optional, 4B)
– CN-TAG with flow-ID (4B) – copied
– CNM Eth-type (2B) – configured
– CNM PDU (24B) – generated
CNM PDU fields (a packing sketch follows):
– Version (4 bits) – configured
– Reserved (6 bits)
– QntzFb (6 bits) – generated
– CPID (8B) – configured
– Qoffset (2B) – generated
– Qdelta (2B) – generated
– Sampled packet priority (2B) – copied
– Sampled packet MAC-DA (6B) – copied
– Encapsulated packet length (2B) – configured
– Sampled packet payload (optional, up to 64 bytes) – copied
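As a concrete reading of the 24-byte fixed CNM PDU, a packing sketch in Python; the field widths follow the table above, but the exact bit layout of the first two bytes is an assumption:

```python
import struct

def build_cnm_pdu(qntz_fb, cpid, qoffset, qdelta,
                  sampled_priority, sampled_mac_da, encap_len,
                  version=0):
    """Pack the 24B fixed CNM PDU: 4b version | 6b reserved | 6b QntzFb,
    then CPID (8B), Qoffset (2B), Qdelta (2B), sampled-packet priority
    (2B), sampled-packet MAC-DA (6B), encapsulated packet length (2B)."""
    first_word = ((version & 0xF) << 12) | (qntz_fb & 0x3F)
    return struct.pack(">H8shhH6sH", first_word, cpid,
                       qoffset, qdelta, sampled_priority,
                       sampled_mac_da, encap_len)

pdu = build_cnm_pdu(qntz_fb=32, cpid=b"\x00" * 8, qoffset=-4096,
                    qdelta=1500, sampled_priority=3,
                    sampled_mac_da=b"\xaa\xbb\xcc\xdd\xee\xff",
                    encap_len=64)
assert len(pdu) == 24  # matches the 24B fixed PDU size above
```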
Conclusion
• Flat or single-stage fabrics are the enablers for next-generation data centers
• Compute and storage can co-exist on these fabrics
• Single-stage fabrics are optimal for supporting FCoE, since congestion can be managed most effectively