John DeHart and Mike Wilson SPP V2 Router Design.


2 - Mike Wilson - 04/19/23

Revision History

3 June 2008
»Initial release, presentation

25 June 2008
»Updates on feedback from presentation

SPP Versions

SPP Version 0:
»What we used for SIGCOMM Paper

SPP Version 1:
»Bare minimum we would need to release something to PlanetLab Users

SPP Version 2:
»What we would REALLY like to release to PlanetLab users.

Objectives for SPP-NPE version 2

Deal with constraints imposed by switch
»can send to only one NPU; can receive from only one NPU
»split processing across NPUs
    parsing, lookup on one; queueing on other

Provide more resources for slice-specific processing

Decouple QM schedulers from links
»collection of largely independent schedulers
»may use several to send to the same link
    e.g. separate rate classes (1-10M, 10-100M, 100-1000M)
    optionally adjust scheduler rates dynamically

Provide support for multicast
»requires addition of next-hop IP address after queueing

Enable single slice to operate at 10 Gb/s

Support “slow” code options
»Use separate rate classes to limit rate to slow code options
»LCI QMs for Parse, NPUB QMs for HdrFmt

SPP Version 2 System Architecture

[Figure: system architecture diagram. Labeled elements: GPE Blade (×2), Switch Blade, NPE 7010 Blade, LC 7010 Blade, SPI Switch, NPUA, NPUB, LC Ingress, LC Egress, RTM, FIC (×2), 1×10 Gb/s or 10×1 Gb/s. Processing stages: Decap, Parse, Lookup, AddShim; Copy, QM, HdrFormat. Highlighted: Default Data Path.]

SPP Version 2 System Architecture

[Figure: SPP Version 2 system architecture diagram. Highlighted: Fast-Path Data.]

SPP Version 2 System Architecture

[Figure: SPP Version 2 system architecture diagram. Highlighted: Exception Data Path, Local Delivery.]

NPE Version 2 Block Diagram

[Figure: NPE block diagram. Blocks and microengine (ME) counts:]

NPUA: RxA (2 ME), Decap (1 ME), Parse (8 ME), LookupA (1 ME), AddShim (1 ME), TxA (2 ME), StatsA (1 ME)

NPUB: RxB (2 ME), LookupB&Copy (2 ME), QueueManager (4 MEs), HdrFmt/SubEncap (4 MEs), Scr2NN/Freelist (1 ME), TxB (2 ME), StatsB (1 ME)

Also shown: TCAM, SRAM banks (including SRAM/0 and SRAM/3), SPI Switch, Switch Blade, GPE


PlanetLab NPE Input Frame from LC

Ethernet Header:
»DstAddr: MAC address of NPE
»SrcAddr: MAC address of LC
»VLAN: One VLAN per MR (MR == Slice)
    Only use lower 12 bits of VLAN tag

IP Header:
»Dst Addr: IP address of this node
    How many IP Addresses can a NODE have?
»Src Addr: IP address of previous hop
»Protocol: UDP

UDP Header:
»Dst Port: Identifies input tunnel
»Src Port: with IP Src Addr identifies sending entity

Frame layout:
»Ethernet Header: DstAddr (6B), SrcAddr (6B), Type=802.1Q (2B), VLAN (2B), Type=IP (2B)
»IP Header: Ver/HLen/Tos/Len (4B), ID/Flags/FragOff (4B), TTL (1B), Protocol = UDP (1B), Hdr Cksum (2B), Src Addr (4B), Dst Addr (4B), IP Options (0-40B)
»UDP Header: Src Port (2B), Dst Port (2B), UDP length (2B), UDP checksum (2B)
»UDP Payload (MN Packet)
»Ethernet Trailer: PAD (nB), CRC (4B)

[Figure marks 8-byte boundaries, assuming no IP Options.]
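The frame layout above can be illustrated with a small host-side parser. This is only a sketch under stated assumptions, not the microengine code: the names `InputFrame` and `parse_input_frame` are hypothetical, and the frame is assumed to carry a single 802.1Q tag and an IPv4 header without options, as the slide assumes.

```python
import struct
from typing import NamedTuple

ETH_VLAN_HDR_LEN = 18   # DstAddr 6B + SrcAddr 6B + Type=802.1Q 2B + VLAN 2B + Type=IP 2B
IPV4_MIN_HDR_LEN = 20   # no IP options
UDP_HDR_LEN = 8
TPID_8021Q = 0x8100
ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17


class InputFrame(NamedTuple):   # hypothetical container, not from the slides
    slice_id: int    # lower 12 bits of the VLAN tag
    ip_saddr: int    # previous hop
    ip_daddr: int    # this node
    udp_sport: int   # with ip_saddr, identifies the sending entity
    udp_dport: int   # identifies the input tunnel
    mn_offset: int   # byte offset of the MN packet inside the frame
    mn_length: int   # length of the MN packet in bytes


def parse_input_frame(frame: bytes) -> InputFrame:
    """Parse the Ethernet/802.1Q/IPv4/UDP headers; raise ValueError otherwise."""
    if len(frame) < ETH_VLAN_HDR_LEN + IPV4_MIN_HDR_LEN + UDP_HDR_LEN:
        raise ValueError("frame too short")
    tpid, vlan_tag, ethertype = struct.unpack_from("!HHH", frame, 12)
    if tpid != TPID_8021Q or ethertype != ETHERTYPE_IPV4:
        raise ValueError("expected an 802.1Q-tagged IPv4 frame")
    ip_off = ETH_VLAN_HDR_LEN
    version, ihl = frame[ip_off] >> 4, (frame[ip_off] & 0x0F) * 4
    if version != 4 or ihl != IPV4_MIN_HDR_LEN:
        raise ValueError("expected IPv4 without options")
    if frame[ip_off + 9] != IPPROTO_UDP:
        raise ValueError("expected UDP")
    ip_saddr, ip_daddr = struct.unpack_from("!II", frame, ip_off + 12)
    udp_off = ip_off + ihl
    sport, dport, udp_len = struct.unpack_from("!HHH", frame, udp_off)
    if udp_len < UDP_HDR_LEN:
        raise ValueError("bad UDP length")
    return InputFrame(vlan_tag & 0x0FFF, ip_saddr, ip_daddr, sport, dport,
                      udp_off + UDP_HDR_LEN, udp_len - UDP_HDR_LEN)
```

The returned offset and length correspond to the meta-frame (MN packet) that later blocks operate on.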

RxA

[Figure: NPE Version 2 block diagram.]

Ring entry fields shown in the figure:
»Buffer Handle (24b), Rsv (3b), V (1b), Intf (4b)
»Eth. Frame Len (16b), Reserved (12b), Port (4b)

No change from V1

Decap

[Figure: NPE Version 2 block diagram.]

Ring entry fields shown in the figure:

Entry with frame length and port:
»Buffer Handle (24b), Rsv (3b), V (1b), Intf (4b)
»Eth. Frame Len (16b), Reserved (12b), Port (4b)

Entry with slice and tunnel fields:
»Buffer Handle (24b), Rsv (3b), V (1b), Intf (4b)
»Rx UDP DPort (16b), Slice ID (VLAN) (16b)
»MN Frm Offset (16b), MN Frm Length (16b)
»Rx IP SAddr (32b)
»Rx UDP SPort (16b), Reserved (12b), Code (4b)
»Slice Data Ptr (32b)

Inputs:
»Packet from RxA

Outputs:
»Meta-frame (handle, offset and length)
»Slice ID (VLAN tag)
    …or is this lower 12b of VLAN tag and lower 4b of RX DA in?
»Metainterface (Rx Saddr, Rx Sport, Rx Dport)
»Code Option (4b, only 16 available)
»Slice data pointer

Initialization:
»VLAN table

Functionality:
»Read VLAN tag from DRAM, determine correct code option.
»Validate packet. Drop invalid, unmatched packets.
    IP Options for NPE dropped in LC, should never arrive here!
»Enqueue valid packets to SRAM ring.
»Update stats

Status:
»Change dl_sink from NN to SRAM.
»No longer need to update buffer descriptor.
    …except for min-sized packets, which RxA does not update fully (pkt len)

VLAN table

VLAN     code_opt   slice_data_ptr   slice_data_size
0        0          0                0
1        0          0                0
…        …          …                …
0x0aa    1          (points to slice data; figure labels: SD data, P data)
…        …          …                …
0x7ff    0          0                0

code_option = 0 implies invalid slice
»“on switch” for a slice in the data plane

SD data is currently only counters

64B slice data

Only use lower 12b of VLAN tag (4096 VLANs)

Only changes from V1:
»No longer need all data on NPUA, drop HF data, per-slice buffer limits
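The table semantics above (index by the lower 12 bits of the VLAN tag, `code_opt` of 0 meaning "invalid slice") can be sketched as follows. The class and method names are hypothetical and exist only for illustration; the 64-byte default reflects the "64B slice data" note.

```python
from typing import NamedTuple, Optional

VLAN_MASK = 0x0FFF  # only the lower 12 bits of the VLAN tag are used


class VlanEntry(NamedTuple):
    code_opt: int          # 0 means the slice is invalid (switched off)
    slice_data_ptr: int
    slice_data_size: int


INVALID = VlanEntry(0, 0, 0)


class VlanTable:
    """Hypothetical model of the Decap VLAN table, written by control."""

    def __init__(self) -> None:
        self._entries = [INVALID] * (VLAN_MASK + 1)

    def enable_slice(self, vlan: int, code_opt: int, ptr: int, size: int = 64) -> None:
        # Code option is a 4-bit field and 0 is reserved for "invalid".
        if not 1 <= code_opt <= 15:
            raise ValueError("code option must be in 1..15")
        self._entries[vlan & VLAN_MASK] = VlanEntry(code_opt, ptr, size)

    def disable_slice(self, vlan: int) -> None:
        # Setting code_opt back to 0 acts as the slice's "off switch".
        self._entries[vlan & VLAN_MASK] = INVALID

    def lookup(self, vlan_tag: int) -> Optional[VlanEntry]:
        """Return the entry for a valid slice, or None (packet is dropped)."""
        entry = self._entries[vlan_tag & VLAN_MASK]
        return entry if entry.code_opt != 0 else None
```

A `None` result corresponds to the "drop invalid, unmatched packets" step in Decap.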

Parse

[Figure: NPE Version 2 block diagram.]

Ring entry fields shown in the figure:

Entry with slice and tunnel fields:
»Buffer Handle (24b), Rsv (3b), V (1b), Intf (4b)
»Rx UDP DPort (16b), Slice ID (VLAN) (16b)
»MN Frm Offset (16b), MN Frm Length (16b)
»Rx IP SAddr (32b)
»Rx UDP SPort (16b), Reserved (12b), Code (4b)
»Slice Data Ptr (32b)

Entry with lookup key:
»Buffer Handle (24b), Rsv (3b), V (1b), Intf (4b)
»MN Frm Length (16b), MN Frm Offset (16b)
»Lookup Key [143-112]: Type (1b) / Slice ID (15b) / Rx UDP DPort (16b)
»Lookup Key [111-80]: DA (32b)
»Lookup Key [79-48]: SA (32b)
»Lookup Key [47-16]: Ports (32b)
»Lookup Key [15-0]: Proto/TCP_Flags (16b), with Exception Bits (12b) and Code (4b) in the same word

Inputs:
»Meta-frame (handle, offset and length)
»Slice ID (VLAN tag)
»Tunnel ID (Rx Saddr, Rx Sport, Rx Dport)
»Code Option (4b, only 16 available)
»Slice data pointer

Outputs:
»Meta-frame (handle, offset and length)
»Lookup key (Includes slice ID, Rx UDP dport)
    Change to include lower 4b of RX DA in; shave VLAN bits for the SliceID.
»Code Option (4b, only 16 available)
»Exception bits (MN-specific)

Initialization:
»Slice Data

Functionality:
»Slice-specific processing:
    Parse meta-frame.
    Extract lookup key.
    Raise any relevant exceptions.
    Can pass slice data to HdrFmt in bytes 16..30 of packet. (0..15 are reserved for AddShim)
»Substrate processing:
    Add substrate-specific information to lookup key (32b: Lookup type, Slice ID, Rx UDP dport)

Status:
»Change to multi-ME synchronization
»Read, write to SRAM rings
»No longer need all V1 outputs; some have been removed and the rest compacted. (This change is optional, but may remove a memory access)
    Slice data pointer, Rx UDP sport, Rx UDP Saddr
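As an illustration of the 144-bit lookup key that Parse produces, the sketch below packs the fields at the bit positions given in the figure (substrate word in bits 143-112, DA in 111-80, SA in 79-48, Ports in 47-16, Proto/TCP_Flags in 15-0). The function name is hypothetical, and two details are assumptions not stated in the slides: the source port occupies the upper half of the Ports word, and the protocol occupies the upper byte of Proto/TCP_Flags.

```python
def _check(name: str, value: int, bits: int) -> int:
    """Reject values that do not fit in the given field width."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} does not fit in {bits} bits")
    return value


def build_lookup_key(lookup_type: int, slice_id: int, rx_udp_dport: int,
                     da: int, sa: int, sport: int, dport: int,
                     proto: int, tcp_flags: int) -> int:
    """Pack the IPv4 code option's lookup key into one 144-bit integer."""
    # Substrate-specific word: Type (1b) / Slice ID (15b) / Rx UDP DPort (16b)
    substrate = ((_check("lookup_type", lookup_type, 1) << 31)
                 | (_check("slice_id", slice_id, 15) << 16)
                 | _check("rx_udp_dport", rx_udp_dport, 16))
    # Assumed ordering inside the slice-specific words (see lead-in).
    ports = (_check("sport", sport, 16) << 16) | _check("dport", dport, 16)
    proto_flags = (_check("proto", proto, 8) << 8) | _check("tcp_flags", tcp_flags, 8)
    return ((substrate << 112)
            | (_check("da", da, 32) << 80)
            | (_check("sa", sa, 32) << 48)
            | (ports << 16)
            | proto_flags)
```

Keeping the substrate word in the top 32 bits means a filter can never match across slices, whatever the slice-specific code places in the remaining 112 bits.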

LookupA

[Figure: NPE Version 2 block diagram.]

Ring entry fields shown in the figure:

Entry with lookup key:
»Buffer Handle (24b), Rsv (3b), V (1b), Intf (4b)
»MN Frm Length (16b), MN Frm Offset (16b)
»Lookup Key [143-112]: Type (1b) / Slice ID (15b) / Rx UDP DPort (16b)
»Lookup Key [111-80]: DA (32b)
»Lookup Key [79-48]: SA (32b)
»Lookup Key [47-16]: Ports (32b)
»Lookup Key [15-0]: Proto/TCP_Flags (16b), with Exception Bits (12b) and Code (4b) in the same word

Entry with lookup result:
»Buffer Handle (24b), Rsv (3b), V (1b), Intf (4b)
»Result Index (32b)
»Slice ID (VLAN) (16b), Exception Bits (12b), Code (4b)
»MN Frm Length (16b), MN Frm Offset (16b)
»Rsvd (16b), Stats Index (16b)

Inputs:
»Meta-frame (handle, offset and length)
»Lookup key (Includes slice ID, Rx UDP dport)
»Slice ID (VLAN tag)
»Code Option (4b, only 16 available)

Outputs:
»Meta-frame (handle, offset and length)
»Lookup Result (Index into SRAM table on NPUB)
    32b is overkill; some of these bits are reserved.
»Slice ID (VLAN tag)
»Code Option (4b, only 16 available)
»Exception bits (from Parse)
»Stats Index (from TCAM)

Initialization:
»Filters set in TCAM by control

Functionality:
»Look up key in TCAM
»On miss, drop the packet

Status:
»Local Delivery is now handled at LookupB in SRAM table
»Lookup result is now just a 32b index
»No longer need all V1 input/outputs; some have been removed and the rest compacted. (This change is optional)
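The lookup step can be modeled in software as a ternary match: each filter holds a value and a mask, the first matching filter wins, and a miss means the packet is dropped. This is a generic TCAM model, not the actual device interface; the `Filter` and `Tcam` names are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Filter(NamedTuple):
    value: int          # key bits to match
    mask: int           # 1 = bit must match, 0 = don't care
    result_index: int   # index into the SRAM result table on NPUB
    stats_index: int


class Tcam:
    """Software model of a priority-ordered ternary match table."""

    def __init__(self) -> None:
        self._filters: List[Filter] = []

    def add_filter(self, f: Filter) -> None:
        # Earlier filters have higher priority, as in a hardware TCAM.
        self._filters.append(f)

    def lookup(self, key: int) -> Optional[Filter]:
        """Return the first matching filter, or None on a miss (drop)."""
        for f in self._filters:
            if (key & f.mask) == (f.value & f.mask):
                return f
        return None
```

Only `result_index` and `stats_index` leave LookupA; everything else about the forwarding decision is resolved later from the result table on NPUB.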

AddShim

[Figure: NPE Version 2 block diagram.]

Ring entry fields shown in the figure:

Entry with lookup result:
»Buffer Handle (24b), Rsv (3b), V (1b), Intf (4b)
»Result Index (32b)
»Slice ID (VLAN) (16b), Exception Bits (12b), Code (4b)
»MN Frm Length (16b), MN Frm Offset (16b)
»Rsvd (16b), Stats Index (16b)

Entry with buffer handle only:
»Buffer Handle (24b), Rsv (3b), V (1b), Intf (4b)

Inputs:
»Meta-frame (handle, offset and length)
»Lookup Result (Index into SRAM table on NPUB)
»Slice ID (VLAN tag)
»Code Option (4b, only 16 available)
»Exception bits (from Parse)
»Stats Index (from TCAM)

Outputs:
»Shim Packet (buffer handle)
    Buffer descriptor contains updated offset and length, if needed

Initialization:
»None.

Functionality:
»Prepend shim header to preserve packet annotations across NPUs
»Overwrite the existing Ethernet header (up to 18B) with:
    Slice ID (16b)
    Code Option (4b)
    Exception Bits (12b)
    MN Frame Offset (16b)
    Result Index (32b)
    Stats Index (16b) [This is the same on NPUA, NPUB]
    32B for opaque slice data.
        Proper memory alignment required
        This is written by Parse, not AddShim!

Status:
»New. Stub version is written.

TxA

[Figure: NPE Version 2 block diagram.]

Ring entry fields shown in the figure:
»Buffer Handle (24b), Rsv (3b), V (1b), Intf (4b)

Sends shim packet to NPUB.

Unmodified 10 Gbps Tx 2×ME.

SPP Version 2 NPUA to NPUB Frame

SHIM (16B)
»Slice ID (16b)
»Code Option (4b)
»Exception Bits (12b)
»Result Index (32b)
»Stats Index (16b)
»Offset of MN Packet (16b)
»Memory Alignment Padding (4B)

IP Header, UDP Header may be overwritten by:
»opaque slice data, written in Parse

Frame layout:
»SHIM (16B), Type=IP (2B)
»IP Header: Ver/HLen/Tos/Len (4B), ID/Flags/FragOff (4B), TTL (1B), Protocol = UDP (1B), Hdr Cksum (2B), Src Addr (4B), Dst Addr (4B), IP Options (0-40B)
»UDP Header: Src Port (2B), Dst Port (2B), UDP length (2B), UDP checksum (2B)
»UDP Payload (MN Packet)
»Ethernet Trailer: PAD (nB), CRC (4B)

[Figure marks 8-byte boundaries, assuming no IP Options.]
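The 16-byte shim can be sketched as a pack/unpack pair. The field widths come from the list above; the field order follows that list, but the exact bit placement within each 32-bit word and the big-endian byte order are assumptions, and the names `Shim`, `pack_shim`, and `unpack_shim` are hypothetical.

```python
import struct
from typing import NamedTuple

SHIM_LEN = 16
_SHIM_FMT = "!IIHH4x"   # word0, Result Index, Stats Index, MN offset, 4B padding


class Shim(NamedTuple):
    slice_id: int         # 16b
    code_option: int      # 4b
    exception_bits: int   # 12b
    result_index: int     # 32b
    stats_index: int      # 16b
    mn_offset: int        # 16b, offset of the MN packet


def pack_shim(s: Shim) -> bytes:
    """Serialize the shim into 16 bytes (last 4 bytes are alignment padding)."""
    if not (0 <= s.slice_id < 1 << 16 and 0 <= s.code_option < 1 << 4
            and 0 <= s.exception_bits < 1 << 12):
        raise ValueError("shim field out of range")
    # Assumed layout of word 0: Slice ID | Code Option | Exception Bits.
    word0 = (s.slice_id << 16) | (s.code_option << 12) | s.exception_bits
    return struct.pack(_SHIM_FMT, word0, s.result_index, s.stats_index, s.mn_offset)


def unpack_shim(data: bytes) -> Shim:
    """Recover the shim fields from the first 16 bytes of a frame."""
    word0, result_index, stats_index, mn_offset = struct.unpack_from(_SHIM_FMT, data)
    return Shim(word0 >> 16, (word0 >> 12) & 0xF, word0 & 0xFFF,
                result_index, stats_index, mn_offset)
```

Since the shim replaces the Ethernet header in place, NPUB can read it from a fixed position at the start of the frame.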

RxB

[Figure: NPE Version 2 block diagram.]

Ring entry fields shown in the figure:
»Buffer Handle (24b), Reserved (8b)
»Eth. Frame Len (16b), Reserved (12b), Port (4b)

No change from V1

LookupB/Copy

[Figure: NPE Version 2 block diagram.]

Ring entry fields shown in the figure:

Entry with frame length and port:
»Buffer Handle (24b), Reserved (8b)
»Eth. Frame Len (16b), Reserved (12b), Port (4b)

Entry with queue ID:
»Buffer Handle (24b), Reserved (8b)
»Frame Length (16b), Stats Index (16b)
»Reserved (12b), QM (2b), Sch (3b), PerSchedQID (15b)

Inputs:
»Shim packet (buffer handle, frame length)

Outputs:
»Packet (buffer handle, frame length)
»QueueID (QM, Scheduler, Queue ID)
»Stats Index

Initialization:
»ResultTable

Functionality (Overview)
»Copy shim header into buffer descriptor
»Look up routing information from result index
»If multicast, make the copies
»Enqueue to correct QM (from ResultTable)
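The 20-bit QueueID combines a QM number (2b), a scheduler number (3b), and a per-scheduler queue ID (15b). A minimal sketch of packing and unpacking it follows; the function names are hypothetical, and placing QM in the most significant bits is an assumption, since the figure gives only the field widths.

```python
from typing import Tuple

QM_BITS, SCHED_BITS, QID_BITS = 2, 3, 15


def pack_qid(qm: int, sched: int, per_sched_qid: int) -> int:
    """Combine QM, scheduler, and per-scheduler queue ID into a 20-bit QID."""
    if not (0 <= qm < 1 << QM_BITS and 0 <= sched < 1 << SCHED_BITS
            and 0 <= per_sched_qid < 1 << QID_BITS):
        raise ValueError("QID field out of range")
    # Assumed ordering: QM | Sch | PerSchedQID, most significant first.
    return (qm << (SCHED_BITS + QID_BITS)) | (sched << QID_BITS) | per_sched_qid


def unpack_qid(qid: int) -> Tuple[int, int, int]:
    """Split a 20-bit QID into (qm, sched, per_sched_qid)."""
    return (qid >> (SCHED_BITS + QID_BITS),
            (qid >> QID_BITS) & ((1 << SCHED_BITS) - 1),
            qid & ((1 << QID_BITS) - 1))
```

With this ordering, LookupB/Copy can select the destination QM ring from the top two bits alone.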

LookupB/Copy – Code Sketch

if not currently processing mcast packet
    read packet from SRAM ring
    extract shim
    load ResultTable value
    fill buffer descriptor
    if unicast
        if per-slice packet limit permits
            update per-slice packet count
            write to SRAM ring for correct QM. (By qmschedID in result table value).
        else drop buffer
    else
        start mcast processing
        if per-slice packet limit permits
            update per-slice packet count
            fetch first header buffer descriptor
            if payload length ≠ 0
                write ref count into payload descriptor
            else drop payload buffer
        else
            drop buffer
            finish mcast processing
else (Currently processing buffer, have empty header buffer handle)
    fill header buffer descriptor
        only chain if payload buffer is not empty
    if still making copies
        fetch next header buffer descriptor
    else finish mcast processing
    write current header buffer handle to SRAM ring for correct QM. (By qmschedID).
signal next ME
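The sketch above can be approximated by a small runnable model. It is a simplification under stated assumptions: it makes all copies of a multicast packet in one call rather than one copy per loop iteration, it treats a result entry with a single QID as unicast, and every class and attribute name is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Buffer:
    handle: int
    ref_cnt: int = 1
    next: Optional["Buffer"] = None   # chained payload buffer, if any


@dataclass
class ResultEntry:
    qids: List[int]   # one QID per copy; a single QID is treated as unicast


@dataclass
class LookupBCopy:
    per_slice_limit: int
    per_slice_count: Dict[int, int] = field(default_factory=dict)
    queues: Dict[int, List[Buffer]] = field(default_factory=dict)
    dropped: List[Buffer] = field(default_factory=list)
    _next_hdr_handle: int = 1000   # stand-in for the header buffer freelist

    def _enqueue(self, qid: int, buf: Buffer) -> None:
        self.queues.setdefault(qid, []).append(buf)

    def process(self, slice_id: int, payload: Buffer, payload_len: int,
                result: ResultEntry) -> None:
        count = self.per_slice_count.get(slice_id, 0)
        if count >= self.per_slice_limit:      # per-slice packet limit exceeded
            self.dropped.append(payload)
            return
        self.per_slice_count[slice_id] = count + 1
        if len(result.qids) == 1:              # unicast: enqueue the buffer itself
            self._enqueue(result.qids[0], payload)
            return
        if payload_len != 0:                   # multicast: payload shared by all copies
            payload.ref_cnt = len(result.qids)
        else:
            self.dropped.append(payload)       # empty payload: headers only
        for qid in result.qids:
            hdr = Buffer(self._next_hdr_handle)
            self._next_hdr_handle += 1
            if payload_len != 0:
                hdr.next = payload             # only chain a non-empty payload
            self._enqueue(qid, hdr)
```

Each copy gets its own header buffer so that HdrFmt can later write a different header per copy while the payload is stored once.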

ResultTable – Unicast

Data needed to enqueue, rewrite packet:
»QID
    QMID, SchedID, QID (20b) (Lookup Result)
»Src MI:
    IP Saddr (32b) (Per SchedID Table)
    UDP Sport (16b) (Lookup Result)
»Tunnel Next Hop
    IP DAddr (32b) (Lookup Result)
    UDP DPort (16b) (Lookup Result)
»Chassis Addressing
    Ethernet Dst MAC (48b) (Per SchedID Table)
»Slice Specific Lookup Result Data (?) (Lookup Result)

Ethernet Src MAC
»Should be constant across all pkts.

Results Entry:
»QID (20b)
»IP DAddr (32b)
»UDP DPort (16b), UDP SPort (16b)

Per Sched Entry:
»IP SAddr (32b)
»Eth DA (48b)

Also shown: HFIndex (16b)

ResultTable – Multicast

Fanout gives the number of copies (0..15)

Data needed per copy on NPUB:
»QID
    QMID, SchedID, QID (20b) (Lookup Result)
»Src MI:
    IP Saddr (32b) (Per SchedID Table)
    UDP Sport (16b) (Lookup Result)
»Tunnel Next Hop
    IP DAddr (32b) (Lookup Result)
    UDP DPort (16b) (Lookup Result)
»Chassis Addressing
    Ethernet Dst MAC (48b) (Per SchedID Table)
»Slice Specific Lookup Result Data (?) (Lookup Result)

Ethernet Src MAC
»Should be constant across all pkts.

Support Multicast but optimize for Unicast

Results Entry (×16):
»Fanout (4b), QID (20b)
»IP DAddr (32b)
»UDP DPort (16b), UDP SPort (16b)

Per Sched Entry:
»IP SAddr (32b)
»Eth DA (48b)

Also shown: HFIndex (16b)

QM

[Figure: NPE Version 2 block diagram.]

Ring entry fields shown in the figure:

Entry with queue ID:
»Buffer Handle (24b), Reserved (8b)
»Frame Length (16b), Stats Index (16b)
»Reserved (12b), QM (2b), Sch (3b), PerSchedQID (15b)

Entry with buffer handle only:
»Buffer Handle (24b), Rsv (3b), V (1b), Intf (4b)

No change from V1
»Incorporates recent change to limit queues by #pkts

Some changes in how control allocates bandwidth
»Need to ensure that slow HdrFmt blocks can’t tie up the system
»Currently looking at worst-case engineering (everyone runs at slowest block speed)

HdrFmt / SubEncap

[Figure: NPE Version 2 block diagram.]

Ring entry fields shown in the figure:
»Buffer Handle (24b), Rsv (3b), V (1b), Intf (4b)

Inputs:
»Buffer Handle
»Remaining inputs from Buffer Descriptor:
    Multicast or Unicast (from buffer_next)
    Frame length, offset
    HFIndex (index into HFTable, a slice-specific table)
    QMSchedID (for per-sched lookup in ResultTable)

Outputs:
»Packet (buffer handle)
    Buffer descriptor contains updated offset and length

Initialization:
»HFTable, containing slice-specific data. For IPv4, this contains next-hop information (for both multicast and unicast traffic).

Functionality:
»Substrate level: read buffer descriptor and pass frame offset, length, HFIndex, mcast/ucast to slice-specific HdrFmt
»Slice level: arbitrary processing. For IPv4, this writes the next-hop information. Returns new offset, length of frame.
»Substrate level: Encapsulate for output tunnel (from ResultTable)

Status:
»Significant re-write at substrate level
»Slice-specific code should change very little (add multicast support)

Scr2NN/FreelistMgr

[Figure: NPE Version 2 block diagram.]

Ring entry fields shown in the figure:
»Buffer Handle (24b), Rsv (3b), V (1b), Intf (4b)

Inputs:
»Buffer Handle (possibly chained)

Outputs:
»Buffer Handle (possibly chained)

Initialization:
»None

Functionality:
»Combines Freelist Manager with Scr2NN glue
»FM: Read from scratch ring. Free buffers, correctly handling chained buffers and reference counts.
»Scr2NN: Read from Scratch, write to NN.

Status:
»Both blocks exist, but combining them is not straightforward.
    Open question: how should we prioritize among these tasks?
    The author should ensure that no deadlock is possible. (TxB writes to FM; if FM ring is full, TxB stalls. If Scr2NN is writing to TxB, it stalls. Gridlock.)
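The freelist manager's handling of chained buffers and reference counts can be sketched as below. The policy shown is an assumption about the intended behavior, not taken from the existing block: each buffer's count is decremented once, and the walk stops at the first buffer that is still referenced, since the rest of the chain belongs to the remaining holders. The names `Buf` and `free_chain` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Buf:
    handle: int
    ref_cnt: int = 1
    next: Optional["Buf"] = None


def free_chain(buf: Optional[Buf], freelist: List[int]) -> None:
    """Release one reference along a buffer chain.

    A buffer returns to the freelist only when its count reaches zero;
    a shared payload therefore survives until its last header is freed.
    """
    while buf is not None:
        buf.ref_cnt -= 1
        if buf.ref_cnt > 0:
            break                     # still referenced by another copy
        nxt, buf.next = buf.next, None
        freelist.append(buf.handle)
        buf = nxt
```

For a multicast packet with two copies, freeing the first header releases only that header; freeing the second releases both the header and the shared payload.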

TxB

[Figure: NPE Version 2 block diagram.]

Ring entry fields shown in the figure:
»Buffer Handle (24b), Rsv (3b), V (1b), Intf (4b)

Must support chained buffers
»Multicast uses header buffers and payload buffers
»Headers are slice-specific; we can’t rely on known, static lengths as we did in ONL.

Sends header from one buffer, payload from chained buffer.
»Can TX do this? Comments in the code seem to imply that chained (non-SOP) buffers must start at offset 0. Our payloads usually won’t.
    According to DZar, this will probably take some TX modification, but there’s no reason why it won’t work. Might have a performance penalty, of course….

SPP V2 SideB SRAM Buffer Descriptor

HFIndex is an index into the HFTable. For IPv4, this provides Next Hop information.

ResultIndex is used to get tunnel header info from the ResultTable

[Figure: buffer descriptor layout, long words LW0 to LW7. Fields shown, grouped into 32-bit words:]
»LW0: Buffer_Next (32b)
»Buffer_Size (16b), Offset (16b)
»Packet_Size (16b), Free_list 0000 (4b), Reserved (4b), Ref_Cnt (8b)
»Reserved (4b), Slice ID (xsid) (12b), Stats Index (16b)
»ResultIndex (32b)
»MR Exception Bits (16b), HFIndex (16b)
»MR Bits (optional) (32b)
»LW7: Packet_Next (32b)

Design Questions

Small hole for abuse in HdrFmt
»QM rate limits on payload length
»HdrFmt (after QM) can vastly increase packet length
»Should the LookupB table give the padding size for each entry?
    Enforced in SubEncap?
»ANSWER: No, we will resort to our control of HdrFmt to force it to behave. (We write all of the code options right now.)

What are the best places to update stats on NPUB?
»ANSWER: Post-Q only

Is there any remaining reason that NPUB would need the source tunnel information?
»ANSWER: No. If a code option needs it, put it into opaque slice data.

Still working out remaining data areas.

Extra Slides

The rest of the slides are old or for extra information

Questions/Issues

4/28/08:
»How many code options? Limit of 16?
»To handle slow Code Options:
    LCI Queues would control traffic to Fast/Slow Parse Code
        Classes of code options defined by how long their Parse code takes.
        Scheduler assigned to a class of code option.
    NPE Queues would control traffic to Fast/Slow HF Code
    LCE Queues control the output rate to Interfaces.
»Multicast Problems:
    Impact of multicast traffic overloading Lookup/Copy and becoming a bottleneck.
»Rx on SideB, can it use SRAM output ring?
    All our other 10G Rx’s have NN output ring.
»Option for HF to send out additional pkts?
»How to pass MR and substrate hdrs to TxB?
    Through Ring or through Hdr Buffer associated with Hdr Buffer descriptor.
    If the latter then what are the constraints in Tx for buffer chaining?

Meeting Notes

1/15/08:
»QM: Add Pkt count to Queue Params, change limit from QLen to PktCount
»Add Per Slice Pkt limit to NPUA and NPUB
»Limit Fanout to 16
»MCast: Control will allocate all 16 entries for a multicast result entry, result entry will be typed as multicast or unicast and will not transition from one to the other.
»What happens to pkts in queues when there is a route change that sends that flow’s pkts to a different interface and queue? Pkt ordering problems?

NPE Version 2 Block Diagram

[Figure: earlier NPE Version 2 block diagram. Blocks: RxA (2 ME), Decap/Parse/LookupA/AddShim (8 MEs), TxA (2 ME), Stats (1 ME), RxB (2 ME), LookupB&Copy (2 ME), QueueManager (4 MEs), HdrFmt (4 MEs), TxB (2 ME), Stats (1 ME); TCAM, SRAM, SPI Switch, NPUA, NPUB, Switch Blade, GPE.]

Annotations on the figure:
»Lookup produces resultIndx, statsIndx
»slice#, resultIndx, etc, passed in shim
»Lookup on <slice#, resultIndx> yields fanout, list of QiDs; copy to queues, adding copy#; (slice#, resultIndx remain in packet buffer)
»use slice# to select slice to format packet; use resultIndx to get next-hop
»for unicast, resultIndx replaced by QiD; allowing output side to skip lookup
»flow control?

Questions/Issues

Where are exit and entry points for packets sent to and from the GPE for exception processing?
»Parse (NPUA) and LookupA (NPUA) are where most exceptions are generated:
    IP Options
    No Route
    Etc.
»HdrFormat (NPUB) is where we do Ethernet header processing

What needs to be in the SHIM going from NPUA to NPUB?
»ResultIndex (32b)
»Exception Bits (12b)
»StatsIndex (16b)
»Slice# (12b)
»???

Will we support multi-copy in a way similar to the ONL Router?

How big can the fanout be?
»How many QIDs need to be stored with the LookupB Result?

Is there some encoding for the QIDs that can take into account support for multicast and the copy#? For example:
    Multicast QID (20b)
        – Multicast (1b): 1
        – Copy# (4b)
        – PerMulticast QID (15b): One PerMulticast QID allocated for each Multicast
    Unicast QID (20b)
        – Unicast (1b): 0
        – QID (19b)

Are there timing/synchronization issues with adding, deleting or changing lookup entries between the two NPUs’ databases?

Do we need flow control between TxA and RxB?
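The QID encoding floated in the question above (a 1-bit multicast flag, then either a 4-bit copy number plus a 15-bit per-multicast QID, or a 19-bit unicast QID) can be sketched as follows. The function names are hypothetical, and placing the flag in the top bit with the copy number next is an assumption consistent with the order in which the fields are listed.

```python
from typing import Optional, Tuple

MCAST_FLAG = 1 << 19   # top bit of the 20-bit QID


def encode_unicast_qid(qid: int) -> int:
    """Unicast: flag bit 0, followed by a 19-bit QID."""
    if not 0 <= qid < (1 << 19):
        raise ValueError("unicast QID must fit in 19 bits")
    return qid


def encode_multicast_qid(copy_num: int, per_mcast_qid: int) -> int:
    """Multicast: flag bit 1, 4-bit copy number, 15-bit per-multicast QID."""
    if not 0 <= copy_num < (1 << 4):
        raise ValueError("copy number must fit in 4 bits")
    if not 0 <= per_mcast_qid < (1 << 15):
        raise ValueError("per-multicast QID must fit in 15 bits")
    return MCAST_FLAG | (copy_num << 15) | per_mcast_qid


def decode_qid(value: int) -> Tuple[bool, Optional[int], int]:
    """Return (is_multicast, copy_num or None, qid)."""
    if value & MCAST_FLAG:
        return True, (value >> 15) & 0xF, value & 0x7FFF
    return False, None, value & 0x7FFFF
```

Under this encoding a single per-multicast QID covers all copies, and the copy number alone distinguishes the up to 16 destinations.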

NPE Version 2 Block Diagram

[Figure: earlier NPE Version 2 block diagram with a “flow control?” annotation.]

NPUA:
»RxA: Same as Version 0
»TxA: New 10Gb/s
»Decap: Same as Version 0
»Parse: Same as Version 0
    New code options?
»LookupA: Results will be different from Version 0
»AddShim: New

NPE Version 2 Block Diagram

[Figure: earlier NPE Version 2 block diagram with a “flow control?” annotation.]

NPUB:
»RxB: Same as Version 0
»TxB: New 10Gb/s
    with L2 Header coming in on input ring?
»LookupB: New
»Copy: New, may be able to use some code from ONL Copy
»QM: New, decoupled from Links
»HF: New, may use some code from Version 0

NPE Version 2 Block Diagram

[Figure: earlier NPE Version 2 block diagram with a “flow control?” annotation. Blocks: RxA (2 ME), Decap/Parse/LookupA/AddShim (8 MEs), TxA (2 ME), StatsA (1 ME), FreeList MgrA (1 ME), Sram2NN (1 ME), RxB (2 ME), LookupB&Copy (2 ME), QueueManager (4 MEs), HdrFmt (4 MEs), TxB (2 ME), StatsB (1 ME), FreeList MgrB (1 ME), Scr2NN (1 ME); TCAM, SRAM, SPI Switch, Switch Blade, GPE.]

NPUB has 17 MEs currently spec’ed

SPP V2: MR Specific Code

Where does the MR Specific Code reside in V2:
»Parse
»HdrFormat

What about LookupA and LookupB?
»Lookup is a “service” provided to the MRs by the Substrate.
»No MR specific code needed in LookupA or LookupB

What about SideA AddShim?
»The Exception bits that go in the shim are MR Specific but they should be passed to AddShim and it will write them into the Shim.
»No MR Specific code needed in AddShim.

What about SideB Copy?
»Is there anything MR specific about setting up multiple copies of a packet?
    There shouldn’t be.
    We will have the Copy block allocate a new hdr buffer descriptor and link it to the existing data buffer descriptor and take care of reference counts.
    The actual building of the new header(s) for the copies will be left to HF.
»No MR Specific code needed in Copy.

SPP V2: Hdr Format

Lots of changes for HF:
»Move behind QM
»More general:
    Support multiple source IP Addresses
    General support for Tunnels
        Eventually different kinds of tunnels (UDP/IP, GRE, …)?
»Support for Multicast
    Dealing with header buffer descriptors
    Reading Fanout table
»Substrate portion of HF will need to do Decap type table lookup
    Slice ID (Code Option, Slice Memory Pointer, Slice Memory Size)

HF gets a buffer descriptor from the QM
»The Substrate portion of HF must determine:
    Code Option (8b)
    Slice ID (12b)
    Location of Next Hop information (20b - 32b)
        LD vs. FWD?
    Stats Index (16b)
        Should HF do this or QM?
»The MR portion of HF must determine:
    Exception bits (16b)

Let’s put all of the above data in the Buf Desc
»LookupB/Copy will need to write it there based on what comes across from SideA in the shim

SPP V2: Result

We need to be much more general in our support for Tunnels, Interfaces, MetaInterfaces, and Next Hops.

SideB Result:
»Interface
    IP SAddr (32b)
    Eth MAC DAddr (48b) (LC, GPE1, GPE2, …, GPEn)
    SchedulerId (8b): which QM should handle pkt
»TxMI:
    IP Sport (16b)
»TxNextHop:
    IP DAddr (32b)
    IP DPort (16b)

Data Areas

Where are the tables and what data is transmitted from SideA to SideB?
»SideA Tables
»Shim between SideA and SideB
»SideB Tables

Pkt Processing Data and Tables

SideA:
»MR/Slice Table:
    Generated by Control
    Used by:
        Substrate Decap to retrieve a MR/Slice’s parameters
    Indexed by SliceId == VLAN
    Contains:
        – Code option
        – Slice Memory ptr
        – Slice Memory size
        – ???
»TCAM:
    Generated by Control
    Used by:
        LookupA
    Contains:
        Key:
        Result:

Data Areas

Shim between SideA and SideB
»Written to DRAM Buffer to be sent from SideA to SideB
»Contains:
    resultIndex (32b):
        Generated by Control
        Result of TCAM lookup on SideA
        Translates into an SRAM Address on SideB
    exceptionBits (12b)
        Generated by SideA Parse/Lookup
        Used by:
            – SideB HF
    statsIndex (16b)
        Generated by Control
        Result of TCAM lookup on SideA
        Used by:
            – SideA Lookup/AddShim to increment counters
            – SideB Lookup/Copy to increment PreQ Cntrs (or perhaps SideA is the PreQ cntrs)
            – SideB HF or QM to increment PostQ Cntrs
    sliceId (12b)
        Generated by Control
        Result of Decap read of Ethernet hdr (VLAN)
        Used by:
            – ???
    codeOption (4b)
    Slice Memory Ptr (32b)

Data Areas

SideB
»Data Buffer Descriptor
»Hdr Buffer Descriptor
    Used for multi-copy packets
    SPP V2 may require Tx to handle multi-buffer packets.
        It is unclear if we can cleanly do the same thing that we do with ONL where HF passes the Ethernet header to Tx.
    We may also need to have support for MR specific per copy data
»Results Table
    Generated by Control
    Used by:
        LookupB/Copy
        HF
            – Should HF get its per copy info from here as well?
    Contains:
        Fanout (if fanout is > 1 we can overload some of the following fields with a pointer into a Fanout table)
        QID
        InterfaceId
        TxMI Id
            – Probably doesn’t help to make it an index into a table for UDP Tunnels since UDP Port is 16 bits
            – But for tunnels other than UDP tunnels it may help?
        TX NextHop Id
            – Index into a table of Tunnel Next Hops

Data Areas (continued)

SideB (continued)
»Fanout Table
    Generated by Control
    Used by:
        LookupB/Copy
        HF
    Contains:
        QID[Fanout]
        InterfaceId
        TxMI Id
        Tx Next Hop ID[Fanout]
    Implementation Choices:
        One contiguous block of memory
            – Fixed size or variable sized
        Chained with one set of values per entry
        Chained with N (N=4?) sets of values per entry
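The last implementation choice, a chain of entries each holding up to N sets of per-copy values, can be sketched as below. This illustrates only that one option; the names are hypothetical, and each per-copy set is reduced to a (QID, Tx next hop ID) pair for brevity.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple

SETS_PER_ENTRY = 4           # the slide's tentative N
Copy = Tuple[int, int]       # (QID, Tx next hop ID), simplified


@dataclass
class FanoutEntry:
    copies: List[Copy]                      # at most SETS_PER_ENTRY sets
    next: Optional["FanoutEntry"] = None    # next entry in the chain


def build_chain(copies: List[Copy], n: int = SETS_PER_ENTRY) -> Optional[FanoutEntry]:
    """Split the per-copy sets into chained entries of up to n sets each."""
    head: Optional[FanoutEntry] = None
    # Build back to front so each entry can point at its successor.
    for start in reversed(range(0, len(copies), n)):
        head = FanoutEntry(copies[start:start + n], head)
    return head


def iter_copies(head: Optional[FanoutEntry]) -> Iterator[Copy]:
    """Walk the chain, yielding one per-copy set at a time."""
    while head is not None:
        yield from head.copies
        head = head.next
```

Compared with one set per entry, a larger N trades some wasted space in the last entry for fewer dependent memory reads per multicast packet.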