Substrate Control: Overview

Fred Kuhns
fredk@arl.wustl.edu

Applied Research Laboratory
Washington University in St. Louis


Fred Kuhns - 04/22/23

Defining Terms and Models


The SPP Node

• Slice instantiation:
  – Allocate a virtual machine (VM) instance on a GPE.
  – The slice may request a code option instance, NPE resources and bandwidth.
• Slices share a common set of (global) IP addresses.
  – The UDP/TCP port space is shared across GPEs and NPEs.
• Line card TCAM filters direct traffic:
  – Unregistered traffic originating outside the node is sent to the CP.
  – Unregistered traffic originating within the node uses NAT (on the line card).
  – An application may register server ports. This causes a filter to be inserted in the line card that directs traffic to a specific GPE.
  – An application must register the ports (or tunnels) associated with fast path instances.
• It is assumed that fast path instances will use tunnels (overlays) to send traffic between routing nodes.
  – Currently only UDP tunnels are supported; support will be extended to GRE and possibly others.

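[Figure: SPP node block diagram. Labeled components: GPE (RMP, NMP, PlanetLab OS, vmx, app), NPE (SRAM, TCAM, SCD, mi-mux, code option, fast path FPx) and line card (LC) facing the Internet. Annotations: ingress maps each flow to an internal destination (GPE or NPE); egress uses the IP route table and ARP; SCD (ARP, NAT); local delivery and exception traffic uses an internal UDP tunnel.]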


Meta-Interfaces and Tunnels

• A slice fast path (code option instance plus allocated resources) is assumed to sit at one end of a tunnel.
  – Currently only UDP tunnels are supported.
  – A UDP tunnel is defined by the 4-tuple
      UDP tunnel: {peer ipaddr, peer port, local ipaddr, local port}
  – Meta-interface (MI): represents a tunnel endpoint as viewed by a slice’s fast path router. A meta-interface is defined by the local endpoint’s address:
      Meta-Interface: {local ipaddr, local UDP port}
• The encapsulated packet is processed by the fast path.
  – The packet is always encapsulated within a tunnel by the substrate.
  – The code option instance processes the encapsulated frame.
• In the SPP context, the slice registers MIs and the substrate manages the encapsulation headers:
  – This guards against forged source addresses.
  – A filter is installed in the corresponding line card’s TCAM to send matching packets to the correct NPE.
  – The NPE’s decap module verifies the encapsulation header and provides isolation between slices (based on the local IP address and port number values in the tunnel header).
  – Fabric VLANs provide link-level isolation between slice instances. The VLAN label is also used by the substrate to associate packets with slice fast paths.

[Figure: a fast path (FPx) with meta-interfaces 0 through 6. Each MI is a local tunnel endpoint (UDP), {external ipaddr, udp_port}.]

MI   IP Address    UDP Port
0    192.168.1.2   6060
1    192.168.1.3   6060
2    192.168.1.2   6061
3    192.168.1.2   6062
4    192.168.1.3   6061
5    192.168.1.3   6062
6    192.168.1.3   6063
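The relationship between a UDP tunnel and its meta-interface can be sketched in a few lines of Python. This is only an illustration of the definitions above: the class names `MetaInterface` and `UdpTunnel` and the helper `rx_mi` are hypothetical and not part of the SPP software; the table contents are copied from the MI table on this slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetaInterface:
    """Local tunnel endpoint: {local ipaddr, local UDP port}."""
    ipaddr: str
    port: int


@dataclass(frozen=True)
class UdpTunnel:
    """UDP tunnel 4-tuple: {peer ipaddr, peer port, local ipaddr, local port}."""
    peer_ipaddr: str
    peer_port: int
    local: MetaInterface


# MI index -> local endpoint, as listed in the slide's table.
MI_TABLE = {
    0: MetaInterface("192.168.1.2", 6060),
    1: MetaInterface("192.168.1.3", 6060),
    2: MetaInterface("192.168.1.2", 6061),
    3: MetaInterface("192.168.1.2", 6062),
    4: MetaInterface("192.168.1.3", 6061),
    5: MetaInterface("192.168.1.3", 6062),
    6: MetaInterface("192.168.1.3", 6063),
}


def rx_mi(dst_ipaddr: str, dst_port: int) -> Optional[int]:
    """Return the MI whose local endpoint matches the outer header, else None."""
    wanted = MetaInterface(dst_ipaddr, dst_port)
    for mi, endpoint in MI_TABLE.items():
        if endpoint == wanted:
            return mi
    return None
```

Several tunnels with different peers can share one `MetaInterface`, which is why the fast path identifies its interfaces by the local endpoint only.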


Lookup Table, TCAM, Use


Lookup filters: Key, Action and Result

• A lookup key is created from the packet’s header fields and the receiving meta-interface.
  – The code option extracts fields from the encapsulated packet.
  – The substrate adds the receiving meta-interface identifier.
• If no entry is found, the packet’s no_route exception attribute is set. Otherwise a result is returned containing an action field and forwarding information (output meta-interface and next hop address).
  – A code option may define additional exception attributes.
• The complete filter specification: {lookup_key, result_vector}
• lookup_key : {RxMI, *copt_key}
  – RxMI : ID of the meta-interface on which the packet was received.
  – copt_key : Lookup key defined by the code option. The IPv4 key:
      {daddr(32), saddr(32), sport(16), dport(16), tcp_flgs(8), proto(8)}
• result_vector : {sindx, action[, qid, TxMI, nexthop]}
  – sindx : stats index.
  – action : packet disposition, one of {drop, fwd, ld}
      • drop : drop the packet
      • fwd : forward the packet using the next hop value (fwdkey)
      • ld : local delivery; code option instance has local address information??
  – qid : packet queue.
  – TxMI : meta-interface used for sending the packet; corresponds to a previously registered local tunnel endpoint. Used to fill in the local address of the outgoing packet’s tunnel header.
  – nexthop : tunnel endpoint for the next hop. For UDP tunnels, this is the IP address and UDP port number of the next hop device.
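As a rough illustration of the filter specification above, the sketch below packs the IPv4 copt_key fields into one integer in the listed order and models the result vector as a small record. The names `ipv4_copt_key`, `ResultVector` and `lookup` are hypothetical, the bit ordering (daddr in the most significant bits) is an assumption, and a dictionary with exact matching stands in for the TCAM, which in reality performs masked matches.

```python
import ipaddress
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


def ipv4_copt_key(daddr: str, saddr: str, sport: int, dport: int,
                  tcp_flgs: int, proto: int) -> int:
    """Concatenate {daddr(32),saddr(32),sport(16),dport(16),tcp_flgs(8),proto(8)}."""
    key = int(ipaddress.IPv4Address(daddr))
    key = (key << 32) | int(ipaddress.IPv4Address(saddr))
    key = (key << 16) | (sport & 0xFFFF)
    key = (key << 16) | (dport & 0xFFFF)
    key = (key << 8) | (tcp_flgs & 0xFF)
    key = (key << 8) | (proto & 0xFF)
    return key  # 112 bits in total


@dataclass(frozen=True)
class ResultVector:
    sindx: int                                  # stats index
    action: str                                 # "drop", "fwd" or "ld"
    qid: Optional[int] = None                   # packet queue
    tx_mi: Optional[int] = None                 # output meta-interface
    nexthop: Optional[Tuple[str, int]] = None   # (ipaddr, UDP port)


Table = Dict[Tuple[int, int], ResultVector]


def lookup(table: Table, rx_mi: int, copt_key: int):
    """Return (result, no_route); no_route is True when no entry matches."""
    result = table.get((rx_mi, copt_key))
    return result, result is None


# Example: one forwarding entry for a TCP SYN (flags 0x02, proto 6) seen on MI 0.
example_key = ipv4_copt_key("10.5.2.9", "192.168.1.2", 1234, 80, 0x02, 6)
example_table = {
    (0, example_key): ResultVector(sindx=5, action="fwd", qid=1, tx_mi=1,
                                   nexthop=("10.50.10.1", 6061)),
}
```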


Slice view of the Lookup Key

• When a packet is received, the substrate creates a lookup key using the target slice’s xsid and the receiving meta-interface. The remaining bits are defined by the code option.
  – xsid’ : represents the internal slice ID and may differ from the value of xsid. For implementation efficiency, this is the VLAN identifier assigned to the slice.
  – xmi : internal representation of the meta-interface (MI), an encoding of the receiving tunnel endpoint.
      • For UDP tunnels this field includes a 4-bit interface id and the 16-bit local UDP port number. The 4-bit id is used as an index into a table of local IP addresses.
• The IPv4 code option defined fields are shown below, where pr is the IP protocol field and tcp is the TCP header flags.

[Figure: lookup key layout with fields xsid’ (12 bits), xmi (N bits) and slice defined fields (128-N bits). The user specified lookup key is four 32-bit words.]
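The xmi encoding for UDP tunnels can be illustrated with a short sketch. The slide only states that the field holds a 4-bit interface id and the 16-bit local UDP port; placing the id in the upper bits is an assumption made here, and `encode_xmi`, `decode_xmi` and `LOCAL_IPADDRS` are hypothetical names.

```python
from typing import Tuple

# Hypothetical table of local IP addresses, indexed by the 4-bit interface id.
LOCAL_IPADDRS = ["192.168.1.2", "192.168.1.3"]


def encode_xmi(if_id: int, udp_port: int) -> int:
    """Pack a 4-bit interface id and a 16-bit UDP port into a 20-bit value."""
    if not 0 <= if_id < 16:
        raise ValueError("interface id must fit in 4 bits")
    if not 0 <= udp_port <= 0xFFFF:
        raise ValueError("UDP port must fit in 16 bits")
    return (if_id << 16) | udp_port


def decode_xmi(xmi: int) -> Tuple[int, int]:
    """Return (interface id, UDP port) from a 20-bit xmi value."""
    return (xmi >> 16) & 0xF, xmi & 0xFFFF


def xmi_endpoint(xmi: int) -> Tuple[str, int]:
    """Resolve an xmi to the {local ipaddr, local UDP port} it stands for."""
    if_id, port = decode_xmi(xmi)
    return LOCAL_IPADDRS[if_id], port
```

Keeping only an index in the key, rather than the full 32-bit address, is what lets the meta-interface fit in a small fixed-width field.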


IPv4 TCAM Filter Formats (on NPE)

[Figure: bit layouts of the IPv4 TCAM key and result.]

Key:
  – Substrate defined fields: T (1 bit; T = 0: normal lookup, T = 1: substrate only lookup), vlan, and if plus RX port, which represent the input meta-interface.
  – Fields defined by the IPv4 code option, 112 bits: daddr (32), saddr (32), sport (16), dport (16), tcp/proto (16). The figure shows separate TCP and non-TCP (!TCP) layouts for the tcp/proto field, made up of flags, reserved (RSV) and proto bits.

Result, 64 bits:
  – TX IP daddr, TX dport, TX sport. The TX IP address and sport represent the output meta-interface. The dport is provided by the slice. (The RMP maps the miid to tx tunnel params and uses the dport provided by the slice.)
  – Flags: D (drop packet), L (local delivery).
  – sindx: global stats index. (The SCD maps the slice’s sindx to the global value.)
  – QM, Sch, qid: 20-bit internal qid. (The SCD maps the slice’s miid to QM and Sch. The SCD also maps the slice’s qid to the global qid value.)

Slice parameters:
  Key: Input miid, IPv4 fltr {daddr, saddr, sport, dport, tcp/proto}
  Result: Flags {Drop, GPE}, sindx, Output miid, QID


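Lookup

• The parse block makes the copt_key.
• The substrate adds the xsid and xmi fields.
• The substrate uses the TxMI and nexthop fields to construct the encapsulation header.

[Figure: lookup data path. The decap stage annotates the packet with {xsid, RxMI}; the parse block produces copt_key; Lookup A is keyed by xsid:RxMI:copt_key (xsid’, xmi, slice defined fields) and returns sindx:action:qid:TxMI:nexthop; TxMI:nexthop is passed on for building the outgoing header.]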


Version 2 and Multicast

• In version 2 there will be 2 stages to the lookup; add a fanout (count) to Lookup B.
• If fanout > 1 then the entry holds the address of the fanout table, else the result vector. Fanout blocks can be chained.
• TxMI includes an interface vector: a 4-bit field that is used to look up the interface IP address and MAC address.

[Figure: two-stage lookup. The decap stage annotates the packet with {xsid, RxMI}; the parse block feeds Lookup A, which maps lookup_key (xsid’, xmi, slice defined fields) to action:sindx:rindx. The result index (rindx, overloaded with the fanout address) selects an sindx:qid:TxMI:nexthop entry in Lookup B or a fanout table of qid:TxMI:nexthop entries; sindx is passed from side A. Notes: VLAN table in header format and VLAN table in Decap/Parse.]


Lookup Example

• When a code option is requested the slice is allocated the requested number of TCAM entries; fid ∈ {0, ..., Nf-1}.
  – All TCAM operations accept a TCAM entry ID (fid).
  – Entries are listed in priority order, with fid=0 the highest priority and entry Nf-1 the lowest.
• It is up to the slice control path to order the lookup entries.
  – For example, if we have the simple routing database:
      10.10.2.1/32   Local delivery (GPE)
      10.5.2.0/24    NH A
      10.5.1.0/24    NH B
      10.5.0.0/16    NH C
• Then the control software could use the following:
      write_fltr(fid, rxmi, {prefix,width}, action, {qid,TxMI,nexthop})
      write_fltr(0, *, {10.10.2.1, 0xFFFFFFFF}, LD)
      write_fltr(1, *, {10.5.2.0, 0xFFFFFF00}, fwd, {1, 1, NHA})
      write_fltr(2, *, {10.5.1.0, 0xFFFFFF00}, fwd, {2, 2, NHB})
      write_fltr(3, *, {10.5.0.0, 0xFFFF0000}, fwd, {3, 3, NHC})

Desired Route Table (LPM)
prefix          TxMI   nexthop
10.10.2.1/32    0*     Local
10.5.2.0/24     1      NH A
10.5.1.0/24     2      NH B
10.5.0.0/16     3      NH C

Slice Meta-Interfaces
MI   IP Address    UDP Port
0    192.168.1.2   6060
1    10.50.10.2    6061
2    10.50.10.2    6062
3    10.1.1.1      6060

Slice BW Allocations
Interface   BW        ipAddr
0*          BE        192.168.1.2
1           100Mbps   10.50.10.2
2           10Mbps    10.1.1.1

Slice Queue Bindings
QID   Interface   BW     max Bytes
0     0*          -      Local*
1     1           40%    1024
2     1           60%    1024
3     2           100%   1024
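The effect of the priority ordering can be checked with a small software model of the four filters. This is a sketch only: the Python `write_fltr` below mimics the call shape used on the slide but is not the real control interface, the RxMI wildcard `*` is represented by `None`, and the TCAM is modeled as a list scanned from fid 0 upward.

```python
import ipaddress

NF = 4                      # number of TCAM entries allocated to the slice
tcam = [None] * NF          # index = fid, lower fid = higher priority


def write_fltr(fid, rxmi, prefix, mask, action, result=None):
    """Store a filter; rxmi=None stands for the '*' wildcard."""
    tcam[fid] = (rxmi, int(ipaddress.IPv4Address(prefix)), mask, action, result)


def lookup(rxmi, daddr):
    """Return (fid, action, result) of the first matching entry, else None."""
    d = int(ipaddress.IPv4Address(daddr))
    for fid, entry in enumerate(tcam):
        if entry is None:
            continue
        e_rxmi, value, mask, action, result = entry
        if e_rxmi is not None and e_rxmi != rxmi:
            continue
        if (d & mask) == (value & mask):
            return fid, action, result
    return None             # corresponds to the no_route exception


# The four filters from the example: {qid, TxMI, nexthop} as a tuple.
write_fltr(0, None, "10.10.2.1", 0xFFFFFFFF, "ld")
write_fltr(1, None, "10.5.2.0", 0xFFFFFF00, "fwd", (1, 1, "NHA"))
write_fltr(2, None, "10.5.1.0", 0xFFFFFF00, "fwd", (2, 2, "NHB"))
write_fltr(3, None, "10.5.0.0", 0xFFFF0000, "fwd", (3, 3, "NHC"))
```

Because the /24 entries sit at lower fids than the covering /16, an address such as 10.5.2.7 hits filter 1 rather than filter 3, which is the longest-prefix behavior the slice control path is responsible for arranging.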


Example IPv4 LPM

• In general, for longest prefix match a good strategy is to divide the allocated filters into 32 sets.
• For example, assume 1024 TCAM entries have been allocated and we are using LPM.
  – Divide the filters into 32 sets of 32 filters each and associate a prefix length with each:

      Prefix Width   Filter ID Range
      32             0 - 31
      31             32 - 63
      w              (32-w)*32 + (0...31)
      1              992 - 1023

  – Then add each prefix to the set for its prefix width.
  – Entries within a set are non-overlapping so their order doesn’t matter.
  – This is the scheme used by software written by IDT, the manufacturer of the TCAM we currently use.
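The table's formula, (32-w)*32 + (0...31), can be written as a small helper. The function name `fid_range` and its `per_set` parameter are illustrative; the sketch assumes prefix widths 1 through 32 as in the table, so a /0 default route is not covered.

```python
def fid_range(width: int, per_set: int = 32) -> range:
    """Filter IDs reserved for prefixes of the given width (1..32)."""
    if not 1 <= width <= 32:
        raise ValueError("prefix width must be in 1..32")
    start = (32 - width) * per_set
    return range(start, start + per_set)


def total_filters(per_set: int = 32) -> int:
    """Number of TCAM entries needed for all 32 sets."""
    return 32 * per_set
```

For example, a /24 prefix goes into one of the 32 slots starting at fid (32-24)*32 = 256, below every /25 or longer prefix and above every /23 or shorter one.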


Keeping track of TCAM entries

• The substrate will have to manage the mapping of VM TCAM filter IDs to the actual filter IDs.
• VM control software will use a normalized filter index list (it starts at 0 and has the requested number of filter entries).
• The SCD (xscale daemon) must map the per-VM index into the actual TCAM index.
• Source for managing TCAM entries.
• NPU A and B share a common TCAM and index range, so this must be managed across the two xscales.
  – See the C++ implementation of the RangeMap class in $WUSRC/range.
  – The class will also be used for managing the QID name space.
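The per-VM to actual index mapping might look roughly like the sketch below. It is not the RangeMap class referenced above, whose interface is not shown in these slides; `IndexMap` and its methods are hypothetical stand-ins that only illustrate the idea of a normalized per-slice index backed by a shared pool.

```python
class IndexMap:
    """Maps per-slice normalized indices (0..n-1) to shared global indices."""

    def __init__(self, size: int):
        self.free = list(range(size))   # unallocated global indices, ascending
        self.slices = {}                # xsid -> list of global indices

    def alloc(self, xsid: int, count: int) -> None:
        """Reserve `count` global indices for a slice."""
        if xsid in self.slices:
            raise ValueError("slice already has an allocation")
        if count > len(self.free):
            raise ValueError("not enough free entries")
        self.slices[xsid] = self.free[:count]
        self.free = self.free[count:]

    def to_real(self, xsid: int, index: int) -> int:
        """Translate a slice's normalized index to the global index."""
        table = self.slices[xsid]
        if not 0 <= index < len(table):
            raise IndexError("index outside the slice's allocation")
        return table[index]

    def release(self, xsid: int) -> None:
        """Return a slice's indices to the pool."""
        self.free.extend(self.slices.pop(xsid))
        self.free.sort()
```

Handing out the lowest free indices in ascending order keeps a slice's relative filter priorities intact even when its global range is not contiguous. The same structure could serve the QID name space.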


Control Software: Resource Management


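[Figure: control software architecture. Labeled components: CP with the System Resource Manager (SRM), its Resource DB and the SNM; GPE with RMP, NMP, PlanetLab OS root context, vnet and slice VMs (vmx) running control software; NPE with SRAM, TCAM, SCD and fast paths (FPx, FPk); LC with SCD, TCAM and MUX; SP. Node components not in the hub: switch, GPEs, development hosts. Annotations: exception and local delivery traffic includes a shim header with RxMI; fast path configuration is supported via the PLC.]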


Partitioning of (substrate) Responsibilities

• Virtual Machine (slice control SW): application logic, code option specific control and data operations.
  – traditional PlanetLab slice operations
  – manages code option specific lookup tables, stats, memory and configuration blocks
  – implements the interface with the fast path for exception and local delivery traffic
• vnet
  – flow isolation: filtering traffic through the Linux kernel
  – add support for VLAN-based filtering and port reservation
• Resource Manager Proxy (aka Local Resource Manager)
  – all VM commands are issued to the RMP
      • the RMP is able to validate the command sender (authenticate)
      • enforces access restrictions (authorize)
      • decouples VMs from substrate control entities; that is, it maps exported abstractions and interfaces to specific hardware and software interfaces
  – verifies (or inserts) substrate message header slice IDs to prevent deliberate or accidental masquerading, part of ensuring isolation and security
  – in tandem with the SRM, implements device independent logic
• System Resource Manager
  – device independent logic
  – responsible for implementing and enforcing
      • system resource abstractions
      • resource isolation and allocation policies
      • facilitating SNM: implementing PlanetLab compatible behavior and abstractions
• Substrate Control Daemon
  – intermediary between VM and code option instances (vouches for the VM)
  – enforces policies on resource allocations and isolation in the control plane
  – implements device dependent logic


Responsibilities

[Figure: tables kept by each control entity; the RMP requests an allocation and the SRM makes the allocation.
  SRM (the “Decider”):
    – System tables: Interfaces, ifn:{type,ipaddr,linkBW,availBW}; endpoint (port) maps (resvMap, availMap, usedMaps, xsidMap); NPE Table, id:{addr,BW/Port,copts,fltrs,sram,Qs}; VLAN maps, range:{start,end} and vlanid:xsid.
    – Per slice tables: xsid; vlan; plab slice ID; meta-ifaces, mi:endpoint; endpoints, id:{type,ipaddr,port,proto,board,bw}; gpe (board id, BW); NPE (allocated): board ID, BW, sram {start,size}, #flts, #Qs, #Stats.
  SCD (NPE):
    – Per slice data: slice maps, xsid:{qidMap,FidMap,statsMap}; interface BW; slice SRAM assignments, xsid:{sram_start,sram_size}.
    – Tables in the data path: VLAN table, vlan to copt:sram_addr; SRAM base with xsid:size and xsid:offset; lookup table, queue params and stats table, each xsid:range; per interface scheduler and rate limits. Slice fid, qid and sid values map to “real” indices; ranges are not required to be contiguous. Open items: HF control block? code option control blocks?
  GPE (RMP): endpoint (port) maps (servMap, resvMap); control IP BW maps??]

RMP Responsibilities
• Translate slice MI to local endpoint. Either call the SRM or cache mappings.
• Add xsid to the subMsg header.
• Pass through identifiers mapped by the SCD: qid, fid and stats.
• Pass through relative queue weights; the SCD maps them to global weights.

SCD Responsibilities
• Translate slice specific indices to global indices: qid, fid and stats.
• Knows the location of all tables.
• Interprets commands to add, remove and modify entries in data path tables.
• Knows the per slice interface BW allocation and maps relative queue weight to global weight.
• Each interface scheduler is assigned (by the SRM) a max rate.


Queuing and allocating Interface Bandwidth


Simple Queuing Example

[Figure: two slice fast paths on an NPE, each paired with a GPE (FP1, FP2). FP slice1 has queues q10, q11, ..., q1n’ (qid in 0...n-1) and FP slice2 has queues q20, q21, ..., q2m’ (qid in 0...m-1). Each slice’s queues feed a wrr scheduler with bandwidth BW11 and BW21 respectively; a second wrr stage feeds the LC interface (ipAddr, linkBW) with bandwidth BW1, where BW11 + BW21 = BW1.]

Slice Interface and Queue Allocations:
  {Port, BW, QList}, QList = {{qid, weight, threshold}, ...}

Physical Port (Interface)
Attributes:
  {ifn, type, ipaddr, linkBW, availBW}
  ifn : Interface number
  type : {Internet, Peering}
Operations:
  get_interfaces()
  get_ifattrs(ifn)
  get_ifpeer(ifn)
  alloc_ifbw(ifn, xsid, bw)
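The bookkeeping implied by alloc_ifbw and the relation BW11 + BW21 = BW1 can be sketched as follows. Only the attribute names {ifn, type, ipaddr, linkBW, availBW} and the operation name come from the slide; the `Interface` record, the boolean return value and the admission rule (reject a request that exceeds availBW) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Interface:
    ifn: int                  # interface number
    type: str                 # "Internet" or "Peering"
    ipaddr: str
    link_bw: int              # linkBW, in bits per second
    avail_bw: int             # availBW, not yet allocated to slices
    allocs: Dict[int, int] = field(default_factory=dict)  # xsid -> bw


def alloc_ifbw(iface: Interface, xsid: int, bw: int) -> bool:
    """Reserve bw on the interface for slice xsid; False if it does not fit."""
    if bw <= 0 or bw > iface.avail_bw:
        return False
    iface.avail_bw -= bw
    iface.allocs[xsid] = iface.allocs.get(xsid, 0) + bw
    return True
```

With this rule the per-slice allocations on an interface can never sum to more than linkBW, which matches the figure's BW11 + BW21 = BW1 when the link is fully allocated.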


Substrate Message Format


Substrate Message

[Figure: message layout. The msg header consists of four 16-bit fields (bits 0-15 each), drawn in the order mlen, mid, cmd, cid, followed by the body: 0-N bytes.]

mlen: Total message length, including the header.
mid: Message ID, used to support synchronous message processing.
cid: Context identifier. Specifies the context within which the message is processed. A value of 0 indicates the substrate context.
cmd: Command to execute or a return code.
The 4 header fields are each 16 bits.
body: 0 or more bytes of command data.

• Assume a simple command-response (two-way) messaging framework, but one-way schemes will also be supported.
• Supports asynchronous communications using a message ID.
• The command field is overloaded for the return code.
• Every server is expected to implement a simple Version command (cmd == 0) which returns the server’s ID and version number as two 32-bit fields.
  – The primary use is for monitoring the health of servers and debugging.
  – All other command values are unique only to a particular server.
• Uses UDP as the transport protocol.
• All commands are expected to be idempotent.
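A minimal sketch of packing and parsing the header with Python's `struct` module follows. The wire order mlen, mid, cmd, cid is inferred from the figure and is an assumption, as are the helper names `pack_msg`, `unpack_msg` and `parse_version_reply`; big-endian encoding is used throughout.

```python
import struct

HEADER = struct.Struct("!HHHH")   # mlen, mid, cmd, cid: four 16-bit fields


def pack_msg(mid: int, cmd: int, cid: int, body: bytes = b"") -> bytes:
    """Build a message; mlen counts the header plus the body."""
    return HEADER.pack(HEADER.size + len(body), mid, cmd, cid) + body


def unpack_msg(data: bytes):
    """Return (mid, cmd, cid, body) after checking the length field."""
    mlen, mid, cmd, cid = HEADER.unpack_from(data)
    if mlen != len(data):
        raise ValueError("mlen does not match the datagram length")
    return mid, cmd, cid, data[HEADER.size:]


def parse_version_reply(body: bytes):
    """Version reply body: server ID and version number, two 32-bit fields."""
    server_id, version = struct.unpack("!II", body)
    return server_id, version
```

A Version request would then be `pack_msg(mid, 0, 0)`, and the matching reply is recognized by its mid value.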


Overview

• In the interface specifications I provide a C-like description of the operations and results.
• The descriptions are only intended to describe the actual message format, data fields and returned results. They are not meant to specify an application level library.
• The arguments are to be encoded into the message body in the order that they are given, using network byte order (big endian) and without padding.
• All commands result in one of:
  1. No return response: one-way call semantics.
  2. An error occurs processing the message, or the command encounters an unexpected condition or error. In this case the return message will have the error return code in the cmd field.
  3. The command completes and does not indicate an error to the message framework; the message result code then indicates success. The message body contains any result data.


Example Message

• A slice with an xsid of 0x10 requests the allocation of a global UDP port (IP protocol 17 decimal) for the local IP address 128.252.130.34 (hex 0x80FC8222).
  – Assume the alloc_port command ID is 4.
      port = alloc_port(0x80FC8222, 0, 17)
  – This allocates a global UDP (decimal 17) port for the local IP address 128.252.130.34 (hex 0x80FC8222) and lets the system assign the next available port number.
• The resource manager allocates port 5050 (0x13BA); the return code of 0 indicates success.

Command Message (hex):
  header: mlen = F, mid = 1, cmd = 4, cid = 10
  body:   80 FC 82 22 | 00 00 | 11

Reply Message (hex):
  header: mlen = F, mid = 1, cmd = 0, cid = 10
  body:   80 FC 82 22 | 13 BA | 11
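The byte layout of this example can be reproduced with `struct`. The sketch assumes the header order mlen, mid, cmd, cid shown in the figures and a body of a 32-bit address, a 16-bit port and an 8-bit protocol, which is what the 15-byte (0xF) length implies; the variable names are illustrative.

```python
import struct

# Header (4 x 16 bits) followed by ipaddr (32), port (16), proto (8);
# "!" selects network byte order and disables padding.
ALLOC_PORT_MSG = struct.Struct("!HHHHIHB")

ALLOC_PORT_CMD = 4          # command ID assumed on the slide
XSID = 0x10                 # carried in the cid field
LOCAL_IP = 0x80FC8222       # 128.252.130.34
UDP_PROTO = 17

# Command: port 0 asks the system to pick the next available port.
command = ALLOC_PORT_MSG.pack(ALLOC_PORT_MSG.size, 1, ALLOC_PORT_CMD, XSID,
                              LOCAL_IP, 0, UDP_PROTO)

# Reply bytes as shown on the slide; the cmd field now holds the return code.
reply = bytes.fromhex("000f00010000001080fc822213ba11")
mlen, mid, rcode, cid, ipaddr, port, proto = ALLOC_PORT_MSG.unpack(reply)
```

Decoding the reply yields return code 0 and port 5050 for the same address and protocol, with mid 1 tying it back to the command.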


NAT


• Problem:
  – UDP, TCP: 2 or more GPEs attempt to use the same global IP, port and proto
  – ICMP: ???


[Equations: weight computation involving the weights w and W, the bandwidths BW and the MTU, with min terms, indexed by i and j.]