Post on 13-Jan-2016
1
Router Construction II
Outline
• Network Processors
• Adding Extensions
• Scheduling Cycles
2
Observations
• Emerging commodity components can be used to build IP routers
  – switching fabrics, network processors, …
• Routers are being asked to support a growing array of services
  – firewalls, proxies, p2p nets, overlays, ...
3
Router Architecture
[Figure: a router splits into a control plane (BGP, RSVP, …) and a data plane (IP forwarding)]
4
Software-Based Router
+ Cost
+ Programmability
– Performance (~300 Kpps)
– Robustness
[Figure: data plane (IP) and control plane (BGP, RSVP, …) both run on a PC]
5
Hardware-Based Router
– Cost
– Programmability
+ Performance (25+ Mpps)
+ Robustness
[Figure: data plane (IP) on an ASIC; control plane (BGP, RSVP, …) on a PC]
6
NP-Based Router Architecture
+ Cost ($1500)
+ Programmability
? Performance
? Robustness
[Figure: data plane and control plane exchange packet flows; data plane on an IXP1200, control plane on a PC]
7
In General...
[Figure: a scalable configuration pairs multiple IXPs with multiple Pentiums]
8
Architectural Overview
[Figure: virtual routers map network services onto hardware configurations; packet flows traverse forwarding paths and switching paths]
9
Virtual Router
• Classifiers
• Schedulers
• Forwarders
10
Simple Example
[Figure: a virtual router composing an IP forwarder, a proxy, and an active protocol]
11
Intel IXP
[Figure: IXP1200 chip: a StrongARM core, 6 MicroEngines, scratch memory, and FIFOs on-chip; SRAM and DRAM off-chip; the IX Bus connects the MAC ports and the PCI Bus connects the PC]
12
Processor Hierarchy
[Figure: the processor hierarchy: MicroEngines at the bottom, StrongARM in the middle, Pentium on top]
13
Data Plane Pipeline
[Figure: input contexts copy 64-byte input FIFO slots into DRAM packet buffers; queues and shared state live in SRAM; output contexts copy buffers to the output FIFO slots]
14
Data Plane Processing
INPUT context loop:
    wait_for_data
    copy in_fifo → regs
    Basic_IP_processing
    copy regs → DRAM
    if (last_fragment)
        enqueue → SRAM

OUTPUT context loop:
    if (need_data)
        select_queue
        dequeue ← SRAM
    copy DRAM → out_fifo
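The two context loops can be sketched as runnable Python. This is a model, not the MicroEngine code: the names (`input_context`, the descriptor-as-index scheme) are invented for illustration, and `Basic_IP_processing` is omitted.

```python
from collections import deque

FRAG = 64  # bytes per FIFO slot, as on the slide

def input_context(fragments, dram, queue):
    """Assemble 64-byte fragments into a DRAM buffer; enqueue on the last one."""
    buf = bytearray()
    for data, last in fragments:        # wait_for_data; copy in_fifo -> regs
        buf += data                     # copy regs -> DRAM (IP processing omitted)
        if last:                        # if (last_fragment)
            dram.append(bytes(buf))
            queue.append(len(dram) - 1) # enqueue descriptor -> SRAM
            buf = bytearray()

def output_context(dram, queue, out_fifo):
    """Dequeue descriptors and copy packets from DRAM to the output FIFO."""
    while queue:                        # if (need_data): dequeue <- SRAM
        out_fifo.append(dram[queue.popleft()])  # copy DRAM -> out_fifo

dram, queue, out = [], deque(), []
pkt = [(b"x" * FRAG, False), (b"y" * 36, True)]  # one 100-byte packet, 2 fragments
input_context(pkt, dram, queue)
output_context(dram, queue, out)
```

The descriptor queue decouples the two loops, which is what lets input and output contexts run at different rates.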
15
Pipeline Evaluation
[Figure: forwarding rate (Mpps) vs. number of MicroEngine contexts (0 to 24), input and output measured independently; 100 Mbps Ethernet ≈ 0.142 Mpps]
16
What We Measured
• Static context assignment
  – 16 input / 8 output
• Infinite offered load
• 64-byte (minimum-sized) IP packets
• Three different queuing disciplines
17
Single Protected Queue
• Lock synchronization
• Max 3.47 Mpps
• Contention lower bound 1.67 Mpps
[Figure: three input contexts share one protected queue, drained by one output context into the output FIFO]
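The single-protected-queue discipline can be sketched with Python threads standing in for input contexts. This is illustrative only (thread and packet counts are made up); the point is that every enqueue and dequeue passes through one lock, which is the contention that caps throughput.

```python
import threading
from collections import deque

queue = deque()
lock = threading.Lock()

def input_context(pkts):
    for p in pkts:
        with lock:              # the synchronization point every input shares
            queue.append(p)

# three input contexts, as in the figure
threads = [
    threading.Thread(target=input_context,
                     args=([f"pkt-{i}-{j}" for j in range(1000)],))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

forwarded = []
while queue:                    # the single output context drains the queue
    with lock:
        forwarded.append(queue.popleft())
```

The private-queue variants on the next slides remove input-side contention at the cost of making the output context select among queues.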
18
Multiple Private Queues
• Output must select a queue
• Max 3.29 Mpps
[Figure: each input context has a private queue; one output context selects among them and drains to the output FIFO]
19
Multiple Protected Queues
• Output must select a queue
• Some QoS scheduling (16 priority levels)
• Max 3.29 Mpps
[Figure: multiple protected queues, drained by one output context into the output FIFO]
20
Data Plane Processing

INPUT context loop:
    wait_for_data
    copy in_fifo → regs
    Basic_IP_processing
    copy regs → DRAM
    if (last_fragment)
        enqueue → SRAM

OUTPUT context loop:
    if (need_data)
        select_queue
        dequeue ← SRAM
    copy DRAM → out_fifo
21
Cycles to Waste

INPUT context loop:
    wait_for_data
    copy in_fifo → regs
    Basic_IP_processing
    nop
    nop
    …
    nop
    copy regs → DRAM
    if (last_fragment)
        enqueue → SRAM

OUTPUT context loop:
    if (need_data)
        select_queue
        dequeue ← SRAM
    copy DRAM → out_fifo
22
How Many “NOPs” Possible?
[Figure: forwarding rate (Mpps, 0 to 3.5) vs. extra register/SRAM operations per packet (0/0 to 640/64); a line marks the IXP1200 evaluation board's rate of 1.2 Mpps = 8 × 100 Mbps]
23
Data Plane Extensions
Processing          Memory Ops   Register Ops
Basic IP                 6             32
TCP Splicer              6             45
TCP SYN Monitor          1              5
ACK Monitor              3             15
Port Filter              5             26
Wavelet Dropper          2             28
24
Control and Data Plane
[Figure: a Smart Dropper in the data plane and Layered Video Analysis in the control plane communicate through shared state]
25
What About the StrongARM?
• Shares the memory bus with the MicroEngines
  – must respect the resource budget
• What we do
  – control DMA between the IXP1200 and the Pentium
  – control the MicroEngines
• What might be possible
  – anything within budget
  – exploit instruction and data caches
• We recommend against
  – running Linux
26
Performance
[Figure: the processor hierarchy annotated with measured rates. MicroEngines: 3.47 Mpps with no VRP, or 1.13 Mpps within the VRP budget. Pentium: 310 Kpps at 1510 cycles/packet.]
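The Pentium numbers are internally consistent, which is worth checking: 1510 cycles per packet at 310 Kpps implies the clock rate below (the implied clock is our arithmetic, not a figure from the slide).

```python
cycles_per_packet = 1510        # measured per-packet cost on the Pentium
rate_pps = 310e3                # measured forwarding rate

# rate * cycles/packet = cycles/second the CPU must supply
implied_clock = rate_pps * cycles_per_packet
print(implied_clock / 1e6, "MHz")   # 468.1, i.e. roughly a 450 MHz-class Pentium
```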
27
Pentium
• Runs protocols in the control plane
  – e.g., BGP, OSPF, RSVP
• Runs other router extensions
  – e.g., proxies, active protocols, overlays
• Implementation
  – runs Scout OS + Linux IXP driver
  – CPU scheduler is key
28
Processes...
[Figure: packets flow from the input port through a chain of processes on the Pentium to the output port]
29
Performance
[Figure: aggregate forwarding rate (Kpps, 0 to 350) vs. aggregate offered load (Kpps, 0 to 400); curves for interrupt-driven I/O and for polling with 1, 2, and 3 processes, plus polling with 3 processes without batching]
30
Performance (cont)
[Figure: forwarding rate (Kpps, 0 to 300) on a 450 MHz P-II vs. a Cisco 7200, for: IP --, IP ++, Active IP (native), Active IP (Java), Transparent Proxy, Classic Proxy]
31
Scheduling Mechanism
• Proportional share forms the base
  – each process reserves a cycle rate
  – provides isolation between processes
  – unused capacity fairly distributed
• Eligibility
  – a process receives its share only when its source queue is not empty and its sink queue is not full
• Batching
  – to minimize context-switch overhead
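These pieces can be sketched as a stride-style proportional-share scheduler with the eligibility rule bolted on. Everything here (class names, shares, queue capacities) is illustrative, not the Scout implementation:

```python
class Proc:
    def __init__(self, name, share):
        self.name, self.share = name, share
        self.stride = 1_000_000 // share    # smaller share -> larger stride
        self.passval = 0
        self.src, self.sink = [], []

    def eligible(self, sink_cap):
        # the slide's rule: source queue non-empty and sink queue not full
        return bool(self.src) and len(self.sink) < sink_cap

def schedule(procs, quanta, sink_cap=10_000):
    runs = {p.name: 0 for p in procs}
    for _ in range(quanta):
        ready = [p for p in procs if p.eligible(sink_cap)]
        if not ready:
            break
        p = min(ready, key=lambda q: q.passval)   # lowest pass value runs next
        p.passval += p.stride                     # advance by its stride
        p.sink.append(p.src.pop())                # process one packet
        runs[p.name] += 1
    return runs

a, b = Proc("A", share=2), Proc("B", share=1)
a.src = ["pkt"] * 900
b.src = ["pkt"] * 900
runs = schedule([a, b], 900)
print(runs)   # shares 2:1 -> roughly {'A': 600, 'B': 300}
```

Because ineligible processes are simply skipped, their unused capacity falls to whoever is eligible, which is the fair redistribution the slide describes.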
32
Share Assignment
• QoS Flows
  – assume link rate is given, derive cycle rate
  – conservative rate to input process
  – keep batching level low
• Best Effort Flows
  – may be influenced by admin policy
  – use shares to balance system (avoid livelock)
  – keep batching level high
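The link-rate-to-cycle-rate derivation for a QoS flow is just rate times per-packet cost. A sketch, where the 1510 cycles/packet figure is the deck's measured Pentium cost and the safety factor is an assumed way of making the reservation conservative:

```python
def cycle_reservation(link_pps, cycles_per_packet=1510, safety=1.2):
    """Cycles/second to reserve for a flow at a given packet rate."""
    return link_pps * cycles_per_packet * safety

# a 90 Kpps QoS flow (the rate used in the later experiments):
reserved = cycle_reservation(90e3)
print(reserved / 1e6, "Mcycles/s")   # ~163.1
```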
33
Experiment
[Figure: experimental setup with three flows: A (best effort), B (QoS), and C (QoS); A and C share an output, B is separate]
34
Mixing Best Effort and QoS
[Figure: forwarding rate (Kpps, 0 to 250) vs. Flow A offered load (Kpps, 0 to 140); curves for the aggregate, Flow A (BE), Flow B (QoS, 90 Kpps), and Flow C (QoS, 90 Kpps)]
• Increase offered load from A
35
CPU vs Link
[Figure: forwarding rate (Kpps, 0 to 250) vs. Flow A additional processing delay (µs, 0 to 60); curves for the aggregate, Flow A (BE), Flow B (QoS, 90 Kpps), and Flow C (QoS, 90 Kpps)]
• Fix A at 50 Kpps, increase its processing cost
36
Turn Batching Off
[Figure: forwarding rate (Kpps, 0 to 250) vs. Flow A additional processing delay (µs, 0 to 60); curves for the aggregate, Flow A (BE), Flow B (QoS, 90 Kpps), and Flow C (QoS, 90 Kpps)]
• CPU efficiency: 66.2%
37
Enforce Time Slice
[Figure: forwarding rate (Kpps, 0 to 250) vs. Flow A additional processing delay (µs, 0 to 60); curves for the aggregate, Flow A (BE), Flow B (QoS, 90 Kpps), and Flow C (QoS, 90 Kpps)]
• CPU efficiency: 81.6% (30 µs quantum)
38
Batching Throttle
• Scheduler Granularity: G
  – a flow processes as many packets as possible within G
• Efficiency Index: E, Overhead Threshold: T
  – keep the overhead under T%; then 1 / (1 + T) < E
• Batch Threshold: Bi
  – don't consider Flow i active until it has accumulated at least Bi packets, where Csw / (Bi × Ci) < T
• Delay Threshold: Di
  – consider a flow that has waited Di active
Dynamic Control
• Flow specifies delay requirement D
• Measure context-switch overhead offline
• Record average flow runtime
• Set E based on workload
• Calculate batch level B for the flow
40
Packet Trace
[Figure: per-packet service time (µs, 0 to 5500) across the trace (0 to 55); curves for WFQ and for the batching throttle at T=10% D=1000, T=25% D=1000, T=10% D=500, and T=10% D=300]