Programming The Network Data Plane - QCon

Programming The Network Data Plane, by Changhoon Kim

Transcript of Programming The Network Data Plane - QCon

Page 1

Programming The Network Data Plane

Changhoon Kim

Page 2

Beautiful ideas: What if you could…

• Realize a small but super-fast DNS cache
• Perform TCP SYN authentication for billions of SYNs per second
• Build a replicated key-value store ensuring RW ops in a few microseconds
• Improve your consensus service performance by ~100x
• Boost your Memcached cluster's throughput by ~10x
• Speed up your DNN training dramatically by realizing parameter servers

… using switches in your network?

Page 3

You couldn't do any of those so far because…

• No DIY – you must work with vendors at the feature level
• Excruciatingly complicated and involved process to build consensus and pressure for features
• Painfully long and unpredictable lead time
• To use new features, you must get new switches
• What you finally get != what you asked for

Page 4

This is very unnatural to developers

• Because you all know how to realize your own ideas by "programming" CPUs
  – Programs used in every phase (implement, test, and deploy)
  – Extremely fast iteration and differentiation
  – You own your own ideas
  – A sustainable ecosystem where all participants benefit

Can we replicate this healthy, sustainable ecosystem for networking?

Page 5

Reality: Packet forwarding speeds

[Chart: per-chip forwarding speed in Gb/s, 1990-2020, log scale from 0.1 to 100,000; switch chips now reach 6.4 Tb/s per chip, while CPUs lag far behind.]

Page 6

Reality: Packet forwarding speeds

[Same chart, annotated: at 6.4 Tb/s, a switch chip forwards packets roughly 80x faster than a CPU.]

Page 7

What does a typical switch look like?

[Diagram: the control plane is a Switch OS (a Linux variant) running protocol daemons (BGP, OSPF, etc.) and other management apps over a run-time API and chip driver; it connects over PCIe to the data plane, a switching chip with L2 forwarding, L3 routing, and ACL tables that packets flow through.]

A switch is just a Linux box with a high-speed switching chip. The Switch OS is just software: you can freely change it. The chip is fixed-function hardware: there's nothing you can change there.

Page 8

Networking systems have been built "bottom-up"

[Diagram: a fixed-function switch dictates to the Switch OS, through its API, "This is roughly how I process packets…", expressed in English.]

Page 9

Turning the tables "top-down"

[Diagram: the Switch OS dictates to a programmable switch, through its API, "This is precisely how you must process packets", expressed in P4.]

Page 10

"Programmable switches are 10-100x slower than fixed-function switches. They cost more and consume more power."

– Conventional wisdom in networking

Page 11

Evidence: Tofino, a 6.5 Tb/s switch (arrived Dec 2016)

The world's fastest and most programmable switch. No power or cost penalty compared to fixed-function switches. An incarnation of PISA (Protocol Independent Switch Architecture).

Page 12

Domain-specific processors (language >>> compiler >>> processor)

Computers: Java compiler >>> CPU
Graphics: OpenCL compiler >>> GPU
Signal processing: Matlab compiler >>> DSP
Machine learning: TensorFlow compiler >>> TPU
Networking: ? >>> ?

Page 13

Domain-specific processors (language >>> compiler >>> processor)

Computers: Java compiler >>> CPU
Graphics: OpenCL compiler >>> GPU
Signal processing: Matlab compiler >>> DSP
Machine learning: TensorFlow compiler >>> TPU
Networking: P4 compiler >>> PISA

Page 14

PISA: An architecture for high-speed programmable packet forwarding


Page 15

PISA: Protocol Independent Switch Architecture

[Diagram: a programmable parser feeding a pipeline of match memory and action ALUs.]

Page 16

PISA: Protocol Independent Switch Architecture

[Diagram: a programmable parser, ingress match-action stages, a buffer, and egress match-action stages.]

Page 17

PISA: Protocol Independent Switch Architecture

[Diagram: a programmable parser, ingress match-action stages (pre-switching), a buffer, and egress match-action stages (post-switching), plus recirculation, a programmable packet generator, and a CPU for the control plane.]

Match Logic: a mix of SRAM and TCAM for lookup tables, counters, meters, and generic hash tables.
Action Logic: ALUs for standard boolean and arithmetic operations, header modification operations, hashing operations, etc.

A generalization of RMT [SIGCOMM '13].

Page 18

Why we call it protocol-independent packet processing

Page 19

Logical Data-plane View (your P4 program) vs. Switch Pipeline

The device does not understand any protocols until it gets programmed.

[Diagram: packets enter a programmable parser, then traverse match tables (L2, IPv4, IPv6, ACL), each paired with action ALUs and a fixed action, then the queues; the whole pipeline runs on one clock.]

Page 20

Mapping logical data-plane design to physical resources

[Diagram: the logical L2, IPv4, IPv6, and ACL tables of your P4 program are mapped onto the switch pipeline's physical match tables and action macros (L2, v4, v6, ACL), between the programmable parser and the queues.]

Page 21

Re-program in the field

[Diagram: adding a MyEncap header to the logical design re-maps the pipeline: the parser plus the L2, IPv4, IPv6, ACL, and MyEncap tables and their action macros are re-allocated across the physical stages.]

Page 22

P4: Programming Protocol-Independent Packet Processors

Pat Bosshart (Barefoot Networks), Dan Daly (Intel), Glen Gibb (Barefoot Networks), Martin Izzard (Barefoot Networks), Nick McKeown (Stanford University), Jennifer Rexford (Princeton University), Cole Schlesinger (Princeton University), Dan Talayco (Barefoot Networks), Amin Vahdat (Google), George Varghese (Microsoft Research), David Walker (Princeton University)
ABSTRACT

P4 is a high-level language for programming protocol-independent packet processors. P4 works in conjunction with SDN control protocols like OpenFlow. In its current form, OpenFlow explicitly specifies protocol headers on which it operates. This set has grown from 12 to 41 fields in a few years, increasing the complexity of the specification while still not providing the flexibility to add new headers. In this paper we propose P4 as a strawman proposal for how OpenFlow should evolve in the future. We have three goals: (1) Reconfigurability in the field: programmers should be able to change the way switches process packets once they are deployed. (2) Protocol independence: switches should not be tied to any specific network protocols. (3) Target independence: programmers should be able to describe packet-processing functionality independently of the specifics of the underlying hardware. As an example, we describe how to use P4 to configure a switch to add a new hierarchical label.

1. INTRODUCTION

Software-Defined Networking (SDN) gives operators programmatic control over their networks. In SDN, the control plane is physically separate from the forwarding plane, and one control plane controls multiple forwarding devices. While forwarding devices could be programmed in many ways, having a common, open, vendor-agnostic interface (like OpenFlow) enables a control plane to control forwarding devices from different hardware and software vendors.

Table 1: Fields recognized by the OpenFlow standard

Version   Date       Header Fields
OF 1.0    Dec 2009   12 fields (Ethernet, TCP/IPv4)
OF 1.1    Feb 2011   15 fields (MPLS, inter-table metadata)
OF 1.2    Dec 2011   36 fields (ARP, ICMP, IPv6, etc.)
OF 1.3    Jun 2012   40 fields
OF 1.4    Oct 2013   41 fields

The OpenFlow interface started simple, with the abstraction of a single table of rules that could match packets on a dozen header fields (e.g., MAC addresses, IP addresses, protocol, TCP/UDP port numbers, etc.). Over the past five years, the specification has grown increasingly more complicated (see Table 1), with many more header fields and multiple stages of rule tables, to allow switches to expose more of their capabilities to the controller.

The proliferation of new header fields shows no signs of stopping. For example, data-center network operators increasingly want to apply new forms of packet encapsulation (e.g., NVGRE, VXLAN, and STT), for which they resort to deploying software switches that are easier to extend with new functionality. Rather than repeatedly extending the OpenFlow specification, we argue that future switches should support flexible mechanisms for parsing packets and matching header fields, allowing controller applications to leverage these capabilities through a common, open interface (i.e., a new "OpenFlow 2.0" API). Such a general, extensible approach would be simpler, more elegant, and more future-proof than today's OpenFlow 1.x standard.

Figure 1: P4 is a language to configure switches.

Recent chip designs demonstrate that such flexibility can be achieved in custom ASICs at terabit speeds [1, 2, 3]. Programming this new generation of switch chips is far from easy. Each chip has its own low-level interface, akin to microcode programming. In this paper, we sketch the design of a higher-level language for Programming Protocol-independent Packet Processors (P4). Figure 1 shows the relationship between P4 (used to configure a switch, telling it how packets are to be processed) and existing APIs (such as OpenFlow) that are designed to populate the forwarding tables in fixed-function switches. P4 raises the level of abstraction for programming the network, and can serve as a…

(ACM SIGCOMM Computer Communication Review, Volume 44, Number 3, July 2014, page 88)

Page 23

What does a P4 program look like?

[Slide graphics: the logical pipeline (L2, IPv4, IPv6, ACL, MyEncap) and the parse graph: Eth → MyEncap → IPv4 or IPv6 → TCP.]

header_type ethernet_t {
    fields {
        dstAddr : 48;
        srcAddr : 48;
        etherType : 16;
    }
}

header_type my_encap_t {
    fields {
        foo : 12;
        bar : 8;
        baz : 4;
        qux : 4;
        next_protocol : 4;
    }
}

parser parse_ethernet {
    extract(ethernet);
    return select(latest.etherType) {
        0x8100 : parse_vlan;
        0x800  : parse_ipv4;
        0x86DD : parse_ipv6;
    }
}
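The slide defines my_encap_t but stops before its parser state. For completeness, such a state might look like the sketch below; the state name, the header instance declaration, and the next_protocol code points are illustrative assumptions, not from the talk.

/* Hypothetical parser state for my_encap_t; the next_protocol
   code points are assumed for illustration. */
header my_encap_t my_encap;

parser parse_my_encap {
    extract(my_encap);
    return select(latest.next_protocol) {
        0x4 : parse_ipv4;    /* assumed code point for IPv4 */
        0x6 : parse_ipv6;    /* assumed code point for IPv6 */
        default : ingress;   /* stop parsing; start match-action */
    }
}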


Page 24: Programming The Network Data Plane - QCon · Buffer M M 17 e r PISA: Protocol Independent Switch Architecture Match Logic (Mix of SRAM and TCAM for lookup tables, counters, meters,

What does a P4 program look like?

[Slide graphic: the logical pipeline (L2, IPv4, IPv6, ACL, MyEncap).]

table ipv4_lpm {
    reads {
        ipv4.dstAddr : lpm;
    }
    actions {
        set_next_hop;
        drop;
    }
}

action set_next_hop(nhop_ipv4_addr, port) {
    modify_field(metadata.nhop_ipv4_addr, nhop_ipv4_addr);
    modify_field(standard_metadata.egress_port, port);
    add_to_field(ipv4.ttl, -1);
}

control ingress {
    apply(l2);
    apply(my_encap);
    if (valid(ipv4)) {
        apply(ipv4_lpm);
    } else {
        apply(ipv6_lpm);
    }
    apply(acl);
}
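The control block also applies an ipv6_lpm table that the slides never define. By analogy with ipv4_lpm, a minimal sketch might look like this; set_next_hop_v6 is a hypothetical action, not shown in the talk.

table ipv6_lpm {
    reads {
        ipv6.dstAddr : lpm;
    }
    actions {
        set_next_hop_v6;   /* hypothetical IPv6 analogue of set_next_hop */
        drop;
    }
}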


Page 25

P4.org (http://p4.org)

§ An open-source community to nurture the language
§ Open-source software, under the Apache license
§ A common language: P4_16
§ Support for various types of devices and targets
§ Enables a wealth of innovation
  § Diverse "apps" (including proprietary ones) running on commodity targets
§ No barrier to entry
  § Free of membership fees, free of commitment, and simple licensing

Page 26

So, what kinds of exciting new opportunities are arising?

Page 27

The network should answer these questions:

1. "Which path did my packet take?"
2. "Which rules did my packet follow?"
3. "How long did it queue at each switch?"
4. "Who did it share the queues with?"

PISA + P4 can answer all four questions for the first time. At full line rate. Without generating any additional packets!

Page 28

In-band Network Telemetry (INT): a read-only version of Tiny Packet Programs [SIGCOMM '14]

[Diagram: as the original packet transits each switch, the switch adds metadata: switch ID, arrival time, queue delay, matched rules, …; the collected telemetry is then logged, analyzed, replayed, and visualized.]
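To give a flavor of how a PISA device stamps telemetry into packets in flight, here is a minimal P4-14-style sketch. The header layout, the action, and the queueing-delay metadata field are illustrative assumptions; the real INT specification defines its own formats and field names.

/* Illustrative INT-style metadata header; names and widths are
   assumptions for this sketch, not the published INT format. */
header_type int_metadata_t {
    fields {
        switch_id   : 32;
        queue_delay : 32;
    }
}
header int_metadata_t int_md;

action add_int_metadata(switch_id) {
    add_header(int_md);                        /* append telemetry header */
    modify_field(int_md.switch_id, switch_id); /* set from a table entry */
    /* queueing_latency stands in for the target's intrinsic
       queueing-delay metadata; actual field names vary by target. */
    modify_field(int_md.queue_delay, intrinsic_metadata.queueing_latency);
}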

Page 29

A quick demo of INT!

Page 30


Page 31

What does this mean to you?

• Improve your distributed apps' performance with telemetry data
• Ask the four key questions about your packets to network admins or cloud providers
• Huge opportunities for big-data processing and machine-learning experts
• A "self-driving" network is not hyperbole

Page 32

PISA: An architecture for high-speed programmable packet forwarding and event processing

Page 33

What we have seen so far: adding new networking features

1. New encapsulations and tunnels
2. New ways to tag packets for special treatment
3. New approaches to routing, e.g., source routing in data-center networks
4. New approaches to congestion control
5. New ways to manipulate and forward packets, e.g., splitting ticker symbols for high-frequency trading

Page 34

What we have seen so far: the world's fastest middleboxes

1. Layer-4 connection load balancing at Tb/s
   – Replace hundreds of servers or tens of dedicated appliances with one PISA switch
   – Track and maintain mappings for 5-10 million HTTP connections
2. Stateless firewall or DDoS detector
   – Add/delete and track hundreds of thousands of new connections per second
   – Include other stateless line-rate functions (e.g., TCP SYN authentication, sketches, or Bloom-filter-based whitelisting), as sketched below
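As a flavor of such stateless line-rate functions, this sketch checks one cell of a Bloom-filter whitelist kept in switch SRAM. It is a minimal sketch: the register size, hash algorithm, and metadata names are assumptions, and a real filter would combine several independent hashes.

/* Sketch: one-hash Bloom-filter whitelist lookup; sizes and
   the crc16 hash are assumptions for illustration. */
header_type bf_metadata_t {
    fields {
        bf_index : 16;
        bf_hit   : 1;
    }
}
metadata bf_metadata_t bf_md;

register bloom_filter {
    width : 1;
    instance_count : 65536;
}

field_list flow_fields {
    ipv4.srcAddr;
    ipv4.dstAddr;
}

field_list_calculation flow_hash {
    input { flow_fields; }
    algorithm : crc16;
    output_width : 16;
}

action check_whitelist() {
    /* hash the flow into an index, then read that filter cell */
    modify_field_with_hash_based_offset(bf_md.bf_index, 0, flow_hash, 65536);
    register_read(bf_md.bf_hit, bloom_filter, bf_md.bf_index);
}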


Page 35

What we have seen so far: offloading part of computing to the network

1. DNS cache
2. Key-value cache [ACM SOSP '17]
3. Chain replication
4. Paxos [ACM CCR '16] and Raft
5. Parameter service for DNN training

Page 36

Example: NetCache

[Diagram: clients query a key-value storage rack; the ToR switch data plane hosts a key-value cache and query statistics alongside L2/L3 routing, with a controller and high-performance storage servers behind it.]

• Non-goal
  – Maximize the cache hit rate
• Goal
  – Balance the workloads of backend servers by serving only O(N log N) hot items, where N is the number of backend servers
  – Make the "fast, small cache" theory viable for modern in-memory KV servers [Fan et al., SoCC '11]
• Data plane (see the sketch below)
  – Unmodified routing
  – Key-value cache built with on-chip SRAM
  – Query statistics to detect hot items
• Control plane
  – Update the cache with hot items to handle dynamic workloads
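A greatly simplified flavor of the on-chip cache: index a register array with a hash of the key and read the value at line rate. The real NetCache design spreads values across multiple register arrays and stages; the header fields, sizes, and names below are assumptions for illustration only.

/* Sketch: tiny on-chip value store for a key-value cache.
   netcache.key and netcache.value are hypothetical header fields. */
header_type nc_metadata_t {
    fields {
        index : 16;
    }
}
metadata nc_metadata_t nc_md;

register value_store {
    width : 32;            /* real values span several arrays/stages */
    instance_count : 65536;
}

field_list key_field {
    netcache.key;
}

field_list_calculation key_hash {
    input { key_field; }
    algorithm : crc32;
    output_width : 16;
}

action read_cached_value() {
    /* hash the key into an index, then fetch the cached value */
    modify_field_with_hash_based_offset(nc_md.index, 0, key_hash, 65536);
    register_read(netcache.value, value_store, nc_md.index);
}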

Page 37

The "boring life" of a NetCache switch

(The slide shows a page from the NetCache paper's evaluation section:)

…test the switch performance at full traffic load. The value process is executed each time the packet passes an egress port. To keep the packet size from growing for read queries, we remove the value field at the last egress stage for all intermediate ports. The servers can still verify the values as they are kept in the two ports connected to them.

• Server rotation for static workloads (§6.3). We use one machine as a client, and the other as a storage server. We install the hot items in the switch cache as for a full storage rack and have the client send traffic according to a Zipf distribution. For each experiment, the storage server takes one key-value partition and runs as one node in the rack. By rotating the storage server through all 128 partitions (i.e., performing the experiment 128 times), we aggregate the results to obtain the result for the entire rack. Such result aggregation is justified by (i) the shared-nothing architecture of key-value stores and (ii) the microbenchmark that demonstrates the switch is not the bottleneck.

To find the maximum effective system throughput, we first find the bottleneck partition and use that server in the first iteration. The client generates queries destined to this particular partition, and adjusts its sending rate to keep the packet loss rate between 0.5% and 1%. This sending rate gives the saturated throughput of the bottleneck partition. We obtain the traffic load for the full system based on this sending rate, and use this load to generate per-partition query load for the remaining partitions. Since the remaining partitions are not the bottleneck partition, they should be able to fully serve the load. We sum up the throughputs of all partitions to obtain the aggregate system throughput.

• Server emulation for dynamic workloads (§6.4). Server rotation is not suitable for evaluating dynamic workloads, because we would like to measure the transient behavior of the system, i.e., how the system performance fluctuates during cache updates, rather than the system performance in the stable state. To do this, we emulate 128 storage servers on one server by using 128 queues. Each queue processes queries for one key-value partition and drops queries if the received queries exceed its processing rate. To evaluate the real-time system throughput, the client tracks the packet loss rate and adjusts its sending rate to keep the loss rate between 0.5% and 1%. The aggregate throughput is scaled down by a factor of 128. Such emulation is reasonable because in these experiments we are more interested in the relative performance fluctuations as NetCache reacts to workload changes than in the absolute performance numbers.

6.2 Switch Microbenchmark

We first show switch microbenchmark results using the snake test (as described in §6.1). We demonstrate that NetCache is able to run on programmable switches at line rate.

Throughput vs. value size. We populate the switch cache with 64K items and vary the value size. Two servers and one switch are organized in a snake structure. The switch is configured to provide 62 100 Gbps ports, and two 40 Gbps ports to connect the servers. We let the two servers send cache read and update queries to each other and measure the maximum throughput. Figure 9(a) shows the switch provides 2.24 BQPS throughput for value sizes up to 128 bytes. This is bottlenecked by the maximum sending rate of the servers (35 MQPS). The Barefoot Tofino switch is able to achieve more than 4 BQPS. The throughput is not affected by the value size or the read/update ratio, because the switch ASIC is designed to process packets with strict timing requirements. As long as our P4 program is compiled to fit the hardware resources, the data plane can process packets at line rate.

Our current prototype supports value sizes up to 128 bytes. Bigger values can be supported by using more stages or by using packet mirroring for a second round of processing (§4.4.2).

Throughput vs. cache size. We use 128 bytes as the value size and change the cache size. Other settings are the same as in the previous experiment. Similarly, Figure 9(b) shows that the throughput stays at 2.24 BQPS and is not affected by the cache size. Since our current implementation allocates 8 MB of memory for the cache, the cache size cannot be larger than 64K for 128-byte values. We note that caching 64K items is sufficient for balancing a key-value storage rack.

[Figure 9: Switch microbenchmark (read and update). (a) Throughput (BQPS) vs. value size (0-128 bytes). (b) Throughput (BQPS) vs. cache size (0-64K items). Both panels are flat at 2.24 BQPS.]

6.3 System Performance

We now present the system performance of a NetCache key-value storage rack that contains one switch and 128 storage servers, using server rotation (as described in §6.1).

Throughput. Figure 10(a) shows the system throughput under different skewness parameters with read-only queries and 10,000 items in the cache. We compare NetCache with NoCache, which does not have the switch cache. In addition, we also show the portions of the NetCache throughput provided by the cache and the storage servers, respectively. NoCache performs poorly when the workload is skewed. Specifically, with a Zipf 0.95 (0.99) distribution, the NoCache throughput drops to only 22.5% (15.6%) of the throughput under the uniform workload. By introducing only a small cache, NetCache effectively reduces the load imbalance and thus improves the throughput. Overall, NetCache improves the throughput by 3.6x, 6.5x, and 10x over NoCache, under Zipf 0.9, 0.95, and 0.99, respectively.

One can further increase the value sizes with more stages, recirculation, or mirroring.

Yes, it's billion queries per second, not a typo :)

Page 38

And its "not so boring" benefits

NetCache provides 3-10x throughput improvements.

[Chart: throughput (BQPS) of a key-value storage rack with one Tofino switch and 128 storage servers, under uniform, Zipf-0.9, Zipf-0.95, and Zipf-0.99 workload distributions; each bar is split into NoCache, NetCache (servers), and NetCache (cache).]

Page 39

NetCache is a key-value store that leverages in-network caching to achieve billions of queries per second at a few microseconds of latency, even under highly skewed and rapidly changing workloads.

Page 40

Summing it up…

Page 41

Why data-plane programming?

1. New features: realize your beautiful ideas very quickly
2. Reduce complexity: remove unnecessary features and tables
3. Efficient use of H/W resources: achieve the biggest bang for the buck
4. Greater visibility: new diagnostics, telemetry, OAM, etc.
5. Modularity: compose forwarding behavior from libraries
6. Portability: specify forwarding behavior once; compile to many devices
7. Own your own ideas: no need to share your ideas with others

"Protocols are being lifted off chips and into software" – Ben Horowitz

Page 42

My observations

• PISA and P4: the first attempt to define a machine architecture and programming models for networking in a disciplined way
• The network is becoming yet another programmable platform
• It's fun to figure out the best workloads for this new machine architecture

Page 43

Want to find more resources or follow up?

• Visit http://p4.org and http://github.com/p4lang
  – P4 language spec
  – P4 dev tools and sample programs
  – P4 tutorials
• Join P4 workshops and P4 developers' days
• Participate in P4 working-group activities
  – Language, target architecture, runtime API, applications
• More expertise is needed across various fields of computer science
  – To enhance PISA, P4, and dev tools (e.g., for formal verification, equivalence checking, and much more…)

Page 44

Thanks. Let's develop your beautiful ideas in P4!