Download - Programming The Network Data Plane - QCon · Buffer M M 17 e r PISA: Protocol Independent Switch Architecture Match Logic (Mix of SRAM and TCAM for lookup tables, counters, meters,

ProgrammingTheNetworkDataPlane

ChanghoonKim

Beautifulideas:Whatifyoucould…• Realizeasmall,butsuper-fastDNScache• PerformTCPSYNauthenticationforbillionsofSYNspersec• Buildareplicatedkey-valuestoreensuringRWopsinafewusecs• Improveyourconsensusserviceperformanceby~100x• BoostyourMemcached cluster’sthroughputby~10x• SpeedupyourDNNtrainingdramaticallybyrealizingparameter

servers

2

… usingswitches inyournetwork?

Youcouldn’tdoanyofthosesofarbecause…

• NoDIY– mustworkwithvendorsatfeature level• Excruciatinglycomplicatedandinvolvedprocesstobuild

consensusandpressureforfeatures• Painfullylongandunpredictableleadtime• Tousenewfeatures,youmustgetnewswitches• Whatyoufinallyget!=whatyouaskedfor

3

Thisisveryunnaturaltodevelopers• Becauseyouallknowhowtorealizeyourownideasby

“programming”CPUs– Programsusedineveryphase(implement,test,anddeploy)– Extremelyfastiterationanddifferentiation– Youownyourownideas– Asustainableecosystemwhereallparticipantsbenefit

4

Canwereplicatethishealthy,sustainableecosystemfornetworking?

Reality:Packetforwardingspeeds

0.1

1

10

100

1000

10000

100000

1990 1995 2000 2005 2010 2015 2020

Switch ChipCPU

5

Gb/s(perchip)

6.4Tb/s

Reality:Packetforwardingspeeds

0.1

1

10

100

1000

10000

100000

1990 1995 2000 2005 2010 2015 2020

Switch ChipCPU

6

80xGb/s(perchip)

6.4Tb/s

7

Whatdoesatypicalswitchlooklike?

Chip Driver

Run-time API

Protocol Daemons(BGP, OSPF, etc.)

Other MgmtApps

Data plane

Control plane

SwitchOS(Linuxvariant)

PCIe

…

L2 Forwarding

Table

L3 Routing

Table

ACLTable

…

A switch is just a Linux box with a high-speed switching chip

packets packets

JustS/W-- Youcanfreelychangethis

Fixed-functionH/W-- There’snothing

youcanchangehere

Networkingsystemshavebeenbuilt“bottoms-up”

SwitchOS

“ThisisroughlyhowIprocesspackets…”

Fixed-functionswitch

inEnglish

API

Turningthetables“top-down”

SwitchOS

“Thisispreciselyhowyoumustprocesspackets”

ProgrammableSwitch

inP4

API

“Programmableswitchesare10-100xslowerthanfixed-functionswitches.Theycostmore

andconsumemorepower.”Conventionalwisdominnetworking

Evidence:Tofino6.5Tb/sswitch(arrivedDec2016)

Theworld’sfastestand mostprogrammableswitch.Nopower,cost,orpowerpenaltycomparedtofixed-functionswitches.AnincarnationofPISA(ProtocolIndependentSwitchArchitecture)

Domain-specificprocessors

CPU

Computers

JavaCompiler

GPU

Graphics

OpenCLCompiler

DSP

SignalProcessing

MatlabCompiler

MachineLearning

?

TPU

TensorFlow

Compiler

Networking

?

LanguageCompiler>>>

CPU

Computers

JavaCompiler

GPU

Graphics

OpenCLCompiler

DSP

SignalProcessing

MatlabCompiler

MachineLearning

?

TPU

TensorFlow

Compiler

PISA

Networking

P4Compiler>>>

Domain-specificprocessors

PISA: An architecture for high-speed programmable packet forwarding

14

15

Prog

ram

mab

lePa

rser

MatchMemory

ActionALU

PISA: Protocol Independent Switch Architecture

16

Prog

ram

mab

lePa

rser

PISA: Protocol Independent Switch Architecture

Ingress EgressBuffer

BufferM M

17

Prog

ram

mab

lePa

rser

PISA: Protocol Independent Switch ArchitectureMatch Logic(Mix of SRAM and TCAM for lookup tables, counters, meters, generic hash tables)

Action Logic(ALUs for standard boolean and arithmetic operations, header modification operations, hashing operations, etc.)

Recirculation

ProgrammablePacketGenerator

CPU (Control plane)

A

…

A

…

Ingress match-action stages (pre-switching) Egress match-action stages (post-switching)

Generalization of RMT [sigcomm’13]

Why we call itprotocol-independent packet

processing

18

Logical Data-plane View(your P4 program)Switch Pipeline

Device does not understand any protocols until it gets programmed

Queues

Prog

ram

mab

lePa

rser

Fixe

d Ac

tion

Mat

ch T

able

Mat

ch T

able

Mat

ch T

able

Mat

ch T

able

L2IPv4

IPv6ACL

Actio

n AL

Us

Actio

n AL

Us

Actio

n AL

Us

Actio

n AL

Uspacketpacket packetpacket

CLK19

Mat

ch T

able

Actio

n AL

Us

Mapping logical data-plane design to physical resources

Queues

Mat

ch T

able

Mat

ch T

able

Mat

ch T

able

L2 T

able

IPv4

Tab

le

IPv6

Tab

le

ACL

Tabl

e

Actio

n AL

Us

Actio

n AL

Us

Actio

n AL

Us

L2IPv4

IPv6ACL

Logical Data-plane View (your P4 program)Switch Pipeline

L2IPv6

ACLIPv4

L2 A

ctio

n M

acro

v4 A

ctio

n M

acro

v6 A

ctio

n M

acro

ACL

Actio

n M

acro

Prog

ram

mab

lePa

rser

CLK20

Re-program in the field

L2 T

able

IPv4

Tab

le

ACL

Tabl

e

IPv6

Tab

le

My

Enca

p

L2IPv4

IPv6ACLMyEncap

L2 A

ctio

n M

acro

v4 A

ctio

n M

acro

ACL

Actio

n M

acro

Actio

n

MyEncap

v6 A

ctio

n M

acro

IPv4

Actio

n

IPv4

Actio

n

IPv6

Actio

n

IPv6

Prog

ram

mab

lePa

rser

CLK

Logical Data-plane View (your P4 program)Switch Pipeline

Queues

21

P

4

:

P

r

o

g

r

a

m

m

i

n

g

P

r

o

t

o

c

o

l

-

I

n

d

e

p

e

n

d

e

n

t

P

a

c

k

e

t

P

r

o

c

e

s

s

o

r

s

P

a

t

B

o

s

s

h

a

r

t

†

,

D

a

n

D

a

l

y

*,

G

l

e

n

G

i

b

b

†

,

M

a

r

t

i

n

I

z

z

a

r

d

†

,

N

i

c

k

M

c

K

e

o

w

n

‡

,

J

e

n

n

i

f

e

r

R

e

x

f

o

r

d

**,

C

o

l

e

S

c

h

l

e

s

i

n

g

e

r

**,

D

a

n

T

a

l

a

y

c

o

†

,

A

m

i

n

V

a

h

d

a

t

¶

,

G

e

o

r

g

e

V

a

r

g

h

e

s

e

§

,

D

a

v

i

d

W

a

l

k

e

r

**

†

B

a

r

e

f

o

o

t

N

e

t

w

o

r

k

s *I

n

t

e

l

‡

S

t

a

n

f

o

r

d

U

n

i

v

e

r

s

i

t

y **P

r

i

n

c

e

t

o

n

U

n

i

v

e

r

s

i

t

y ¶

G

o

o

g

l

e §

M

i

c

r

o

s

o

f

t

R

e

s

e

a

r

c

h

ABSTRACTP4 is a high-level language for programming protocol-inde-

pendent packet processors. P4 works in conjunction with

SDN control protocols like OpenFlow. In its current form,

OpenFlow explicitly specifies protocol headers on which it

operates. This set has grown from 12 to 41 fields in a few

years, increasing the complexity of the specification while

still not providing the flexibility to add new headers. In this

paper we propose P4 as a strawman proposal for how Open-

Flow should evolve in the future. We have three goals: (1)

Reconfigurability in the field: Programmers should be able

to change the way switches process packets once they are

deployed. (2) Protocol independence: Switches should not

be tied to any specific network protocols. (3) Target inde-

pendence: Programmers should be able to describe packet-

processing functionality independently of the specifics of the

underlying hardware. As an example, we describe how to

use P4 to configure a switch to add a new hierarchical label.

1. INTRODUCTIONSoftware-Defined Networking (SDN) gives operators pro-

grammatic control over their networks. In SDN, the con-

trol plane is physically separate from the forwarding plane,

and one control plane controls multiple forwarding devices.

While forwarding devices could be programmed in many

ways, having a common, open, vendor-agnostic interface

(like OpenFlow) enables a control plane to control forward-

ing devices from di↵erent hardware and software vendors.

VersionDate

Header Fields

OF 1.0Dec 2009 12 fields (Ethernet, TCP/IPv4)

OF 1.1Feb 2011 15 fields (MPLS, inter-table metadata)

OF 1.2Dec 2011 36 fields (ARP, ICMP, IPv6, etc.)

OF 1.3Jun 2012 40 fields

OF 1.4Oct 2013 41 fields

Table 1: Fields recognized by the OpenFlow standard

The OpenFlow interface started simple, with the abstrac-

tion of a single table of rules that could match packets on a

dozen header fields (e.g., MAC addresses, IP addresses, pro-

tocol, TCP/UDP port numbers, etc.). Over the past five

years, the specification has grown increasingly more com-

plicated (see Table 1), with many more header fields and

multiple stages of rule tables, to allow switches to expose

more of their capabilities to the controller.

The proliferation of new header fields shows no signs of

stopping. For example, data-center network operators in-

creasingly want to apply new forms of packet encapsula-

tion (e.g., NVGRE, VXLAN, and STT), for which they re-

sort to deploying software switches that are easier to extend

with new functionality. Rather than repeatedly extending

the OpenFlow specification, we argue that future switches

should support flexible mechanisms for parsing packets and

matching header fields, allowing controller applications to

leverage these capabilities through a common, open inter-

face (i.e., a new “OpenFlow 2.0” API). Such a general, ex-

tensible approach would be simpler, more elegant, and more

future-proof than today’s OpenFlow 1.x standard.

Figure 1: P4 is a language to configure switches.

Recent chip designs demonstrate that such flexibility can

be achieved in custom ASICs at terabit speeds [1, 2, 3]. Pro-

gramming this new generation of switch chips is far from

easy. Each chip has its own low-level interface, akin to

microcode programming. In this paper, we sketch the de-

sign of a higher-level language for Programming Protocol-

independent Packet Processors (P4). Figure 1 shows the

relationship between P4—used to configure a switch, telling

it how packets are to be processed—and existing APIs (such

as OpenFlow) that are designed to populate the forwarding

tables in fixed function switches. P4 raises the level of ab-

straction for programming the network, and can serve as a

ACM SIGCOMM Computer Communication Review88

Volume 44, Number 3, July 2014

What does a P4 program look like?

L2IPv4

ACLMyEncapIPv6

header_type ethernet_t {fields {

dstAddr : 48;srcAddr : 48;etherType : 16;

}}

parser parse_ethernet {extract(ethernet);return select(latest.etherType) {

0x8100 : parse_vlan;0x800 : parse_ipv4;0x86DD : parse_ipv6;

}}

TCP

IPv4 IPv6

MyEncapEth

header_type my_encap_t {fields {

foo : 12;bar : 8;baz : 4;qux : 4; next_protocol : 4;

}}

23

What does a P4 program look like?

L2IPv4

ACLMyEncapIPv6

table ipv4_lpm {

reads {ipv4.dstAddr : lpm;

}actions {

set_next_hop; drop; }

}

action set_next_hop(nhop_ipv4_addr, port) {

modify_field(metadata.nhop_ipv4_addr, nhop_ipv4_addr); modify_field(standard_metadata.egress_port, port); add_to_field(ipv4.ttl, -1);

}

control ingress {

apply(l2);apply(my_encap);if (valid(ipv4) {

apply(ipv4_lpm);} else {

apply(ipv6_lpm);}apply(acl);

}

24

P4.org (http://p4.org)§ Open-source community to nurture the language

§ Open-source software – Apache license§ A common language: P416

§ Support for various types of devices and targets§ Enable a wealth of innovation

§ Diverse “apps” (including proprietary ones) running on commodity targets

§ With no barrier to entry§ Free of membership fee, free of commitment, and simple licensing

25

So,whatkindsofexcitingnewopportunitiesarearising?

26

Thenetworkshouldanswerthesequestions

1. “Whichpathdidmypackettake?”2. “Whichrulesdidmypacketfollow?”3. “Howlongdiditqueueateachswitch?”4. “Whodiditsharethequeueswith?”

PISA+P4cananswerallfourquestionsforthefirsttime.Atfulllinerate.Withoutgeneratinganyadditionalpackets!

1234

Log,AnalyzeReplay

Add:SwitchID,ArrivalTime,QueueDelay,MatchedRules,…

OriginalPacket

Visualize

In-bandNetworkTelemetry(INT)Aread-onlyversionofTinyPacketPrograms[sigcomm’14]

AquickdemoofINT!

29

Whatdoesthismeantoyou?• Improveyourdistributedapps’performancewithtelemetrydata

• Askthefourkeyquestionsregardingyourpacketstonetworkadminsorcloudproviders

• HugeopportunitiesforBig-dataprocessingandmachine-learningexperts

• “Self-driving”networkisnothyperbole

31

PISA: An architecture for high-speed programmable packet forwarding

32

event processing

Whatwehaveseensofar:Addingnewnetworkingfeatures

1. Newencapsulationsandtunnels2. Newwaystotagpacketsforspecialtreatment3. Newapproachestorouting:e.g.,sourceroutingindata-

centernetworks4. Newapproachestocongestioncontrol5. Newwaystomanipulateandforwardpackets:e.g.splitting

tickersymbolsforhigh-frequencytrading

33

Whatwehaveseensofar:World’sfastestmiddleboxes

1. Layer-4loadconnectionbalancingatTb/s– Replace100sofserversor10sofdedicatedapplianceswithonePISAswitch– Trackandmaintainmappingsfor5~10millionHTTPconnections

2. StatelessfirewallorDDoSdetector– Add/deleteandtrack100softhousandsofnewconnectionspersecond– Includeotherstatelessline-ratefunctions

(e.g.,TCPSYNauthentication,sketches,orBloomfilter-basedwhitelisting)

34

Whatwehaveseensofar:Offloadingpartofcomputingtonetwork

1. DNScache2. Key-valuecache[ACMSOSP’17]

3. Chainreplication4. Paxos [ACMCCR’16]andRAFT5. ParameterserviceforDNNtraining

35

Example:NetCache

Clients

Key-Value Cache

Query Statistics

High-performance Storage Servers

Key-Value Storage RackController

L2/L3 Routing

ToR Switch Data plane

• Non-goal– Maximizethecachehitrate

• Goal– Balancetheworkloadsofbackendserversbyserving

onlyO(NlogN)hotitems-- N isthenumberofbackendservers

– Makethe“fast,small-cache”theoryviableformodernin-memoryKVservers[Fanet.al.,SOCC’11]

• Dataplane– Unmodifiedrouting– Key-valuecachebuiltwithon-chipSRAM– Querystatisticstodetecthotitems

• Controlplane– Updatecachewithhotitemstohandledynamic

workloads

The“boringlife”ofaNetCache switch

test the switch performance at full traffic load. The valueprocess is executed each time when the packet passes anegress port. To avoid packet size keeps increasing for readqueries, we remove the value field at the last egress stagefor all intermediate ports. The servers can still verify thevalues as they are kept in the two ports connected to them.

• Server rotation for static workloads (§6.3). We use onemachine as a client, and the other as a storage server. Weinstall the hot items in the switch cache as for a full stor-age rack and have the client send traffic according to a Zipfdistribution. For each experiment, the storage server takesone key-value partition and runs as one node in the rack.By rotating the storage server for all 128 partitions (i.e.,performing the experiment for 128 times), we aggregatethe results to obtain the result for the entire rack. Suchresult aggregation is justified by (i) the shared-nothingarchitecture of key-value stores and (ii) the microbench-mark that demonstrates the switch is not the bottleneck.

To find the maximum effective system throughput, wefirst find the bottleneck partition and use that server in thefirst iteration. The client generates queries destined to thisparticular partition, and adjusts its sending rate to controlthe packet loss rate between 0.5% to 1%. This sending rategives the saturated throughput of the bottleneck partition.We obtain the traffic load for the full system based on thissending rate, and use this load to generate per-partitionquery load for remaining partitions. Since the remainingpartitions are not the bottleneck partition, they should beable to fully serve the load. We sum up the throughputs ofall partitions to obtain the aggregate system throughput.

• Server emulation for dynamic workloads (§6.4). Serverrotation is not suitable for evaluating dynamic workloads.This is because we would like to measure the transient be-havior of the system, i.e., how the system performancefluctuates during cache updates, rather than the systemperformance at the stable state. To do this, we emulate128 storage servers on one server by using 128 queues.Each queue processes queries for one key-value partitionand drops queries if the received queries exceed its pro-cessing rate. To evaluate the real-time system throughput,the client tracks the packet loss rate, and adjusts its send-ing rate to keep the loss rate between 0.5% to 1%. Theaggregate throughput is scaled down by a factor of 128.Such emulation is reasonable because in these experimentswe are more interested in the relative performance fluctu-ations when NetCache reacts to workload changes, ratherthan the absolute performance numbers.

6.2 Switch MicrobenchmarkWe first show switch microbenchmark results using snake

test (as described in §6.1). We demonstrate that NetCache isable to run on programmable switches at line rate.

Throughput vs. value size. We populate the switch cachewith 64K items and vary the value size. Two servers and

0 32 64 96 1289alue 6ize (Byte)

0.0

0.5

1.0

1.5

2.0

2.5

Thro

ughS

ut (B

43

6)

(a) Throughput vs. value size. (b) Throughput vs. cache size.

Figure 9: Switch microbenchmark (read and update).one switch are organized to a snake structure. The switchis configured to provide 62 100Gbps ports, and two 40Gbpsports to connect servers. We let the two servers send cacheread and update queries to each other and measure the maxi-mum throughput. Figure 9(a) shows the switch provides 2.24BQPS throughput for value size up to 128 bytes. This is bot-tlenecked by the maximum sending rate of the servers (35MQPS). The Barefoot Tofino switch is able to achieve morethan 4 BQPS. The throughput is not affected by the value sizeor the read/update ratio. This is because the switch ASIC isdesigned to process packets with strict timing requirements.As long as our P4 program is complied to fit the hardwareresources, the data plane can process packets at line rate.

Our current prototype supports value size up to 128 bytes.Bigger values can be supported by using more stages or usingpacket mirroring for a second round of process (§4.4.2).

Throughput vs. cache size. We use 128 bytes as the valuesize and change the cache size. Other settings are the sameas the previous experiment. Similarly, Figure 9(b) shows thatthe throughput keeps at 2.24 BQPS and is not affected by thecache size. Since our current implementation allocates 8 MBmemory for the cache, the cache size cannot be larger than64K for 128-byte values. We note that caching 64K items issufficient for balancing a key-value storage rack.

6.3 System PerformanceWe now present the system performance of a NetCache

key-value storage rack that contains one switch and 128 stor-age servers using server rotation (as described in §6.1).

Throughput. Figure 10(a) shows the system throughput un-der different skewness parameters with read-only queries and10,000 items in the cache. We compare NetCache with No-Cache which does not have the switch cache. In addition,we also show the the portions of the NetCache throughputprovided by the cache and the storage servers respectively.NoCache performs poorly when the workload is skewed.Specifically, with Zipf 0.95 (0.99) distribution, the NoCachethroughput drops down to only 22.5% (15.6%), compared tothe throughput under the uniform workload. By introducingonly a small cache, NetCache effectively reduces the loadimbalances and thus improves the throughput. Overall, Net-Cache improves the throughput by 3.6⇥, 6.5⇥, and 10⇥ overNoCache, under Zipf 0.9, 0.95 and 0.99, respectively.

10

test the switch performance at full traffic load. The valueprocess is executed each time when the packet passes anegress port. To avoid packet size keeps increasing for readqueries, we remove the value field at the last egress stagefor all intermediate ports. The servers can still verify thevalues as they are kept in the two ports connected to them.

• Server rotation for static workloads (§6.3). We use onemachine as a client, and the other as a storage server. Weinstall the hot items in the switch cache as for a full stor-age rack and have the client send traffic according to a Zipfdistribution. For each experiment, the storage server takesone key-value partition and runs as one node in the rack.By rotating the storage server for all 128 partitions (i.e.,performing the experiment for 128 times), we aggregatethe results to obtain the result for the entire rack. Suchresult aggregation is justified by (i) the shared-nothingarchitecture of key-value stores and (ii) the microbench-mark that demonstrates the switch is not the bottleneck.

To find the maximum effective system throughput, wefirst find the bottleneck partition and use that server in thefirst iteration. The client generates queries destined to thisparticular partition, and adjusts its sending rate to controlthe packet loss rate between 0.5% to 1%. This sending rategives the saturated throughput of the bottleneck partition.We obtain the traffic load for the full system based on thissending rate, and use this load to generate per-partitionquery load for remaining partitions. Since the remainingpartitions are not the bottleneck partition, they should beable to fully serve the load. We sum up the throughputs ofall partitions to obtain the aggregate system throughput.

• Server emulation for dynamic workloads (§6.4). Serverrotation is not suitable for evaluating dynamic workloads.This is because we would like to measure the transient be-havior of the system, i.e., how the system performancefluctuates during cache updates, rather than the systemperformance at the stable state. To do this, we emulate128 storage servers on one server by using 128 queues.Each queue processes queries for one key-value partitionand drops queries if the received queries exceed its pro-cessing rate. To evaluate the real-time system throughput,the client tracks the packet loss rate, and adjusts its send-ing rate to keep the loss rate between 0.5% to 1%. Theaggregate throughput is scaled down by a factor of 128.Such emulation is reasonable because in these experimentswe are more interested in the relative performance fluctu-ations when NetCache reacts to workload changes, ratherthan the absolute performance numbers.

6.2 Switch MicrobenchmarkWe first show switch microbenchmark results using snake

test (as described in §6.1). We demonstrate that NetCache isable to run on programmable switches at line rate.

Throughput vs. value size. We populate the switch cachewith 64K items and vary the value size. Two servers and

(a) Throughput vs. value size.

0 16. 32. 48. 64.CacKe 6ize

0.0

0.5

1.0

1.5

2.0

2.5

TKro

ugKS

ut (B

43

6)

(b) Throughput vs. cache size.

Figure 9: Switch microbenchmark (read and update).one switch are organized to a snake structure. The switchis configured to provide 62 100Gbps ports, and two 40Gbpsports to connect servers. We let the two servers send cacheread and update queries to each other and measure the maxi-mum throughput. Figure 9(a) shows the switch provides 2.24BQPS throughput for value size up to 128 bytes. This is bot-tlenecked by the maximum sending rate of the servers (35MQPS). The Barefoot Tofino switch is able to achieve morethan 4 BQPS. The throughput is not affected by the value sizeor the read/update ratio. This is because the switch ASIC isdesigned to process packets with strict timing requirements.As long as our P4 program is complied to fit the hardwareresources, the data plane can process packets at line rate.

Our current prototype supports value size up to 128 bytes.Bigger values can be supported by using more stages or usingpacket mirroring for a second round of process (§4.4.2).

Throughput vs. cache size. We use 128 bytes as the valuesize and change the cache size. Other settings are the sameas the previous experiment. Similarly, Figure 9(b) shows thatthe throughput keeps at 2.24 BQPS and is not affected by thecache size. Since our current implementation allocates 8 MBmemory for the cache, the cache size cannot be larger than64K for 128-byte values. We note that caching 64K items issufficient for balancing a key-value storage rack.

6.3 System PerformanceWe now present the system performance of a NetCache

key-value storage rack that contains one switch and 128 stor-age servers using server rotation (as described in §6.1).

Throughput. Figure 10(a) shows the system throughput un-der different skewness parameters with read-only queries and10,000 items in the cache. We compare NetCache with No-Cache which does not have the switch cache. In addition,we also show the the portions of the NetCache throughputprovided by the cache and the storage servers respectively.NoCache performs poorly when the workload is skewed.Specifically, with Zipf 0.95 (0.99) distribution, the NoCachethroughput drops down to only 22.5% (15.6%), compared tothe throughput under the uniform workload. By introducingonly a small cache, NetCache effectively reduces the loadimbalances and thus improves the throughput. Overall, Net-Cache improves the throughput by 3.6⇥, 6.5⇥, and 10⇥ overNoCache, under Zipf 0.9, 0.95 and 0.99, respectively.

10

Onecanfurtherincreasethevaluesizeswithmorestages,recirculation,ormirroring.

Yes,it’sBillionQueriesPerSec,notatypoJ

Andits“notsoboring”benefits

NetCacheprovides3-10xthroughputimprovements.

Throughput of a key-value storage rack with one Tofino switch and 128 storage servers.

uQiforP ziSf-0.9 ziSf-0.95 ziSf-0.99WorNloDd DisWribuWioQ

0.0

0.5

1.0

1.5

2.0

Thro

ughS

uW (B

QP

S)

1oCDche 1eWCDche(servers) 1eWCDche(cDche)

NetCache is a key-value store that leverages

&Billions of queries/sec a few usec latency

even under

workloads.

&highly-skewed rapidly-changing

in-network caching to achieve

Summingitup…

40

Whydata-planeprogramming?1. Newfeatures:Realizeyourbeautifulideasveryquickly2. Reducecomplexity:Removeunnecessaryfeaturesandtables3. EfficientuseofH/Wresources:Achievebiggestbangforbuck4. Greatervisibility:Newdiagnostics,telemetry,OAM,etc.5. Modularity:Composeforwardingbehaviorfromlibraries6. Portability:Specifyforwardingbehavioronce;compiletomanydevices7. Ownyourownideas:Noneedtoshareyourideaswithothers

“Protocols are being lifted off chips and into software”– Ben Horowitz

41

• PISAandP4:Thefirstattempttodefineamachinearchitectureandprogrammingmodelsfornetworkinginadisciplinedway

• Networkisbecomingyetanotherprogrammableplatform

• It’sfuntofigureoutthebestworkloadsforthisnewmachinearchitecture

42

Myobservations

Wanttofindmoreresourcesorfollowup?• Visithttp://p4.org andhttp://github.com/p4lang

– P4languagespec– P4devtoolsandsampleprograms– P4tutorials

• JoinP4workshopsandP4developers’days• ParticipateinP4workinggroupactivities

– Language,targetarchitecture,runtimeAPI,applications• Needmoreexpertiseacrossvariousfieldsincomputerscience

– ToenhancePISA,P4,devtools(e.g.,forformalverification,equivalencecheck,andmanymore…)

43

Thanks.Let’sdevelopyourbeautifulideas

inP4!

44