ProgrammingTheNetworkDataPlane
ChanghoonKim
Beautifulideas:Whatifyoucould…• Realizeasmall,butsuper-fastDNScache• PerformTCPSYNauthenticationforbillionsofSYNspersec• Buildareplicatedkey-valuestoreensuringRWopsinafewusecs• Improveyourconsensusserviceperformanceby~100x• BoostyourMemcached cluster’sthroughputby~10x• SpeedupyourDNNtrainingdramaticallybyrealizingparameter
servers
2
… usingswitches inyournetwork?
Youcouldn’tdoanyofthosesofarbecause…
• NoDIY– mustworkwithvendorsatfeature level• Excruciatinglycomplicatedandinvolvedprocesstobuild
consensusandpressureforfeatures• Painfullylongandunpredictableleadtime• Tousenewfeatures,youmustgetnewswitches• Whatyoufinallyget!=whatyouaskedfor
3
Thisisveryunnaturaltodevelopers• Becauseyouallknowhowtorealizeyourownideasby
“programming”CPUs– Programsusedineveryphase(implement,test,anddeploy)– Extremelyfastiterationanddifferentiation– Youownyourownideas– Asustainableecosystemwhereallparticipantsbenefit
4
Canwereplicatethishealthy,sustainableecosystemfornetworking?
Reality:Packetforwardingspeeds
0.1
1
10
100
1000
10000
100000
1990 1995 2000 2005 2010 2015 2020
Switch ChipCPU
5
Gb/s(perchip)
6.4Tb/s
Reality:Packetforwardingspeeds
0.1
1
10
100
1000
10000
100000
1990 1995 2000 2005 2010 2015 2020
Switch ChipCPU
6
80xGb/s(perchip)
6.4Tb/s
7
Whatdoesatypicalswitchlooklike?
Chip Driver
Run-time API
Protocol Daemons(BGP, OSPF, etc.)
Other MgmtApps
Data plane
Control plane
SwitchOS(Linuxvariant)
PCIe
…
L2 Forwarding
Table
L3 Routing
Table
ACLTable
…
A switch is just a Linux box with a high-speed switching chip
packets packets
JustS/W-- Youcanfreelychangethis
Fixed-functionH/W-- There’snothing
youcanchangehere
Networkingsystemshavebeenbuilt“bottoms-up”
SwitchOS
“ThisisroughlyhowIprocesspackets…”
Fixed-functionswitch
inEnglish
API
Turningthetables“top-down”
SwitchOS
“Thisispreciselyhowyoumustprocesspackets”
ProgrammableSwitch
inP4
API
“Programmableswitchesare10-100xslowerthanfixed-functionswitches.Theycostmore
andconsumemorepower.”Conventionalwisdominnetworking
Evidence:Tofino6.5Tb/sswitch(arrivedDec2016)
Theworld’sfastestand mostprogrammableswitch.Nopower,cost,orpowerpenaltycomparedtofixed-functionswitches.AnincarnationofPISA(ProtocolIndependentSwitchArchitecture)
Domain-specificprocessors
CPU
Computers
JavaCompiler
GPU
Graphics
OpenCLCompiler
DSP
SignalProcessing
MatlabCompiler
MachineLearning
?
TPU
TensorFlow
Compiler
Networking
?
LanguageCompiler>>>
CPU
Computers
JavaCompiler
GPU
Graphics
OpenCLCompiler
DSP
SignalProcessing
MatlabCompiler
MachineLearning
?
TPU
TensorFlow
Compiler
PISA
Networking
P4Compiler>>>
Domain-specificprocessors
PISA: An architecture for high-speed programmable packet forwarding
14
15
Prog
ram
mab
lePa
rser
MatchMemory
ActionALU
PISA: Protocol Independent Switch Architecture
16
Prog
ram
mab
lePa
rser
PISA: Protocol Independent Switch Architecture
Ingress EgressBuffer
BufferM M
17
Prog
ram
mab
lePa
rser
PISA: Protocol Independent Switch ArchitectureMatch Logic(Mix of SRAM and TCAM for lookup tables, counters, meters, generic hash tables)
Action Logic(ALUs for standard boolean and arithmetic operations, header modification operations, hashing operations, etc.)
Recirculation
ProgrammablePacketGenerator
CPU (Control plane)
A
…
A
…
Ingress match-action stages (pre-switching) Egress match-action stages (post-switching)
Generalization of RMT [sigcomm’13]
Why we call itprotocol-independent packet
processing
18
Logical Data-plane View(your P4 program)Switch Pipeline
Device does not understand any protocols until it gets programmed
Queues
Prog
ram
mab
lePa
rser
Fixe
d Ac
tion
Mat
ch T
able
Mat
ch T
able
Mat
ch T
able
Mat
ch T
able
L2IPv4
IPv6ACL
Actio
n AL
Us
Actio
n AL
Us
Actio
n AL
Us
Actio
n AL
Uspacketpacket packetpacket
CLK19
Mat
ch T
able
Actio
n AL
Us
Mapping logical data-plane design to physical resources
Queues
Mat
ch T
able
Mat
ch T
able
Mat
ch T
able
L2 T
able
IPv4
Tab
le
IPv6
Tab
le
ACL
Tabl
e
Actio
n AL
Us
Actio
n AL
Us
Actio
n AL
Us
L2IPv4
IPv6ACL
Logical Data-plane View (your P4 program)Switch Pipeline
L2IPv6
ACLIPv4
L2 A
ctio
n M
acro
v4 A
ctio
n M
acro
v6 A
ctio
n M
acro
ACL
Actio
n M
acro
Prog
ram
mab
lePa
rser
CLK20
Re-program in the field
L2 T
able
IPv4
Tab
le
ACL
Tabl
e
IPv6
Tab
le
My
Enca
p
L2IPv4
IPv6ACLMyEncap
L2 A
ctio
n M
acro
v4 A
ctio
n M
acro
ACL
Actio
n M
acro
Actio
n
MyEncap
v6 A
ctio
n M
acro
IPv4
Actio
n
IPv4
Actio
n
IPv6
Actio
n
IPv6
Prog
ram
mab
lePa
rser
CLK
Logical Data-plane View (your P4 program)Switch Pipeline
Queues
21
P
4
:
P
r
o
g
r
a
m
m
i
n
g
P
r
o
t
o
c
o
l
-
I
n
d
e
p
e
n
d
e
n
t
P
a
c
k
e
t
P
r
o
c
e
s
s
o
r
s
P
a
t
B
o
s
s
h
a
r
t
†
,
D
a
n
D
a
l
y
*,
G
l
e
n
G
i
b
b
†
,
M
a
r
t
i
n
I
z
z
a
r
d
†
,
N
i
c
k
M
c
K
e
o
w
n
‡
,
J
e
n
n
i
f
e
r
R
e
x
f
o
r
d
**,
C
o
l
e
S
c
h
l
e
s
i
n
g
e
r
**,
D
a
n
T
a
l
a
y
c
o
†
,
A
m
i
n
V
a
h
d
a
t
¶
,
G
e
o
r
g
e
V
a
r
g
h
e
s
e
§
,
D
a
v
i
d
W
a
l
k
e
r
**
†
B
a
r
e
f
o
o
t
N
e
t
w
o
r
k
s *I
n
t
e
l
‡
S
t
a
n
f
o
r
d
U
n
i
v
e
r
s
i
t
y **P
r
i
n
c
e
t
o
n
U
n
i
v
e
r
s
i
t
y ¶
G
o
o
g
l
e §
M
i
c
r
o
s
o
f
t
R
e
s
e
a
r
c
h
ABSTRACTP4 is a high-level language for programming protocol-inde-
pendent packet processors. P4 works in conjunction with
SDN control protocols like OpenFlow. In its current form,
OpenFlow explicitly specifies protocol headers on which it
operates. This set has grown from 12 to 41 fields in a few
years, increasing the complexity of the specification while
still not providing the flexibility to add new headers. In this
paper we propose P4 as a strawman proposal for how Open-
Flow should evolve in the future. We have three goals: (1)
Reconfigurability in the field: Programmers should be able
to change the way switches process packets once they are
deployed. (2) Protocol independence: Switches should not
be tied to any specific network protocols. (3) Target inde-
pendence: Programmers should be able to describe packet-
processing functionality independently of the specifics of the
underlying hardware. As an example, we describe how to
use P4 to configure a switch to add a new hierarchical label.
1. INTRODUCTIONSoftware-Defined Networking (SDN) gives operators pro-
grammatic control over their networks. In SDN, the con-
trol plane is physically separate from the forwarding plane,
and one control plane controls multiple forwarding devices.
While forwarding devices could be programmed in many
ways, having a common, open, vendor-agnostic interface
(like OpenFlow) enables a control plane to control forward-
ing devices from di↵erent hardware and software vendors.
VersionDate
Header Fields
OF 1.0Dec 2009 12 fields (Ethernet, TCP/IPv4)
OF 1.1Feb 2011 15 fields (MPLS, inter-table metadata)
OF 1.2Dec 2011 36 fields (ARP, ICMP, IPv6, etc.)
OF 1.3Jun 2012 40 fields
OF 1.4Oct 2013 41 fields
Table 1: Fields recognized by the OpenFlow standard
The OpenFlow interface started simple, with the abstrac-
tion of a single table of rules that could match packets on a
dozen header fields (e.g., MAC addresses, IP addresses, pro-
tocol, TCP/UDP port numbers, etc.). Over the past five
years, the specification has grown increasingly more com-
plicated (see Table 1), with many more header fields and
multiple stages of rule tables, to allow switches to expose
more of their capabilities to the controller.
The proliferation of new header fields shows no signs of
stopping. For example, data-center network operators in-
creasingly want to apply new forms of packet encapsula-
tion (e.g., NVGRE, VXLAN, and STT), for which they re-
sort to deploying software switches that are easier to extend
with new functionality. Rather than repeatedly extending
the OpenFlow specification, we argue that future switches
should support flexible mechanisms for parsing packets and
matching header fields, allowing controller applications to
leverage these capabilities through a common, open inter-
face (i.e., a new “OpenFlow 2.0” API). Such a general, ex-
tensible approach would be simpler, more elegant, and more
future-proof than today’s OpenFlow 1.x standard.
Figure 1: P4 is a language to configure switches.
Recent chip designs demonstrate that such flexibility can
be achieved in custom ASICs at terabit speeds [1, 2, 3]. Pro-
gramming this new generation of switch chips is far from
easy. Each chip has its own low-level interface, akin to
microcode programming. In this paper, we sketch the de-
sign of a higher-level language for Programming Protocol-
independent Packet Processors (P4). Figure 1 shows the
relationship between P4—used to configure a switch, telling
it how packets are to be processed—and existing APIs (such
as OpenFlow) that are designed to populate the forwarding
tables in fixed function switches. P4 raises the level of ab-
straction for programming the network, and can serve as a
ACM SIGCOMM Computer Communication Review88
Volume 44, Number 3, July 2014
What does a P4 program look like?
L2IPv4
ACLMyEncapIPv6
header_type ethernet_t {fields {
dstAddr : 48;srcAddr : 48;etherType : 16;
}}
parser parse_ethernet {extract(ethernet);return select(latest.etherType) {
0x8100 : parse_vlan;0x800 : parse_ipv4;0x86DD : parse_ipv6;
}}
TCP
IPv4 IPv6
MyEncapEth
header_type my_encap_t {fields {
foo : 12;bar : 8;baz : 4;qux : 4; next_protocol : 4;
}}
23
What does a P4 program look like?
L2IPv4
ACLMyEncapIPv6
table ipv4_lpm {
reads {ipv4.dstAddr : lpm;
}actions {
set_next_hop; drop; }
}
action set_next_hop(nhop_ipv4_addr, port) {
modify_field(metadata.nhop_ipv4_addr, nhop_ipv4_addr); modify_field(standard_metadata.egress_port, port); add_to_field(ipv4.ttl, -1);
}
control ingress {
apply(l2);apply(my_encap);if (valid(ipv4) {
apply(ipv4_lpm);} else {
apply(ipv6_lpm);}apply(acl);
}
24
P4.org (http://p4.org)§ Open-source community to nurture the language
§ Open-source software – Apache license§ A common language: P416
§ Support for various types of devices and targets§ Enable a wealth of innovation
§ Diverse “apps” (including proprietary ones) running on commodity targets
§ With no barrier to entry§ Free of membership fee, free of commitment, and simple licensing
25
So,whatkindsofexcitingnewopportunitiesarearising?
26
Thenetworkshouldanswerthesequestions
1. “Whichpathdidmypackettake?”2. “Whichrulesdidmypacketfollow?”3. “Howlongdiditqueueateachswitch?”4. “Whodiditsharethequeueswith?”
PISA+P4cananswerallfourquestionsforthefirsttime.Atfulllinerate.Withoutgeneratinganyadditionalpackets!
1234
Log,AnalyzeReplay
Add:SwitchID,ArrivalTime,QueueDelay,MatchedRules,…
OriginalPacket
Visualize
In-bandNetworkTelemetry(INT)Aread-onlyversionofTinyPacketPrograms[sigcomm’14]
AquickdemoofINT!
29
30
Whatdoesthismeantoyou?• Improveyourdistributedapps’performancewithtelemetrydata
• Askthefourkeyquestionsregardingyourpacketstonetworkadminsorcloudproviders
• HugeopportunitiesforBig-dataprocessingandmachine-learningexperts
• “Self-driving”networkisnothyperbole
31
PISA: An architecture for high-speed programmable packet forwarding
32
event processing
Whatwehaveseensofar:Addingnewnetworkingfeatures
1. Newencapsulationsandtunnels2. Newwaystotagpacketsforspecialtreatment3. Newapproachestorouting:e.g.,sourceroutingindata-
centernetworks4. Newapproachestocongestioncontrol5. Newwaystomanipulateandforwardpackets:e.g.splitting
tickersymbolsforhigh-frequencytrading
33
Whatwehaveseensofar:World’sfastestmiddleboxes
1. Layer-4loadconnectionbalancingatTb/s– Replace100sofserversor10sofdedicatedapplianceswithonePISAswitch– Trackandmaintainmappingsfor5~10millionHTTPconnections
2. StatelessfirewallorDDoSdetector– Add/deleteandtrack100softhousandsofnewconnectionspersecond– Includeotherstatelessline-ratefunctions
(e.g.,TCPSYNauthentication,sketches,orBloomfilter-basedwhitelisting)
34
Whatwehaveseensofar:Offloadingpartofcomputingtonetwork
1. DNScache2. Key-valuecache[ACMSOSP’17]
3. Chainreplication4. Paxos [ACMCCR’16]andRAFT5. ParameterserviceforDNNtraining
35
Example:NetCache
Clients
Key-Value Cache
Query Statistics
High-performance Storage Servers
Key-Value Storage RackController
L2/L3 Routing
ToR Switch Data plane
• Non-goal– Maximizethecachehitrate
• Goal– Balancetheworkloadsofbackendserversbyserving
onlyO(NlogN)hotitems-- N isthenumberofbackendservers
– Makethe“fast,small-cache”theoryviableformodernin-memoryKVservers[Fanet.al.,SOCC’11]
• Dataplane– Unmodifiedrouting– Key-valuecachebuiltwithon-chipSRAM– Querystatisticstodetecthotitems
• Controlplane– Updatecachewithhotitemstohandledynamic
workloads
The“boringlife”ofaNetCache switch
test the switch performance at full traffic load. The valueprocess is executed each time when the packet passes anegress port. To avoid packet size keeps increasing for readqueries, we remove the value field at the last egress stagefor all intermediate ports. The servers can still verify thevalues as they are kept in the two ports connected to them.
• Server rotation for static workloads (§6.3). We use onemachine as a client, and the other as a storage server. Weinstall the hot items in the switch cache as for a full stor-age rack and have the client send traffic according to a Zipfdistribution. For each experiment, the storage server takesone key-value partition and runs as one node in the rack.By rotating the storage server for all 128 partitions (i.e.,performing the experiment for 128 times), we aggregatethe results to obtain the result for the entire rack. Suchresult aggregation is justified by (i) the shared-nothingarchitecture of key-value stores and (ii) the microbench-mark that demonstrates the switch is not the bottleneck.
To find the maximum effective system throughput, wefirst find the bottleneck partition and use that server in thefirst iteration. The client generates queries destined to thisparticular partition, and adjusts its sending rate to controlthe packet loss rate between 0.5% to 1%. This sending rategives the saturated throughput of the bottleneck partition.We obtain the traffic load for the full system based on thissending rate, and use this load to generate per-partitionquery load for remaining partitions. Since the remainingpartitions are not the bottleneck partition, they should beable to fully serve the load. We sum up the throughputs ofall partitions to obtain the aggregate system throughput.
• Server emulation for dynamic workloads (§6.4). Serverrotation is not suitable for evaluating dynamic workloads.This is because we would like to measure the transient be-havior of the system, i.e., how the system performancefluctuates during cache updates, rather than the systemperformance at the stable state. To do this, we emulate128 storage servers on one server by using 128 queues.Each queue processes queries for one key-value partitionand drops queries if the received queries exceed its pro-cessing rate. To evaluate the real-time system throughput,the client tracks the packet loss rate, and adjusts its send-ing rate to keep the loss rate between 0.5% to 1%. Theaggregate throughput is scaled down by a factor of 128.Such emulation is reasonable because in these experimentswe are more interested in the relative performance fluctu-ations when NetCache reacts to workload changes, ratherthan the absolute performance numbers.
6.2 Switch MicrobenchmarkWe first show switch microbenchmark results using snake
test (as described in §6.1). We demonstrate that NetCache isable to run on programmable switches at line rate.
Throughput vs. value size. We populate the switch cachewith 64K items and vary the value size. Two servers and
0 32 64 96 1289alue 6ize (Byte)
0.0
0.5
1.0
1.5
2.0
2.5
Thro
ughS
ut (B
43
6)
(a) Throughput vs. value size. (b) Throughput vs. cache size.
Figure 9: Switch microbenchmark (read and update).one switch are organized to a snake structure. The switchis configured to provide 62 100Gbps ports, and two 40Gbpsports to connect servers. We let the two servers send cacheread and update queries to each other and measure the maxi-mum throughput. Figure 9(a) shows the switch provides 2.24BQPS throughput for value size up to 128 bytes. This is bot-tlenecked by the maximum sending rate of the servers (35MQPS). The Barefoot Tofino switch is able to achieve morethan 4 BQPS. The throughput is not affected by the value sizeor the read/update ratio. This is because the switch ASIC isdesigned to process packets with strict timing requirements.As long as our P4 program is complied to fit the hardwareresources, the data plane can process packets at line rate.
Our current prototype supports value size up to 128 bytes.Bigger values can be supported by using more stages or usingpacket mirroring for a second round of process (§4.4.2).
Throughput vs. cache size. We use 128 bytes as the valuesize and change the cache size. Other settings are the sameas the previous experiment. Similarly, Figure 9(b) shows thatthe throughput keeps at 2.24 BQPS and is not affected by thecache size. Since our current implementation allocates 8 MBmemory for the cache, the cache size cannot be larger than64K for 128-byte values. We note that caching 64K items issufficient for balancing a key-value storage rack.
6.3 System PerformanceWe now present the system performance of a NetCache
key-value storage rack that contains one switch and 128 stor-age servers using server rotation (as described in §6.1).
Throughput. Figure 10(a) shows the system throughput un-der different skewness parameters with read-only queries and10,000 items in the cache. We compare NetCache with No-Cache which does not have the switch cache. In addition,we also show the the portions of the NetCache throughputprovided by the cache and the storage servers respectively.NoCache performs poorly when the workload is skewed.Specifically, with Zipf 0.95 (0.99) distribution, the NoCachethroughput drops down to only 22.5% (15.6%), compared tothe throughput under the uniform workload. By introducingonly a small cache, NetCache effectively reduces the loadimbalances and thus improves the throughput. Overall, Net-Cache improves the throughput by 3.6⇥, 6.5⇥, and 10⇥ overNoCache, under Zipf 0.9, 0.95 and 0.99, respectively.
10
test the switch performance at full traffic load. The valueprocess is executed each time when the packet passes anegress port. To avoid packet size keeps increasing for readqueries, we remove the value field at the last egress stagefor all intermediate ports. The servers can still verify thevalues as they are kept in the two ports connected to them.
• Server rotation for static workloads (§6.3). We use onemachine as a client, and the other as a storage server. Weinstall the hot items in the switch cache as for a full stor-age rack and have the client send traffic according to a Zipfdistribution. For each experiment, the storage server takesone key-value partition and runs as one node in the rack.By rotating the storage server for all 128 partitions (i.e.,performing the experiment for 128 times), we aggregatethe results to obtain the result for the entire rack. Suchresult aggregation is justified by (i) the shared-nothingarchitecture of key-value stores and (ii) the microbench-mark that demonstrates the switch is not the bottleneck.
To find the maximum effective system throughput, wefirst find the bottleneck partition and use that server in thefirst iteration. The client generates queries destined to thisparticular partition, and adjusts its sending rate to controlthe packet loss rate between 0.5% to 1%. This sending rategives the saturated throughput of the bottleneck partition.We obtain the traffic load for the full system based on thissending rate, and use this load to generate per-partitionquery load for remaining partitions. Since the remainingpartitions are not the bottleneck partition, they should beable to fully serve the load. We sum up the throughputs ofall partitions to obtain the aggregate system throughput.
• Server emulation for dynamic workloads (§6.4). Serverrotation is not suitable for evaluating dynamic workloads.This is because we would like to measure the transient be-havior of the system, i.e., how the system performancefluctuates during cache updates, rather than the systemperformance at the stable state. To do this, we emulate128 storage servers on one server by using 128 queues.Each queue processes queries for one key-value partitionand drops queries if the received queries exceed its pro-cessing rate. To evaluate the real-time system throughput,the client tracks the packet loss rate, and adjusts its send-ing rate to keep the loss rate between 0.5% to 1%. Theaggregate throughput is scaled down by a factor of 128.Such emulation is reasonable because in these experimentswe are more interested in the relative performance fluctu-ations when NetCache reacts to workload changes, ratherthan the absolute performance numbers.
6.2 Switch MicrobenchmarkWe first show switch microbenchmark results using snake
test (as described in §6.1). We demonstrate that NetCache isable to run on programmable switches at line rate.
Throughput vs. value size. We populate the switch cachewith 64K items and vary the value size. Two servers and
(a) Throughput vs. value size.
0 16. 32. 48. 64.CacKe 6ize
0.0
0.5
1.0
1.5
2.0
2.5
TKro
ugKS
ut (B
43
6)
(b) Throughput vs. cache size.
Figure 9: Switch microbenchmark (read and update).one switch are organized to a snake structure. The switchis configured to provide 62 100Gbps ports, and two 40Gbpsports to connect servers. We let the two servers send cacheread and update queries to each other and measure the maxi-mum throughput. Figure 9(a) shows the switch provides 2.24BQPS throughput for value size up to 128 bytes. This is bot-tlenecked by the maximum sending rate of the servers (35MQPS). The Barefoot Tofino switch is able to achieve morethan 4 BQPS. The throughput is not affected by the value sizeor the read/update ratio. This is because the switch ASIC isdesigned to process packets with strict timing requirements.As long as our P4 program is complied to fit the hardwareresources, the data plane can process packets at line rate.
Our current prototype supports value size up to 128 bytes.Bigger values can be supported by using more stages or usingpacket mirroring for a second round of process (§4.4.2).
Throughput vs. cache size. We use 128 bytes as the valuesize and change the cache size. Other settings are the sameas the previous experiment. Similarly, Figure 9(b) shows thatthe throughput keeps at 2.24 BQPS and is not affected by thecache size. Since our current implementation allocates 8 MBmemory for the cache, the cache size cannot be larger than64K for 128-byte values. We note that caching 64K items issufficient for balancing a key-value storage rack.
6.3 System PerformanceWe now present the system performance of a NetCache
key-value storage rack that contains one switch and 128 stor-age servers using server rotation (as described in §6.1).
Throughput. Figure 10(a) shows the system throughput un-der different skewness parameters with read-only queries and10,000 items in the cache. We compare NetCache with No-Cache which does not have the switch cache. In addition,we also show the the portions of the NetCache throughputprovided by the cache and the storage servers respectively.NoCache performs poorly when the workload is skewed.Specifically, with Zipf 0.95 (0.99) distribution, the NoCachethroughput drops down to only 22.5% (15.6%), compared tothe throughput under the uniform workload. By introducingonly a small cache, NetCache effectively reduces the loadimbalances and thus improves the throughput. Overall, Net-Cache improves the throughput by 3.6⇥, 6.5⇥, and 10⇥ overNoCache, under Zipf 0.9, 0.95 and 0.99, respectively.
10
Onecanfurtherincreasethevaluesizeswithmorestages,recirculation,ormirroring.
Yes,it’sBillionQueriesPerSec,notatypoJ
Andits“notsoboring”benefits
NetCacheprovides3-10xthroughputimprovements.
Throughput of a key-value storage rack with one Tofino switch and 128 storage servers.
uQiforP ziSf-0.9 ziSf-0.95 ziSf-0.99WorNloDd DisWribuWioQ
0.0
0.5
1.0
1.5
2.0
Thro
ughS
uW (B
QP
S)
1oCDche 1eWCDche(servers) 1eWCDche(cDche)
NetCache is a key-value store that leverages
&Billions of queries/sec a few usec latency
even under
workloads.
&highly-skewed rapidly-changing
in-network caching to achieve
Summingitup…
40
Whydata-planeprogramming?1. Newfeatures:Realizeyourbeautifulideasveryquickly2. Reducecomplexity:Removeunnecessaryfeaturesandtables3. EfficientuseofH/Wresources:Achievebiggestbangforbuck4. Greatervisibility:Newdiagnostics,telemetry,OAM,etc.5. Modularity:Composeforwardingbehaviorfromlibraries6. Portability:Specifyforwardingbehavioronce;compiletomanydevices7. Ownyourownideas:Noneedtoshareyourideaswithothers
“Protocols are being lifted off chips and into software”– Ben Horowitz
41
• PISAandP4:Thefirstattempttodefineamachinearchitectureandprogrammingmodelsfornetworkinginadisciplinedway
• Networkisbecomingyetanotherprogrammableplatform
• It’sfuntofigureoutthebestworkloadsforthisnewmachinearchitecture
42
Myobservations
Wanttofindmoreresourcesorfollowup?• Visithttp://p4.org andhttp://github.com/p4lang
– P4languagespec– P4devtoolsandsampleprograms– P4tutorials
• JoinP4workshopsandP4developers’days• ParticipateinP4workinggroupactivities
– Language,targetarchitecture,runtimeAPI,applications• Needmoreexpertiseacrossvariousfieldsincomputerscience
– ToenhancePISA,P4,devtools(e.g.,forformalverification,equivalencecheck,andmanymore…)
43
Thanks.Let’sdevelopyourbeautifulideas
inP4!
44
Top Related