Titan: Fair Packet Scheduling for Commodity Multiqueue NICs
Brent Stephens, Arjun Singhvi, Aditya Akella, and Mike Swift
July 13th, 2017
Ethernet line-rates are increasing!
Servers need:
• To drive increasing line-rates
• Low CPU utilization networking
Underlying mechanisms:
• Segmentation Offload
• Multiqueue NICs
Using large segments (64KB) instead of packets can reduce CPU load
[Figure: flows F1 and F2 and the packets they place on the wire]
TCP Segmentation Offload (TSO)
• Many operations performed by the OS are per-packet, not per-byte
• TSO allows the OS to send large segments to the NIC
• TSO NIC hardware generates packets from segments
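As a rough illustration of the last bullet, the sketch below cuts one large segment into wire-sized packets. The function name and the 1448-byte payload size are assumptions chosen for illustration; they do not come from the talk.

```python
def tso_split(segment_bytes, mss=1448):
    """Return the payload sizes of the packets cut from one segment.

    mss is an assumed per-packet payload size (hypothetical value).
    """
    if segment_bytes < 0 or mss <= 0:
        raise ValueError("segment_bytes must be >= 0 and mss must be > 0")
    full, rest = divmod(segment_bytes, mss)
    return [mss] * full + ([rest] if rest else [])


packets = tso_split(64 * 1024)
# The OS handles a single 64KB segment; the NIC emits len(packets) packets.
print(len(packets), sum(packets))
```

The per-packet work the OS would otherwise do is paid once per segment instead of once per entry in `packets`.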
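Multiqueue NICs enable parallelism
[Figure: Core 1 and Core 2 sending flows F1, F2, and F3. With a single queue the cores rely on locking/polling; with a multiqueue NIC they enqueue to TXQ-1 and TXQ-2, and the NIC packet scheduler puts packets on the wire]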
Fairness Problems
TSO and multiqueue cause pervasive unfairness
[Figure: Core 1 and Core 2 enqueue flows F1, F2, and F3 on TXQ-1 and TXQ-2 behind the NIC packet scheduler. The actual packet schedule on the wire differs from the fair packet schedule because of TSO unfairness and multiqueue unfairness]
Fairness is important
• Fairness is needed so competing applications can share the network
• Fairness is needed for predictability
  • Unfairness leads to unpredictable completion times across runs
  • Perfect fairness → perfect predictability
• Fairness can improve application performance
  • Ex: Weighted Coflow Scheduling [Chowdhury SIGCOMM 11, Chowdhury SIGCOMM 14]
Titan Goals:
• Drive increasing line-rates
• Low CPU utilization
• Per-flow fairness
• Work on commodity NICs
Multiqueue Fairness in Linux:
• Flow arrivals to each transmit queue are dynamic
• The OS statically uses a per-flow hash to assign flows to queues
• The NIC scheduler statically uses deficit round-robin (DRR) to provide per-queue fairness
• In the datacenter, the OS statically chooses a TSO size
Titan Design:
As flows dynamically arrive and complete, in Titan:
The OS dynamically:
• Assigns weights to flows
• Tracks the flow occupancy of queues
• Picks queues for flows
• Updates the NIC with queue weights
The NIC dynamically:
• Applies queue weights from the OS
Causes of Unfairness:
• Multiqueue unfairness
• TSO unfairness
Problem: Hash collisions
[Figure: flows F1, F2, and F3 hashed onto transmit queues TXQ-1, TXQ-2, and TXQ-3 with a collision; wire schedule F1 F3 F2 F1 F2 F2 F2 F3 (multiqueue unfairness)]
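To make the hash-collision problem concrete, here is a minimal sketch of static per-flow hashing. CRC32 is only a stand-in for whatever hash the OS actually uses, and the flow tuples are hypothetical.

```python
import zlib
from collections import Counter


def hash_queue(flow, num_queues):
    """Statically map a flow tuple to a queue index (CRC32 is a stand-in hash)."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_queues


def queue_load(flows, num_queues):
    """Count how many flows land on each queue."""
    return Counter(hash_queue(flow, num_queues) for flow in flows)


# Hypothetical flows: (src_ip, src_port, dst_ip, dst_port)
flows = [("10.0.0.1", 40000 + i, "10.0.0.2", 5001) for i in range(3)]
load = queue_load(flows, num_queues=3)
print(dict(load))  # any queue holding 2+ flows is a collision
```

Because the mapping ignores how busy each queue already is, two flows can share a queue while another queue sits empty.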
Problem: Hash collisions
Solution: Dynamic Queue Assignment (DQA)
• OS assigns a weight to each flow
• DQA picks the queue with the lowest occupancy when a flow starts
• Queue occupancies are updated:
  • Any time a flow starts enqueuing data
  • Any time a flow has no enqueued bytes (at most each TX interrupt)
[Figure: flows F1, F2, and F3 assigned to TXQ-1, TXQ-2, and TXQ-3 ahead of the NIC packet scheduler]
[Figure: with DQA, the wire schedule is F1 F3 F2 F1 F3 F2]
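The DQA rule can be sketched as a small occupancy tracker. The class and method names are hypothetical, and occupancy is modeled here as the sum of the weights of the flows in a queue, which is an assumption about how the occupancy is counted.

```python
class DynamicQueueAssigner:
    """Toy model of DQA: track per-queue occupancy as a sum of flow weights."""

    def __init__(self, num_queues):
        self.occupancy = [0] * num_queues
        self.placement = {}  # flow id -> (queue index, weight)

    def flow_start(self, flow, weight=1):
        """Place a flow that starts enqueuing data on the least-occupied queue."""
        if flow in self.placement:
            return self.placement[flow][0]
        queue = min(range(len(self.occupancy)), key=self.occupancy.__getitem__)
        self.occupancy[queue] += weight
        self.placement[flow] = (queue, weight)
        return queue

    def flow_idle(self, flow):
        """Release a flow's share once it has no enqueued bytes."""
        queue, weight = self.placement.pop(flow)
        self.occupancy[queue] -= weight


dqa = DynamicQueueAssigner(num_queues=3)
assigned = [dqa.flow_start(f) for f in ("F1", "F2", "F3")]
print(assigned, dqa.occupancy)
```

With three flows and three queues, each flow gets its own queue, which is the outcome the static hash cannot guarantee.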
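Problem: Asymmetric Oversubscription
[Figure: four flows (F1, F2, F3, F4) on three queues (TXQ-1, TXQ-2, TXQ-3), each queue with weight W:1; wire schedule F1 F3 F4 F1 F3 F4 F2 F3 F4 F2 F3 F4]
F1 and F2, which share a queue, receive half the throughput of F3 and F4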
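The arithmetic behind "half throughput" can be worked out directly. The sketch assumes that backlogged queues split the link in proportion to their weights and that flows inside one queue split that queue's share equally; the placement of F1 and F2 on the same queue is inferred from the slide.

```python
from fractions import Fraction


def flow_shares(queues, queue_weights):
    """Link share of each flow when backlogged queues split the link by
    weight and flows inside one queue split that queue's share equally."""
    active = [(flows, w) for flows, w in zip(queues, queue_weights) if flows]
    total = sum(w for _, w in active)
    shares = {}
    for flows, w in active:
        for flow in flows:
            shares[flow] = Fraction(w, total) / len(flows)
    return shares


queues = [["F1", "F2"], ["F3"], ["F4"]]
shares = flow_shares(queues, [1, 1, 1])
print({flow: str(share) for flow, share in shares.items()})
```

With equal queue weights, F1 and F2 each get 1/6 of the link while F3 and F4 each get 1/3.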
Problem: Asymmetric Oversubscription
Solution: Dynamic Queue Weight Assignment (DQWA)
• OS assigns weights to flows
• OS updates the NIC scheduler with queue occupancies as flows start and stop (at most each TX interrupt)
• NIC updates DRR weights
[Figure: the OS uses ndo_set_tx_weight to set the queue weights to W:2, W:1, W:1 for four flows (F1, F2, F3, F4) on TXQ-1, TXQ-2, and TXQ-3]
This is implementable on existing commodity NICs because it only needs to update DRR weights!
[Figure: wire schedule with DQWA queue weights W:2, W:1, W:1]
DQA and DQWA provide long-term fairness
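A small simulation shows why updating only the DRR weights is enough for long-term fairness. This is a simplified model, not the NIC's scheduler: packets are unit-sized, every flow is always backlogged, and the function names are hypothetical.

```python
from collections import Counter, deque


def dqwa_weights(queues, flow_weight):
    """Queue weight = sum of the weights of the flows currently in the queue."""
    return [sum(flow_weight[f] for f in flows) for flows in queues]


def drr_wire(queues, queue_weights, rounds):
    """Weighted deficit round-robin over always-backlogged flows, unit packets."""
    fifos = [deque(flows) for flows in queues]
    deficit = [0] * len(queues)
    wire = []
    for _ in range(rounds):
        for q, fifo in enumerate(fifos):
            if not fifo:
                continue
            deficit[q] += queue_weights[q]  # quantum scaled by queue weight
            while deficit[q] >= 1:
                flow = fifo.popleft()
                wire.append(flow)
                fifo.append(flow)  # the flow stays backlogged
                deficit[q] -= 1
    return wire


queues = [["F1", "F2"], ["F3"], ["F4"]]
flow_weight = {"F1": 1, "F2": 1, "F3": 1, "F4": 1}
weights = dqwa_weights(queues, flow_weight)
equal = Counter(drr_wire(queues, [1, 1, 1], rounds=4))
weighted = Counter(drr_wire(queues, weights, rounds=4))
print(weights)
print(dict(equal))
print(dict(weighted))
```

With equal queue weights F1 and F2 send half as many packets as F3 and F4; with weights derived from queue occupancy all four flows send the same number.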
Problem: TSO Unfairness
[Figure: four flows on TXQ-1, TXQ-2, and TXQ-3 with queue weights W:2, W:1, W:1; the wire schedule shows short-term unfairness]
• Short-term unfairness can cause bursts of congestion in the network
• Short-term unfairness can increase latency
Problem: TSO Unfairness
Solution: Dynamic Segmentation Offload Sizing (DSOS)
• DSOS dynamically changes the segment size during oversubscription
  • Same implementation as GSO
• CPU vs fairness tradeoff
  • Segmenting after the TCP/IP stack reduces CPU costs
[Figure: four flows on TXQ-1, TXQ-2, and TXQ-3 with queue weights W:2, W:1, W:1; with DSOS, flows are interleaved more finely on the wire]
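One way to picture DSOS is as a re-segmentation step below the TCP/IP stack. The policy below (shrink segments only when more than one flow shares a queue) is an assumed stand-in for the talk's notion of oversubscription, and the function names are hypothetical; the 64KB and 16KB sizes are the ones mentioned in the slides.

```python
TSO_MAX = 64 * 1024    # segment size from the TSO slide
DSOS_SIZE = 16 * 1024  # size used in the DSOS (16KB) evaluation


def dsos_segment_size(flows_in_queue):
    """Assumed policy: shrink segments only when a queue is oversubscribed."""
    return DSOS_SIZE if flows_in_queue > 1 else TSO_MAX


def resegment(segment_bytes, flows_in_queue):
    """Cut one segment handed down by the stack into DSOS-sized segments."""
    size = dsos_segment_size(flows_in_queue)
    full, rest = divmod(segment_bytes, size)
    return [size] * full + ([rest] if rest else [])


print(resegment(64 * 1024, flows_in_queue=1))
print(resegment(64 * 1024, flows_in_queue=2))
```

Smaller segments let the scheduler switch between flows more often, at the cost of more per-segment work on the CPU.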
Implementation
• DQA, DQWA, and DSOS are implemented in Linux 4.4.6
• Support for ndo_set_tx_weight is implemented in the Intel ixgbe driver for the Intel 82599 10Gbps NIC
• Titan is open source!
https://github.com/bestephe/titan
Evaluation
• Microbenchmarks
  • 2 servers, 1 switch
  • 8 queue NICs
  • Vary number of flows (level of oversubscription)
• Incremental fairness benefits of DQA, DQWA, and DSOS
  • DQA and DQWA: expected to improve long-term fairness
  • DSOS: expected to improve short-term fairness
Evaluation – Fairness Metric
Metrics:
• Normalized fairness metric (NFM) inspired by Shreedhar and Varghese:
  • NFM = 0 is fair
  • NFM > 1 is very unfair
[Figure: an ideal packet schedule on the wire (NFM = 0) compared with an unfair packet schedule (NFM = 1)]
NFM = (Bytes(MaxFlow) - Bytes(MinFlow)) / Bytes(FairShare)
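The metric is simple to compute from per-flow byte counts. The sketch assumes the fair share is an equal split of all bytes sent among the flows, which the slide does not spell out; the byte counts are made up for illustration.

```python
def nfm(bytes_per_flow):
    """(Bytes(MaxFlow) - Bytes(MinFlow)) / Bytes(FairShare).

    Assumes the fair share is an equal split of all bytes sent.
    """
    sent = list(bytes_per_flow.values())
    fair_share = sum(sent) / len(sent)
    if fair_share == 0:
        return 0.0
    return (max(sent) - min(sent)) / fair_share


print(nfm({"F1": 100, "F2": 100, "F3": 100}))
print(nfm({"F1": 50, "F2": 200, "F3": 50}))
```

The first call reports a perfectly fair split; the second lands above 1, the range the slide labels very unfair.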
Microbenchmarks – 1s Timescale
[Figure: NFM at a 1s timescale versus number of flows (6, 12, 24, 48) for Linux, DQA, DQA+DQWA, and DQA+DQWA+DSOS (16KB)]
• Linux is unfair at all subscription levels
• DQA often significantly improves fairness
  • At 48 flows, flow churn prevents DQA from evenly spreading flows
• DQWA improves fairness when DQA cannot evenly spread flows across queues
• DSOS does not have a significant impact on long-term fairness
Microbenchmarks – 1ms Timescale
[Figure: NFM at a 1ms timescale versus number of flows (6, 12, 24, 48) for Linux, DQA, DQA+DQWA, and DQA+DQWA+DSOS (16KB)]
• At short timescales and under oversubscription, DQA and DQWA do not significantly improve fairness
  • TSO is the primary cause of unfairness
• DSOS (16KB) often reduces unfairness by >2x
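The difference between the two timescales can be reproduced with a toy schedule. The sketch evaluates the NFM formula over fixed-size windows of equal-size packets; the window sizes and schedules are made up for illustration and are not the measured data.

```python
from collections import Counter


def worst_window_nfm(wire, flows, window):
    """Largest NFM over consecutive windows of `window` equal-size packets."""
    worst = 0.0
    for start in range(0, len(wire) - window + 1, window):
        sent = Counter(wire[start:start + window])
        counts = [sent[flow] for flow in flows]  # 0 for flows that sent nothing
        fair_share = window / len(flows)
        worst = max(worst, (max(counts) - min(counts)) / fair_share)
    return worst


flows = ["F1", "F2"]
bursty = (["F1"] * 4 + ["F2"] * 4) * 2  # large segments sent back to back
interleaved = ["F1", "F2"] * 8          # same bytes, finer interleaving
print(worst_window_nfm(bursty, flows, window=16))
print(worst_window_nfm(bursty, flows, window=4))
print(worst_window_nfm(interleaved, flows, window=4))
```

The bursty schedule is fair over the long window but very unfair over the short one, while the finely interleaved schedule is fair at both.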
Cluster Experiments
CDF of completion times in a 1GB all-to-all shuffle (24 servers)
[Figure: cumulative probability versus flow completion time (s) for (a) 6 servers, (b) 12 servers, and (c) 24 servers; series: Vanilla, Vanilla (Cmax), Titan]
Titan improves fairness without changing the network core!
• Ideal CDF would be a vertical line
• Titan makes performance more predictable
• Titan improves tail performance (>90th percentile)
Additional Evaluation
Additional performance metrics:
• Throughput: line-rate
• Latency: no significant change
• CPU Utilization:
  • DQA and DQWA: increase <10%
  • DSOS is better than statically decreasing the TSO size
  • DSOS motivates creating a better TSO implementation (zero-copy)
Linux network configuration trade-off study
• See paper
Summary
• Multiqueue NICs can lead to significant flow-level unfairness
• Titan significantly improves fairness by allowing the OS to dynamically interact with the NIC packet scheduler
• Titan is implementable on commodity NICs!
https://github.com/bestephe/titan