Titan: Fair Packet Scheduling for Commodity Multiqueue NICs

Brent Stephens, Arjun Singhvi, Aditya Akella, and Mike Swift

July 13th, 2017

Ethernet line-rates are increasing!

Servers need:

• To drive increasing line-rates
• Low CPU utilization networking

Underlying mechanisms:

• Segmentation Offload
• Multiqueue NICs

TCP Segmentation Offload (TSO)

Using large segments (64KB) instead of packets can reduce CPU load

[Figure: flows F1 and F2 sent to the wire as individual packets (F1 F2 F1 F2) versus as large segments (F1, F2).]

• Many operations performed by the OS are per-packet, not per-byte
• TSO allows the OS to send large segments to the NIC
• TSO NIC hardware generates packets from segments
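
A minimal sketch of why segments reduce per-packet work, assuming a 1460-byte payload per packet (the slide gives only the 64KB segment size). The function name `tso_split` is hypothetical and models what the NIC hardware does, not the driver code.

```python
MSS = 1460            # assumed TCP payload bytes per packet (1500-byte MTU)
SEGMENT = 64 * 1024   # 64KB segment size from the slide

def tso_split(segment_len, mss=MSS):
    """Return the payload sizes of the packets generated from one segment."""
    sizes = []
    remaining = segment_len
    while remaining > 0:
        size = min(mss, remaining)
        sizes.append(size)
        remaining -= size
    return sizes

packets = tso_split(SEGMENT)
os_ops_without_tso = len(packets)  # one pass through the OS stack per packet
os_ops_with_tso = 1                # one pass through the OS stack per segment
```

Under these assumptions one 64KB segment becomes 45 packets on the wire, so per-packet OS operations run once instead of 45 times.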

Multiqueue NICs enable parallelism

[Figure: Core 1 and Core 2 enqueue flows F1, F2, and F3 into transmit queues TXQ-1 and TXQ-2; the NIC packet scheduler drains the queues onto the wire. Label: Locking/Polling.]

Fairness Problems

TSO and multiqueue cause pervasive unfairness

[Figure: Core 1 and Core 2 enqueue F1, F2, and F3 into TXQ-1 and TXQ-2, and the NIC packet scheduler drains them onto the wire.
Fair packet schedule: F3 F1 F2 F1 F3 F2 F1 F3 F2
Actual packet schedule: F1 F3 F2 F1 F2 F2 F2 F3 (TSO unfairness and multiqueue unfairness)]

Fairness is important

• Fairness is needed so competing applications can share the network
• Fairness is needed for predictability
  • Unfairness leads to unpredictable completion times across runs
  • Perfect fairness → perfect predictability
• Fairness can improve application performance
  • Ex: Weighted Coflow Scheduling [Chowdhury SIGCOMM11, Chowdhury SIGCOMM14]

Titan Goals:

• Drive increasing line-rates
• Low CPU utilization
• Per-flow fairness
• Work on commodity NICs

Multiqueue Fairness in Linux:

• Flow arrivals to each transmit queue are dynamic
• The OS statically uses a per-flow hash to assign flows to queues
• The NIC scheduler statically uses deficit round-robin (DRR) to provide per-queue fairness
• In the datacenter, the OS statically chooses a TSO size
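
To make the per-queue fairness of deficit round-robin concrete, here is a small simulation. It is a textbook-style sketch, not the NIC's hardware scheduler; the `quantum`, `rounds`, and packet sizes are assumptions chosen for illustration.

```python
from collections import deque

def drr(queues, weights, quantum=1500, rounds=100):
    """Serve queues by deficit round-robin; return bytes sent per queue."""
    deficits = [0] * len(queues)
    sent = [0] * len(queues)
    for _ in range(rounds):
        for i, q in enumerate(queues):
            if not q:
                deficits[i] = 0          # an idle queue keeps no credit
                continue
            deficits[i] += quantum * weights[i]
            # Send head-of-line packets while the deficit covers them.
            while q and q[0] <= deficits[i]:
                size = q.popleft()
                deficits[i] -= size
                sent[i] += size
    return sent

def backlog(n=1000, size=1500):
    """A queue holding n packets of the given size."""
    return deque([size] * n)

equal = drr([backlog(), backlog()], weights=[1, 1])
weighted = drr([backlog(), backlog()], weights=[2, 1])
```

With equal weights both queues send the same number of bytes no matter how many flows each one holds, which is why per-queue fairness alone does not give per-flow fairness.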

Titan Design:

As flows dynamically arrive and complete, in Titan:

The OS dynamically:
• Assigns weights to flows
• Tracks the flow occupancy of queues
• Picks queues for flows
• Updates the NIC with queue weights

The NIC dynamically:
• Applies queue weights from the OS

Causes of Unfairness:

• Multiqueue unfairness
• TSO unfairness

Problem: Hash collisions

[Figure: three transmit queues (TXQ-1 to TXQ-3) holding flows F1, F2, and F3 feed the NIC packet scheduler. Wire schedule: F1 F3 F2 F1 F2 F2 F2 F3 (multiqueue unfairness).]
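
A sketch of how a static per-flow hash can leave some queues shared and others idle. `zlib.crc32` stands in for the kernel's hash, the flow tuples are made up, and the share calculation assumes every flow is backlogged and that flows sharing a queue split its bandwidth evenly.

```python
import zlib
from collections import defaultdict

NUM_QUEUES = 8  # matches the 8-queue NICs used in the evaluation

def hash_queue(flow, num_queues=NUM_QUEUES):
    """Static per-flow hash; crc32 is a stand-in for the real hash function."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_queues

def per_flow_share(flows, num_queues=NUM_QUEUES):
    """Line-rate fraction per flow when active queues are served equally."""
    queues = defaultdict(list)
    for flow in flows:
        queues[hash_queue(flow, num_queues)].append(flow)
    active = len(queues)
    return {flow: 1.0 / active / len(members)
            for members in queues.values() for flow in members}

# Hypothetical flows: same endpoints, different source ports.
flows = [("10.0.0.1", 40000 + i, "10.0.0.2", 5001, "tcp") for i in range(6)]
shares = per_flow_share(flows)
```

Whenever two flows land in the same queue, each gets a smaller share than a flow that has a queue to itself, even though free queues may exist.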


Solution: Dynamic Queue Assignment (DQA)

• OS assigns a weight to each flow
• DQA picks the queue with the lowest occupancy when a flow starts
• Queue occupancies are updated:
  • Any time a flow starts enqueuing data
  • Any time a flow has no enqueued bytes (at most each TX interrupt)

[Figure: F1, F2, and F3 placed across TXQ-1, TXQ-2, and TXQ-3 in front of the NIC packet scheduler.]


[Figure: with DQA, F1, F2, and F3 are spread across TXQ-1, TXQ-2, and TXQ-3, and the wire schedule becomes F1 F3 F2 F1 F3 F2 F1 F3 F2.]
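
The DQA bullets can be sketched as a small bookkeeping class. This is an illustration under assumptions, not the Titan kernel code: the class and method names are hypothetical, and occupancy is taken to be the sum of the weights of the flows currently in a queue.

```python
class DynamicQueueAssignment:
    def __init__(self, num_queues):
        self.occupancy = [0] * num_queues  # summed weight of active flows per queue
        self.flow_queue = {}               # flow id -> (queue index, weight)

    def flow_start(self, flow, weight=1):
        """A flow starts enqueuing data: pick the least-occupied queue."""
        if flow in self.flow_queue:
            return self.flow_queue[flow][0]
        queue = min(range(len(self.occupancy)), key=self.occupancy.__getitem__)
        self.occupancy[queue] += weight
        self.flow_queue[flow] = (queue, weight)
        return queue

    def flow_idle(self, flow):
        """A flow has no enqueued bytes: release its share of the queue."""
        queue, weight = self.flow_queue.pop(flow)
        self.occupancy[queue] -= weight

dqa = DynamicQueueAssignment(num_queues=3)
placed = [dqa.flow_start(f) for f in ("F1", "F2", "F3")]  # one flow per queue
q4 = dqa.flow_start("F4")   # every queue is occupied, so one must be shared
dqa.flow_idle("F2")         # F2 drains, freeing its queue
q5 = dqa.flow_start("F5")   # the new flow takes the freed queue
```

Three flows on three queues never collide here, unlike with a static hash; a fourth flow still has to share a queue, which is the oversubscribed case.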

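Problem: Asymmetric Oversubscription

[Figure: F1 and F2 share one transmit queue while F3 and F4 each have their own; all three queues have equal weight (W:1 W:1 W:1). Wire schedule: F1 F3 F4 F1 F3 F4 F2 F3 F4 F2 F3 F4.]

F1 and F2 each receive half the throughput of F3 and F4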
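
The arithmetic behind this slide can be checked with exact fractions. The sketch assumes all flows are backlogged, that the scheduler splits the line rate among queues in proportion to their weights, and that flows sharing a queue split that queue's share evenly; `flow_shares` is a hypothetical helper.

```python
from fractions import Fraction

def flow_shares(queues, weights):
    """queues: list of flow-name lists; weights: per-queue DRR weights."""
    total = sum(w for flows, w in zip(queues, weights) if flows)
    shares = {}
    for flows, weight in zip(queues, weights):
        for flow in flows:
            # Queue share, divided evenly among the flows in that queue.
            shares[flow] = Fraction(weight, total) / len(flows)
    return shares

queues = [["F1", "F2"], ["F3"], ["F4"]]
equal = flow_shares(queues, [1, 1, 1])
```

With equal queue weights each queue gets 1/3 of the line rate, so F1 and F2 get 1/6 each while F3 and F4 get 1/3 each.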


Solution: Dynamic Queue Weight Assignment (DQWA)

• OS assigns weights to flows
• OS updates the NIC scheduler with queue occupancies as flows start and stop (at most each TX interrupt)
• NIC updates DRR weights

[Figure: the OS calls ndo_set_tx_weight so that the queue shared by F1 and F2 gets weight 2 and the queues holding F3 and F4 get weight 1 (W:2 W:1 W:1).]

This is implementable on existing commodity NICs because it only needs to update DRR weights!
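
A sketch of the DQWA update, assuming a queue's weight is set to the sum of the weights of its flows. `FakeNic.set_tx_weight` is a hypothetical stand-in for the ndo_set_tx_weight driver hook, whose real signature is not shown in the slides; the share calculation also assumes backlogged flows that split a queue in proportion to their weights.

```python
from fractions import Fraction

class FakeNic:
    """Stand-in for a NIC whose per-queue DRR weights can be updated."""
    def __init__(self, num_queues):
        self.weights = [1] * num_queues

    def set_tx_weight(self, queue, weight):
        # Hypothetical analogue of the ndo_set_tx_weight driver hook.
        self.weights[queue] = weight

def update_queue_weights(nic, queue_flows, flow_weight):
    """Set each non-empty queue's weight to the sum of its flows' weights."""
    for queue, flows in enumerate(queue_flows):
        if flows:
            nic.set_tx_weight(queue, sum(flow_weight[f] for f in flows))

def flow_shares(nic, queue_flows, flow_weight):
    """Line-rate fraction per flow under the NIC's current queue weights."""
    active = [q for q, flows in enumerate(queue_flows) if flows]
    total = sum(nic.weights[q] for q in active)
    shares = {}
    for q in active:
        in_queue = sum(flow_weight[f] for f in queue_flows[q])
        for f in queue_flows[q]:
            shares[f] = Fraction(nic.weights[q], total) * Fraction(flow_weight[f], in_queue)
    return shares

queue_flows = [["F1", "F2"], ["F3"], ["F4"]]
flow_weight = {"F1": 1, "F2": 1, "F3": 1, "F4": 1}
nic = FakeNic(3)
before = flow_shares(nic, queue_flows, flow_weight)
update_queue_weights(nic, queue_flows, flow_weight)
after = flow_shares(nic, queue_flows, flow_weight)
```

Before the update F1 and F2 get 1/6 each; after the weights become 2, 1, 1, all four flows get 1/4.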


[Figure: with DQWA weights W:2 W:1 W:1 applied through ndo_set_tx_weight, the wire schedule becomes F1 F3 F4 F1 F2 F3 F4 F2.]

DQA and DQWA provide long-term fairness

Problem: TSO Unfairness

[Figure: the same three queues with weights W:2 W:1 W:1. Wire schedule: F1 F3 F4 F1 F2 F3 F4 F2 (short-term unfairness).]

• Short-term unfairness can cause bursts of congestion in the network
• Short-term unfairness can increase latency

Problem: TSO Unfairness

Solution: Dynamic Segmentation Offload Sizing (DSOS)

[Figure: the same three queues with weights W:2 W:1 W:1 and smaller segments for F1 and F2. Wire schedule: F1 F3 F4 F2 F1 F3 F4 F2.]

• DSOS dynamically changes the segment size during oversubscription
  • Same implementation as GSO
• CPU vs fairness tradeoff
  • Segmenting after the TCP/IP stack reduces CPU costs
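
A sketch of the DSOS idea, assuming a simple policy: keep 64KB segments when a flow has a queue to itself and cut them to 16KB (the size used in the evaluation) when the queue is shared. The trigger condition and function names are assumptions; the slides only say that the segment size changes during oversubscription.

```python
FULL_TSO = 64 * 1024    # 64KB segments, as on the TSO slide
SMALL_TSO = 16 * 1024   # 16KB, the DSOS size used in the evaluation

def segment_size(flows_in_queue):
    """Assumed policy: shrink segments only when the queue is shared."""
    return SMALL_TSO if flows_in_queue > 1 else FULL_TSO

def resegment(segment_len, flows_in_queue):
    """Split one stack-sized segment into the chunks handed to the NIC."""
    size = segment_size(flows_in_queue)
    return [min(size, segment_len - off) for off in range(0, segment_len, size)]

alone = resegment(64 * 1024, flows_in_queue=1)   # segment passes through whole
shared = resegment(64 * 1024, flows_in_queue=2)  # split into 16KB chunks
partial = resegment(40000, flows_in_queue=2)     # last chunk is smaller
```

Smaller chunks let the scheduler interleave competing flows more finely, at the cost of more per-segment work for the CPU.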

Implementation

• DQA, DQWA, and DSOS are implemented in Linux 4.4.6
• Support for ndo_set_tx_weight is implemented in the Intel ixgbe driver for the Intel 82599 10Gbps NIC
• Titan is open source!

https://github.com/bestephe/titan

Evaluation

• Microbenchmarks
  • 2 servers, 1 switch
  • 8 queue NICs
  • Vary number of flows (level of oversubscription)
• Incremental fairness benefits of DQA, DQWA, and DSOS
  • DQA and DQWA: expected to improve long-term fairness
  • DSOS: expected to improve short-term fairness

Evaluation – Fairness Metric

Metrics:
• Normalized fairness metric (NFM) inspired by Shreedhar and Varghese:
  • NFM = 0 is fair
  • NFM > 1 is very unfair

[Figure: two wire schedules.
Ideal packet schedule: F3 F1 F2 F1 F3 F2 F1 F3 F2 (NFM = 0)
Unfair packet schedule: F1 F3 F2 F1 F2 F2 F2 F3 (NFM = 1)]

NFM = (Bytes(MaxFlow) – Bytes(MinFlow)) / Bytes(FairShare)
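
The NFM formula is short enough to compute directly. The sketch assumes equal-weight flows, so Bytes(FairShare) is taken to be the total bytes divided by the number of flows; the byte counts in the example are made up.

```python
def nfm(bytes_per_flow):
    """NFM = (Bytes(MaxFlow) - Bytes(MinFlow)) / Bytes(FairShare)."""
    sent = list(bytes_per_flow.values())
    fair_share = sum(sent) / len(sent)  # assumption: equal-weight flows
    return (max(sent) - min(sent)) / fair_share

fair = nfm({"F1": 100, "F2": 100, "F3": 100})
unfair = nfm({"F1": 100, "F2": 300, "F3": 200})
```

Equal byte counts give NFM = 0, and a gap between the largest and smallest flow equal to one fair share gives NFM = 1.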

Microbenchmarks – 1s Timescale

[Figure: NFM over 1s intervals (y-axis, 0 to 2.5) versus number of flows (x-axis: 6, 12, 24, 48) for Linux, DQA, DQA+DQWA, and DQA+DQWA+DSOS (16KB).]

• Linux is unfair at all subscription levels
• DQA often significantly improves fairness
  • At 48 flows, flow churn prevents DQA from evenly spreading flows
• DQWA improves fairness when DQA cannot evenly spread flows across queues
• DSOS does not have a significant impact on long-term fairness

Microbenchmarks – 1ms Timescale

[Figure: NFM over 1ms intervals (y-axis, 0 to 6) versus number of flows (x-axis: 6, 12, 24, 48) for Linux, DQA, DQA+DQWA, and DQA+DQWA+DSOS (16KB).]

• At short timescales and under oversubscription, DQA and DQWA do not significantly improve fairness
  • TSO is the primary cause of unfairness
• DSOS (16KB) often reduces unfairness by >2x

Cluster Experiments

CDF of completion times in a 1GB all-to-all shuffle (24 servers)

[Figure: CDFs of flow completion time (x-axis: Flow Completion Time (s); y-axis: Cumulative Probability) in three panels, (a) 6 servers, (b) 12 servers, (c) 24 servers, comparing Vanilla, Vanilla (Cmax), and Titan.]

Titan improves fairness without changing the network core!

• Ideal CDF would be a vertical line
• Titan makes performance more predictable
• Titan improves tail performance (>90th percentile)

Additional Evaluation

Additional performance metrics:
• Throughput: line-rate
• Latency: no significant change
• CPU Utilization:
  • DQA and DQWA: increase <10%
  • DSOS is better than statically decreasing the TSO size
  • DSOS motivates creating a better TSO implementation (zero-copy)

Linux network configuration trade-off study
• See paper

Summary

• Multiqueue NICs can lead to significant flow-level unfairness
• Titan significantly improves fairness by allowing the OS to dynamically interact with the NIC packet scheduler
• Titan is implementable on commodity NICs!

https://github.com/bestephe/titan
