6.888: Lecture 4 Data Center Load Balancing
Transcript of lecture slides.
Motivation

DC networks need large bisection bandwidth for distributed apps (big data, HPC, web services, etc.).

[Figure: two topologies, each serving 1000s of server ports.]
• Single-rooted tree (Core / Agg / Access): high oversubscription.
• Multi-rooted tree [Fat-tree, Leaf-Spine, …] (Spine / Leaf): full bisection bandwidth, achieved via multipathing.
Multi-rooted ≠ Ideal DC Network

Ideal DC network: one big output-queued switch with 1000s of server ports.
• No internal bottlenecks → predictable.
• Simplifies BW management [BW guarantees, QoS, …].

We can't build that switch, but a multi-rooted tree approximates it. The tree has possible internal bottlenecks, so we need efficient load balancing.
Today: ECMP Load Balancing

Pick among equal-cost paths by a hash of the 5-tuple.
• Randomized load balancing.
• Preserves packet order.

Problems:
• Hash collisions (coarse granularity).
• Local & stateless (bad with asymmetry; e.g., due to link failures).

[Figure: a flow hashed onto one of three equal-cost paths, H(f) % 3 = 0.]
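A minimal sketch of the idea, assuming a toy CRC hash over the 5-tuple (the function and field names are illustrative, not any real switch API):

import zlib

def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, num_paths):
    # Hash the 5-tuple so that every packet of a flow picks the same
    # path, which preserves packet order within the flow.
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % num_paths

# Distinct flows can still collide on the same path (coarse granularity):
print(ecmp_next_hop("10.0.0.1", "10.0.1.1", 6, 5000, 80, 3))
print(ecmp_next_hop("10.0.0.2", "10.0.1.2", 6, 5001, 80, 3))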
Solution Landscape

In-network:
• Centralized [Hedera, Planck, Fastpass, …]
• Distributed, congestion-oblivious [ECMP, WCMP, packet-spray, …]
• Distributed, congestion-aware [Flare, TeXCP, CONGA, DeTail, HULA, …]

Host-based:
• Congestion-oblivious [Presto]
• Congestion-aware [MPTCP, FlowBender, …]
What problem is MPTCP trying to solve?

Multipath "pools" links: two separate links = a pool of links.
TCP controls how a link is shared. How should a pool be shared?
Application: Multihomed web server

[Figure: two 100Mb/s links. 2 TCPs use one link, 4 TCPs use the other, and 2 MPTCPs can use both; every flow gets 25Mb/s.]

The total capacity, 200Mb/s, is shared out evenly between all 8 flows.
Application: Multihomed web server

[Figure: as before, but with 3 MPTCPs; every flow gets 22Mb/s.]

The total capacity, 200Mb/s, is shared out evenly between all 9 flows. It's as if they were all sharing a single 200Mb/s link. The two links can be said to form a 200Mb/s pool.
Application: Multihomed web server

[Figure: as before, but with 4 MPTCPs; every flow gets 20Mb/s.]

The total capacity, 200Mb/s, is shared out evenly between all 10 flows. It's as if they were all sharing a single 200Mb/s link. The two links can be said to form a 200Mb/s pool.
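The per-flow rates in these three examples are just the pooled capacity divided by the number of competing flows:

\[
\text{rate per flow} = \frac{C_{\text{pool}}}{N}: \qquad \frac{200}{8} = 25, \quad \frac{200}{9} \approx 22, \quad \frac{200}{10} = 20 \ \text{Mb/s}.
\]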
Application: WiFi & cellular together

How should your phone balance its traffic across very different paths?
• WiFi path: high loss, small RTT.
• 3G path: low loss, high RTT.
What is the MPTCP protocol?

MPTCP is a replacement for TCP which lets you use multiple paths simultaneously.

[Figure: protocol stack. The user-space socket API sits on top of MPTCP, which runs one TCP subflow per path over IP; the multihomed host has addr1 and addr2, its peer a single addr.]

The sender stripes packets across paths. The receiver puts the packets in the correct order.
What is the MPTCP protocol? (continued)

[Figure: the same stack, but each host has a single addr; the two subflows use ports p1 and p2 through a switch with port-based routing, so different ports take different paths.]

Again the sender stripes packets across paths, and the receiver puts the packets in the correct order.
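A minimal sketch of the receiver side, assuming in-order delivery within each subflow and a data-level sequence number on every packet (the class and its interface are illustrative):

import heapq

class ReorderBuffer:
    """Puts packets arriving over multiple subflows back in order."""
    def __init__(self):
        self.next_seq = 0  # next data-level sequence number to deliver
        self.pending = []  # out-of-order packets, a min-heap keyed by seq

    def receive(self, seq, payload):
        heapq.heappush(self.pending, (seq, payload))
        delivered = []
        # Deliver the longest in-order prefix we now hold.
        while self.pending and self.pending[0][0] == self.next_seq:
            _, p = heapq.heappop(self.pending)
            delivered.append(p)
            self.next_seq += 1
        return delivered

buf = ReorderBuffer()
print(buf.receive(1, "b"))  # []         : packet 0 still missing
print(buf.receive(0, "a"))  # ['a', 'b'] : gap filled, both delivered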
Design goal 1: Multipath TCP should be fair to regular TCP at shared bottlenecks

To be fair, Multipath TCP should take as much capacity as TCP at a bottleneck link, no matter how many paths it is using.

Strawman solution: run "½ TCP" on each path.

[Figure: a multipath TCP flow with two subflows sharing a bottleneck link with a regular TCP flow.]
Design goal 2: MPTCP should use efficient paths

[Figure: three flows on a triangle of 12Mb/s links.] Each flow has a choice of a 1-hop and a 2-hop path. How should it split its traffic?
If each flow split its traffic 1:1 between its 1-hop and 2-hop paths, each flow would get 8Mb/s on the 12Mb/s links.
If each flow split its traffic 2:1, each flow would get 9Mb/s.
If each flow split its traffic 4:1, each flow would get 10Mb/s.
If each flow split its traffic ∞:1, i.e. sent everything on its 1-hop path, each flow would get 12Mb/s and every link would be fully used.
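The progression above can be checked with a little algebra, under a symmetric reading of the figure: each flow sends rate x on its 1-hop path and rate y on its 2-hop path, and each link then carries one flow's 1-hop traffic plus the 2-hop traffic of the other two flows:

\[
x + 2y = 12, \quad r = \frac{x}{y} \;\Rightarrow\; \text{throughput} = x + y = \frac{12\,(r+1)}{r+2},
\]

which gives 8, 9, 10, and 12 Mb/s for r = 1, 2, 4, ∞, matching the slides.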
[Figure: all traffic on the 1-hop paths; each 12Mb/s link is fully used.]

Theoretical solution (Kelly + Voice 2005; Han, Towsley et al. 2006): MPTCP should send all its traffic on its least-congested paths. Theorem: this leads to the most efficient allocation possible, given a network topology and a set of available paths.
Design goal 3: MPTCP should be fair compared to TCP

Design goal 2 says to send all your traffic on the least-congested path, in this case 3G. But this has high RTT, hence it will give low throughput.

[Figure: WiFi path: high loss, small RTT; 3G path: low loss, high RTT.]

Goal 3a. A Multipath TCP user should get at least as much throughput as a single-path TCP would on the best of the available paths.
Goal 3b. A Multipath TCP flow should take no more capacity on any link than a single-path TCP would.
Design goals

Goal 1. Be fair to TCP at bottleneck links.
Goal 2. Use efficient paths…
Goal 3. …as much as we can, while being fair to TCP.
Goal 4. Adapt quickly when congestion changes.
Goal 5. Don't oscillate.

How does MPTCP achieve all this? Read: "Design, implementation, and evaluation of congestion control for multipath TCP", NSDI 2011.
How does TCP congestion control work?

Maintain a congestion window w.
• Increase w for each ACK, by 1/w.
• Decrease w for each drop, by w/2.
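A toy model of this AIMD rule (window measured in packets; not a real stack):

def tcp_on_ack(w):
    # Additive increase: +1/w per ACK adds up to +1 packet per RTT.
    return w + 1.0 / w

def tcp_on_drop(w):
    # Multiplicative decrease: halve the window on a drop.
    return w / 2.0

w = 10.0
w = tcp_on_ack(w)   # 10.1
w = tcp_on_drop(w)  # 5.05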
How does MPTCP congestion control work?

Maintain a congestion window w_r, one window for each path, where r ∈ R ranges over the set of available paths.
• Increase w_r for each ACK on path r, by min(a/w_total, 1/w_r), where w_total is the sum of the windows over all paths (the coupled increase of the NSDI 2011 paper; a controls aggressiveness).
• Decrease w_r for each drop on path r, by w_r/2.
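A sketch of the coupled update, following the min(a/w_total, 1/w_r) rule above (this toy version treats a as a constant and ignores the paper's RTT-dependent computation of a):

def mptcp_on_ack(windows, r, a=1.0):
    # Coupled increase: a subflow on a more-congested path keeps a
    # small window, so traffic shifts toward less-congested paths,
    # while the min() caps aggressiveness at plain TCP's 1/w_r.
    w_total = sum(windows.values())
    windows[r] += min(a / w_total, 1.0 / windows[r])

def mptcp_on_drop(windows, r):
    # Per-path multiplicative decrease, exactly as in regular TCP.
    windows[r] /= 2.0

windows = {"wifi": 10.0, "3g": 4.0}
mptcp_on_ack(windows, "wifi")
mptcp_on_drop(windows, "3g")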
What You Said

Ravi: "An interesting point in the MPTCP paper is that they target a 'sweet spot' where there is a fair amount of traffic but the core is neither overloaded nor underloaded."
What You Said

Hongzi: "The paper talked a bit of 'probing' to see if a link has high load and pick some other links, and there are some specific ways of assigning randomized assignment of subflows on links. I was wondering, does the 'power of 2 choices' have some role to play here?"
MPTCP discovers available capacity, and it doesn't need much path choice.

If each node-pair balances its traffic over 8 paths, chosen at random, then utilization is around 90% of optimal.

[Figure: throughput (% of optimal) vs. number of paths, for FatTrees of 128 and 8192 nodes. Simulations of FatTree, 100Mb/s links, permutation traffic matrix, one flow per host, TCP+ECMP versus MPTCP.]
MPTCP discovers available capacity, and it shares it out more fairly than TCP+ECMP.

[Figure: throughput (% of optimal) vs. flow rank, for FatTrees of 128 and 8192 nodes; same simulation setup as above.]
MPTCP can make good path choices, as good as a very fast centralized scheduler.

Simulation of FatTree with 128 hosts:
• Permutation traffic matrix
• Closed-loop flow arrivals (one flow finishes, another starts)
• Flow size distributions from the VL2 dataset

[Figure: throughput (% of optimal), Hedera first-fit heuristic vs. MPTCP.]
MPTCP permits flexible topologies

Because an MPTCP flow shifts its traffic onto its least-congested paths, congestion hotspots are made to "diffuse" throughout the network. Non-adaptive congestion control, on the other hand, does not cope well with non-homogeneous topologies.

[Figure: average throughput (% of optimal) vs. rank of flow. Simulation of a 128-node FatTree when one of the 1Gb/s core links is cut to 100Mb/s.]
MPTCP permits flexible topologies

• At low loads, there are few collisions and NICs are saturated, so TCP ≈ MPTCP.
• At high loads, the core is severely congested and TCP can fully exploit all the core links, so TCP ≈ MPTCP.
• When the core is "right-provisioned", i.e. just saturated, MPTCP > TCP.

[Figure: ratio of throughputs, MPTCP/TCP, vs. connections per host. Simulation of a FatTree-like topology with 512 nodes, but with 4 hosts for every up-link from a top-of-rack switch, i.e. the core is oversubscribed 4:1. Permutation TM: each host sends to one other, each host receives from one other. Random TM: each host sends to one other, each host may receive from any number.]
MPTCP permits flexible topologies

If only 50% of hosts are active, you'd like each host to be able to send at 2Gb/s, faster than one NIC can support.

[Figure: FatTree vs. dual-homed FatTree, each with 5 ports per host in total and 1Gb/s bisection bandwidth; in the dual-homed variant each host has two 1Gb/s NICs.]
Solution Landscape (recap: same taxonomy as above). We now move to the host-based, congestion-oblivious corner: Presto.
Is congestion-aware load balancing overkill for data centers?

We've already seen a congestion-oblivious load-balancing scheme: ECMP. Presto is fine-grained congestion-oblivious load balancing. The key challenge is making this practical.
Key Design Decisions

• Use the software edge: no changes to transport (e.g., inside VMs) or switches.
• LB granularity: flowcells [e.g., 64KB of data]. Works with TSO hardware offload; no reordering for mice; makes dealing with reordering simpler (why?).
• "Fix" reordering at the GRO layer: avoids high per-packet processing (esp. at 10Gb/s and above).
• End-to-end path control.
Presto at a High Level

[Figure: leaf-spine fabric; on the sender, TCP/IP → vSwitch → NIC; on the receiver, NIC → vSwitch → TCP/IP.]

• Near uniform-sized data units.
• Proactively distributed evenly over the symmetric network by the vSwitch sender.
• The receiver masks packet reordering due to multipathing below the transport layer.
What You Said

Arman: "From an (information) theoretic perspective, order should not be such a troubling phenomenon, yet in real networks ordering is so important. How practical are 'rateless codes' (network coding, raptor codes, etc.) in alleviating this problem?"

Amy: "The main complexities in the paper stem from the requirement that servers use TSO and GRO to achieve high throughput. It is surprising to me that people still rely so heavily on TSO and GRO. Why doesn't someone build a multi-core TCP stack that can process individual packets at line rate in software?"
Presto LB Granularity

Presto load-balances on flowcells. What is a flowcell?
• A set of TCP segments with a bounded byte count.
• The bound is the maximal TCP Segmentation Offload (TSO) size, to maximize the benefit of TSO at high speed: 64KB in the implementation.

What's TSO? TCP/IP hands the NIC one large segment; the NIC does segmentation and checksum offload and emits MTU-sized Ethernet frames.
Example: TCP segments of 25KB, 30KB, and 30KB arrive in order. The first two fit under the 64KB bound, but adding the third would exceed it, so the flowcell closes at 55KB and the third segment starts a new flowcell.
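A minimal sketch of this chunking rule, assuming a 64KB bound and simple round-robin assignment of flowcells to paths (Presto's actual end-to-end path control via shadow MACs is more involved; names here are illustrative):

FLOWCELL_BOUND = 64 * 1024

def make_flowcells(segment_sizes, num_paths):
    """Group TCP segments into flowcells; assign each cell a path."""
    cells, cur, path = [], 0, 0
    for size in segment_sizes:
        if cur and cur + size > FLOWCELL_BOUND:
            # The next segment would exceed the bound: close this cell.
            cells.append((cur, path))
            cur, path = 0, (path + 1) % num_paths
        cur += size
    if cur:
        cells.append((cur, path))
    return cells

# Segments of 25KB, 30KB, 30KB -> a 55KB flowcell, then a 30KB one.
print(make_flowcells([25 * 1024, 30 * 1024, 30 * 1024], num_paths=4))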
Intro to GRO

[Figure: the NIC delivers MTU-sized packets; GRO merges P1–P5 into one large TCP segment and pushes it up to TCP/IP.]

Large TCP segments are pushed up at the end of a batched IO event (i.e., a polling event).
Merging packets in GRO creates fewer segments and avoids using substantially more cycles at TCP/IP and above [Menon, ATC '08]. If GRO is disabled: ~6Gbps with 100% CPU usage of one core.
Reordering Challenges

[Figure: packets arrive at the NIC out of order: P1–P3, P6, P4, P7, P5, P8, P9.]

GRO is designed to be fast and simple; it pushes up the existing segment immediately when 1) there is a gap in sequence numbers, 2) the MSS is reached, or 3) a timeout fires.
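A toy model of these flush rules (a simplification of real GRO; the timeout case is not modeled):

MERGE_LIMIT = 64 * 1024  # max bytes to coalesce into one segment

class ToyGro:
    """Merge in-order packets; flush on a sequence gap or size limit."""
    def __init__(self):
        self.start, self.length = 0, 0

    def receive(self, seq, size):
        """Feed one packet; return segments pushed up to TCP/IP."""
        pushed = []
        if self.length and seq != self.start + self.length:
            # 1) Gap in sequence numbers: push up what we have so far.
            pushed.append((self.start, self.length))
            self.length = 0
        if self.length == 0:
            self.start = seq
        self.length += size
        if self.length >= MERGE_LIMIT:
            # 2) Size limit reached: push up immediately.
            pushed.append((self.start, self.length))
            self.length = 0
        return pushed

gro = ToyGro()
for seq in (0, 1500, 4500):        # the packet at seq 3000 is missing
    print(gro.receive(seq, 1500))  # the gap at 4500 flushes (0, 3000)

With heavily reordered input, the gap rule fires on almost every packet, so only small segments reach TCP/IP; that is exactly the failure mode on the next slide.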
Reordering Challenges

Under heavy reordering, GRO is effectively disabled: lots of small packets are pushed up to TCP/IP.
• Huge CPU processing overhead.
• Poor TCP performance due to massive reordering.

[Figure: testbed with 40G links.]
Handling Asymmetry

Handling asymmetry optimally needs traffic awareness.

[Figure: two 40G paths, one already carrying 30G of UDP. A traffic-aware sender splits a TCP flow 35G / 5G across the two paths, so the TCP flow achieves its full 40G.]