Download - 2.5 Reliable Transmission - GitHub Pages · The Sliding Window Algorithm The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum

2.5 Reliable Transmission

As we saw in the previous section, frames are sometimes corrupted while in transit, with an error code like CRC used to detect such errors. While some error codes are strong enough also to correct errors, in practice the overhead is typically too large to handle the range of bit and burst errors that can be introduced on a network link. Even when error-correcting codes are used (e.g., on wireless links) some errors will be too severe to be corrected. As a result, some corrupt frames must be discarded. A link-level protocol that wants to deliver frames reliably must somehow recover from these discarded (lost) frames.

It's worth noting that reliability is a function that may be provided at the link level, but many modern link technologies omit this function. Furthermore, reliable delivery is frequently provided at higher levels, including the transport layer and, sometimes, the application layer. Exactly where it should be provided is a matter of some debate and depends on many factors. We describe the basics of reliable delivery here, since the principles are common across layers, but you should be aware that we're not just talking about a link-layer function.

Reliable delivery is usually accomplished using a combination of two fundamental mechanisms: acknowledgments and timeouts. An acknowledgment (ACK for short) is a small control frame that a protocol sends back to its peer saying that it has received an earlier frame. By control frame we mean a header without any data, although a protocol can piggyback an ACK on a data frame it just happens to be sending in the opposite direction. The receipt of an acknowledgment indicates to the sender of the original frame that its frame was successfully delivered. If the sender does not receive an acknowledgment after a reasonable amount of time, then it retransmits the original frame. This action of waiting a reasonable amount of time is called a timeout.

The general strategy of using acknowledgments and timeouts to implement reliable delivery is sometimes called automatic repeat request (abbreviated ARQ). This section describes three different ARQ algorithms using generic language; that is, we do not give detailed information about a particular protocol's header fields.

Stop-and-Wait

The simplest ARQ scheme is the stop-and-wait algorithm. The idea of stop-and-wait is straightforward: After transmitting one frame, the sender waits for an acknowledgment before transmitting the next frame. If the acknowledgment does not arrive after a certain period of time, the sender times out and retransmits the original frame.

Figure 1. Timeline showing four different scenarios for the stop-and-wait algorithm. (a) The ACK is received before the timer expires; (b) the original frame is lost; (c) the ACK is lost; (d) the timeout fires too soon.

Figure 1 illustrates timelines for four different scenarios that result from this basic algorithm. The sending side is represented on the left, the receiving side is depicted on the right, and time flows from top to bottom. Figure 1(a) shows the situation in which the ACK is received before the timer expires; (b) and (c) show the situation in which the original frame and the ACK, respectively, are lost; and (d) shows the situation in which the timeout fires too soon. Recall that by "lost" we mean that the frame was corrupted while in transit, that this corruption was detected by an error code on the receiver, and that the frame was subsequently discarded.

The packet timelines shown in this section are examples of a frequently used tool in teaching, explaining, and designing protocols. They are useful because they capture visually the behavior over time of a distributed system, something that can be quite hard to analyze. When designing a protocol, you often have to be prepared for the unexpected: a system crashes, a message gets lost, or something that you expected to happen quickly turns out to take a long time. These sorts of diagrams can often help us understand what might go wrong in such cases and thus help a protocol designer be prepared for every eventuality.

There is one important subtlety in the stop-and-wait algorithm. Suppose the sender sends a frame and the receiver acknowledges it, but the acknowledgment is either lost or delayed in arriving. This situation is illustrated in timelines (c) and (d) of Figure 1. In both cases, the sender times out and retransmits the original frame, but the receiver will think that it is the next frame, since it correctly received and acknowledged the first frame. This has the potential to cause duplicate copies of a frame to be delivered. To address this problem, the header for a stop-and-wait protocol usually includes a 1-bit sequence number (that is, the sequence number can take on the values 0 and 1), and the sequence numbers used for each frame alternate, as illustrated in Figure 2. Thus, when the sender retransmits frame 0, the receiver can determine that it is seeing a second copy of frame 0 rather than the first copy of frame 1 and therefore can ignore it (the receiver still acknowledges it, in case the first ACK was lost).
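The receiver's duplicate check can be sketched in a few lines of C. This is only an illustration of the 1-bit scheme described above, not code from any particular protocol; the type SawReceiver and the function saw_receive are hypothetical names introduced here.

```c
#include <stdint.h>

/* Receiver state for stop-and-wait with a 1-bit sequence number.
 * (Hypothetical names; an illustrative sketch only.) */
typedef struct {
    uint8_t  expected;   /* sequence bit (0 or 1) of the next new frame */
    unsigned delivered;  /* frames passed up to the higher layer so far */
} SawReceiver;

/*
 * Handle an arriving frame and return the sequence number to acknowledge.
 * A frame carrying the expected bit is new: deliver it and flip the bit.
 * A frame carrying the other bit is a retransmission: it is not delivered
 * again, but it is still acknowledged in case the earlier ACK was lost.
 */
static uint8_t saw_receive(SawReceiver *r, uint8_t seq)
{
    if (seq == r->expected) {
        r->delivered++;
        r->expected ^= 1;
    }
    return seq;
}
```

If frame 0 arrives twice because its ACK was lost, the second call acknowledges it again but leaves the delivered count unchanged.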

Figure 2. Timeline for stop-and-wait with 1-bit sequence number.

The main shortcoming of the stop-and-wait algorithm is that it allows the sender to have only one outstanding frame on the link at a time, and this may be far below the link's capacity. Consider, for example, a 1.5-Mbps link with a 45-ms round-trip time. This link has a delay × bandwidth product of 67.5 Kb, or approximately 8 KB. Since the sender can send only one frame per RTT, and assuming a frame size of 1 KB, this implies a maximum sending rate of

Bits-Per-Frame / Time-Per-Frame = 1024 × 8 / 0.045 = 182 kbps

or about one-eighth of the link's capacity. To use the link fully, then, we'd like the sender to be able to transmit up to eight frames before having to wait for an acknowledgment.
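The arithmetic above is easy to reproduce. The following sketch uses two hypothetical helper functions (the names are ours, not the text's) to compute the delay × bandwidth product and the stop-and-wait ceiling of one frame per RTT.

```c
/* Delay x bandwidth product, in bits, for a link of the given
 * bandwidth (bits per second) and round-trip time (seconds). */
static double delay_bw_bits(double bandwidth_bps, double rtt_s)
{
    return bandwidth_bps * rtt_s;
}

/* Maximum stop-and-wait sending rate in bits per second:
 * exactly one frame of frame_bytes bytes can be sent per RTT. */
static double stop_and_wait_bps(double frame_bytes, double rtt_s)
{
    return frame_bytes * 8.0 / rtt_s;
}
```

For the example link, delay_bw_bits(1.5e6, 0.045) gives 67,500 bits and stop_and_wait_bps(1024, 0.045) gives roughly 182,000 bits per second, about one-eighth of 1.5 Mbps.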

The significance of the delay × bandwidth product is that it represents the amount of data that could be in transit. We would like to be able to send this much data without waiting for the first acknowledgment. The principle at work here is often referred to as keeping the pipe full. The algorithms presented in the following two subsections do exactly this.

Sliding Window

Consider again the scenario in which the link has a delay × bandwidth product of 8 KB and frames are 1 KB in size. We would like the sender to be ready to transmit the ninth frame at pretty much the same moment that the ACK for the first frame arrives. The algorithm that allows us to do this is called sliding window, and an illustrative timeline is given in Figure 3.

Figure 3. Timeline for the sliding window algorithm.

The Sliding Window Algorithm

The sliding window algorithm works as follows. First, the sender assigns a sequence number, denoted SeqNum, to each frame. For now, let's ignore the fact that SeqNum is implemented by a finite-size header field and instead assume that it can grow infinitely large. The sender maintains three variables: The send window size, denoted SWS, gives the upper bound on the number of outstanding (unacknowledged) frames that the sender can transmit; LAR denotes the sequence number of the last acknowledgment received; and LFS denotes the sequence number of the last frame sent. The sender also maintains the following invariant:

LFS - LAR <= SWS

This situation is illustrated in Figure 4.

Figure 4. Sliding window on sender.

When an acknowledgment arrives, the sender moves LAR to the right, thereby allowing the sender to transmit another frame. Also, the sender associates a timer with each frame it transmits, and it retransmits the frame should the timer expire before an ACK is received. Notice that the sender has to be willing to buffer up to SWS frames since it must be prepared to retransmit them until they are acknowledged.
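As a rough sketch of the sender's bookkeeping, the fragment below tracks LAR and LFS with unbounded counters (matching the simplifying assumption above) and enforces the invariant LFS - LAR <= SWS. The names SwSender, sw_can_send, sw_send, and sw_ack are hypothetical, the window size of 4 is an arbitrary choice for illustration, and timers and frame buffering are omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define SWS 4  /* send window size; an arbitrary value for this sketch */

typedef struct {
    unsigned long LAR;  /* sequence number of last acknowledgment received */
    unsigned long LFS;  /* sequence number of last frame sent */
} SwSender;

/* The sender may transmit only while fewer than SWS frames are outstanding. */
static bool sw_can_send(const SwSender *s)
{
    return s->LFS - s->LAR < SWS;
}

/* Assign the next sequence number; the caller must check sw_can_send first. */
static unsigned long sw_send(SwSender *s)
{
    assert(sw_can_send(s));
    return ++s->LFS;
}

/* Process a cumulative ACK: slide LAR right, ignoring stale or bogus ACKs. */
static void sw_ack(SwSender *s, unsigned long ack)
{
    if (ack > s->LAR && ack <= s->LFS)
        s->LAR = ack;
}
```

After four calls to sw_send the window is full and sw_can_send returns false; an ACK for frame 2 reopens it for two more frames.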

The receiver maintains the following three variables: The receive window size, denoted RWS, gives the upper bound on the number of out-of-order frames that the receiver is willing to accept; LAF denotes the sequence number of the largest acceptable frame; and LFR denotes the sequence number of the last frame received. The receiver also maintains the following invariant:

LAF - LFR <= RWS

This situation is illustrated in Figure 5.

Figure 5. Sliding window on receiver.

When a frame with sequence number SeqNum arrives, the receiver takes the following action. If SeqNum <= LFR or SeqNum > LAF, then the frame is outside the receiver's window and it is discarded. If LFR < SeqNum <= LAF, then the frame is within the receiver's window and it is accepted. Now the receiver needs to decide whether or not to send an ACK. Let SeqNumToAck denote the largest sequence number not yet acknowledged, such that all frames with sequence numbers less than or equal to SeqNumToAck have been received. The receiver acknowledges the receipt of SeqNumToAck, even if higher numbered packets have been received. This acknowledgment is said to be cumulative. It then sets LFR = SeqNumToAck and adjusts LAF = LFR + RWS.

For example, suppose LFR = 5 (i.e., the last ACK the receiver sent was for sequence number 5), and RWS = 4. This implies that LAF = 9. Should frames 7 and 8 arrive, they will be buffered because they are within the receiver's window. However, no ACK needs to be sent since frame 6 has yet to arrive. Frames 7 and 8 are said to have arrived out of order. (Technically, the receiver could resend an ACK for frame 5 when frames 7 and 8 arrive.) Should frame 6 then arrive (perhaps it is late because it was lost the first time and had to be retransmitted, or perhaps it was simply delayed), the receiver acknowledges frame 8, bumps LFR to 8, and sets LAF to 12. If frame 6 was in fact lost, then a timeout will have occurred at the sender, causing it to retransmit frame 6.
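The receiver's rule and the example above can be traced with a small sketch. It assumes unbounded sequence numbers, records only whether a frame is present (not its contents), and always reports the current cumulative ACK value, which corresponds to the optional duplicate-ACK behavior mentioned in the parenthetical. SwReceiver and sw_receive are hypothetical names.

```c
#include <stdbool.h>

#define RWS 4  /* receive window size, as in the example */

typedef struct {
    unsigned long LFR;    /* last frame received in order */
    bool buffered[RWS];   /* buffered[n % RWS]: frame n is held out of order */
} SwReceiver;

/*
 * Process an arriving frame. Returns true if the frame fell inside the
 * window (LFR < seq <= LAF) and was accepted. *ack is set to the value a
 * cumulative ACK would carry after processing, which is the updated LFR.
 */
static bool sw_receive(SwReceiver *r, unsigned long seq, unsigned long *ack)
{
    unsigned long LAF = r->LFR + RWS;   /* largest acceptable frame */
    bool accepted = (seq > r->LFR && seq <= LAF);

    if (accepted) {
        r->buffered[seq % RWS] = true;
        /* Slide past every consecutive frame that is now present. */
        while (r->buffered[(r->LFR + 1) % RWS]) {
            r->buffered[(r->LFR + 1) % RWS] = false;
            r->LFR++;
        }
    }
    *ack = r->LFR;
    return accepted;
}
```

Starting from LFR = 5, frames 7 and 8 are accepted but leave the ACK value at 5; the arrival of frame 6 then advances LFR to 8, so the next acceptable range becomes 9 through 12.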

Although it is unlikely that a packet could be delayed on a point-to-point link, this same algorithm is used on multi-hop connections where such delays are possible.

We observe that when a timeout occurs, the amount of data in transit decreases, since the sender is unable to advance its window until frame 6 is acknowledged. This means that when packet losses occur, this scheme is no longer keeping the pipe full. The longer it takes to notice that a packet loss has occurred, the more severe this problem becomes.

Notice that, in this example, the receiver could have sent a negative acknowledgment (NAK) for frame 6 as soon as frame 7 arrived. However, this is unnecessary since the sender's timeout mechanism is sufficient to catch this situation, and sending NAKs adds complexity to the receiver. Also, as we mentioned, it would have been legitimate to send additional acknowledgments of frame 5 when frames 7 and 8 arrived; in some cases, a sender can use duplicate ACKs as a clue that a frame was lost. Both approaches help to improve performance by allowing early detection of packet losses.

Yet another variation on this scheme would be to use selective acknowledgments. That is, the receiver could acknowledge exactly those frames it has received rather than just the highest numbered frame received in order. So, in the above example, the receiver could acknowledge the receipt of frames 7 and 8. Giving more information to the sender makes it potentially easier for the sender to keep the pipe full but adds complexity to the implementation.

The sending window size is selected according to how many frames we want to have outstanding on the link at a given time; SWS is easy to compute for a given delay × bandwidth product. On the other hand, the receiver can set RWS to whatever it wants. Two common settings are RWS = 1, which implies that the receiver will not buffer any frames that arrive out of order, and RWS = SWS, which implies that the receiver can buffer any of the frames the sender transmits. It makes no sense to set RWS > SWS since it's impossible for more than SWS frames to arrive out of order.

Finite Sequence Numbers and Sliding Window

We now return to the one simplification we introduced into the algorithm: our assumption that sequence numbers can grow infinitely large. In practice, of course, a frame's sequence number is specified in a header field of some finite size. For example, a 3-bit field means that there are eight possible sequence numbers, 0..7. This makes it necessary to reuse sequence numbers or, stated another way, sequence numbers wrap around. This introduces the problem of being able to distinguish between different incarnations of the same sequence numbers, which implies that the number of possible sequence numbers must be larger than the number of outstanding frames allowed. For example, stop-and-wait allowed one outstanding frame at a time and had two distinct sequence numbers.

Suppose we have one more number in our space of sequence numbers than we have potentially outstanding frames; that is, SWS <= MaxSeqNum - 1, where MaxSeqNum is the number of available sequence numbers. Is this sufficient? The answer depends on RWS. If RWS = 1, then MaxSeqNum >= SWS + 1 is sufficient. If RWS is equal to SWS, then having a MaxSeqNum just one greater than the sending window size is not good enough. To see this, consider the situation in which we have the eight sequence numbers 0 through 7, and SWS = RWS = 7. Suppose the sender transmits frames 0..6, they are successfully received, but the ACKs are lost. The receiver is now expecting frames 7, 0..5, but the sender times out and sends frames 0..6. Unfortunately, the receiver is expecting the second incarnation of frames 0..5 but gets the first incarnation of these frames. This is exactly the situation we wanted to avoid.

It turns out that the sending window size can be no more than half as big as the number of available sequence numbers when RWS = SWS, or stated more precisely,

SWS < (MaxSeqNum + 1) / 2

Intuitively, what this is saying is that the sliding window protocol alternates between the two halves of the sequence number space, just as stop-and-wait alternates between sequence numbers 0 and 1. The only difference is that it continually slides between the two halves rather than discretely alternating between them.
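The rule for RWS = SWS can be checked mechanically. In the sketch below, sws_is_safe evaluates the inequality SWS < (MaxSeqNum + 1) / 2, and windows_overlap replays the failure scenario described above: all frames in the first window arrive, every ACK is lost, and the sender retransmits 0..SWS-1 while the receiver expects the next SWS sequence numbers. Both function names are hypothetical.

```c
#include <stdbool.h>

/* True if a send window of sws frames is safe when RWS == SWS and
 * max_seq_num sequence numbers are available. The inequality
 * SWS < (MaxSeqNum + 1) / 2 is rewritten as 2 * SWS < MaxSeqNum + 1
 * so that integer division cannot truncate the right-hand side. */
static bool sws_is_safe(unsigned sws, unsigned max_seq_num)
{
    return 2 * sws < max_seq_num + 1;
}

/* Worst case from the text: frames 0..sws-1 were all received but every
 * ACK was lost. The receiver now expects frames sws..2*sws-1 (modulo
 * max_seq_num), while the sender retransmits 0..sws-1. Returns true if a
 * retransmitted number would be mistaken for a new frame. */
static bool windows_overlap(unsigned sws, unsigned max_seq_num)
{
    for (unsigned k = sws; k < 2 * sws; k++)
        if (k % max_seq_num < sws)
            return true;
    return false;
}
```

With eight sequence numbers, a window of 7 overlaps (the receiver expects 7, 0..5) while a window of 4 does not, in agreement with the inequality.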

Note that this rule is specific to the situation where RWS = SWS. We leave it as an exercise to determine the more general rule that works for arbitrary values of RWS and SWS. Also note that the relationship between the window size and the sequence number space depends on an assumption that is so obvious that it is easy to overlook, namely that frames are not reordered in transit. This cannot happen on a direct point-to-point link since there is no way for one frame to overtake another during transmission. However, we will see the sliding window algorithm used in different environments, and we will need to devise another rule.

Implementation of Sliding Window

The following routines illustrate how we might implement the sending and receiving sides of the sliding window algorithm. The routines are taken from a working protocol named, appropriately enough, Sliding Window Protocol (SWP). So as not to concern ourselves with the adjacent protocols in the protocol graph, we denote the protocol sitting above SWP as the high-level protocol (HLP) and the protocol sitting below SWP as the link-level protocol (LLP).

We start by defining a pair of data structures. First, the frame header is very simple: It contains a sequence number (SeqNum) and an acknowledgment number (AckNum). It also contains a Flags field that indicates whether the frame is an ACK or carries data.

    typedef u_char SwpSeqno;

    typedef struct {
        SwpSeqno   SeqNum;   /* sequence number of this frame */
        SwpSeqno   AckNum;   /* ack of received frame */
        u_char     Flags;    /* up to 8 bits worth of flags */
    } SwpHdr;

Next, the state of the sliding window algorithm has the following structure. For the sending side of the protocol, this state includes variables LAR and LFS, as described earlier in this section, as well as a queue that holds frames that have been transmitted but not yet acknowledged (sendQ). The sending state also includes a counting semaphore called sendWindowNotFull. We will see how this is used below, but generally a semaphore is a synchronization primitive that supports semWait and semSignal operations. Every invocation of semSignal increments the semaphore by 1, and every invocation of semWait decrements it by 1, with the calling process blocked (suspended) should decrementing the semaphore cause its value to become less than 0. A process that is blocked during its call to semWait will be allowed to resume as soon as enough semSignal operations have been performed to raise the value of the semaphore above 0.

For the receiving side of the protocol, the state includes the variable NFE. This is the next frame expected, the frame with a sequence number one more than the last frame received (LFR), described earlier in this section. There is also a queue that holds frames that have been received out of order (recvQ). Finally, although not shown, the sender and receiver sliding window sizes are defined by constants SWS and RWS, respectively.

    typedef struct {
        /* sender side state: */
        SwpSeqno    LAR;        /* seqno of last ACK received */
        SwpSeqno    LFS;        /* last frame sent */
        Semaphore   sendWindowNotFull;
        SwpHdr      hdr;        /* pre-initialized header */
        struct sendQ_slot {
            Event   timeout;    /* event associated with send-timeout */
            Msg     msg;
        }   sendQ[SWS];

        /* receiver side state: */
        SwpSeqno    NFE;        /* seqno of next frame expected */
        struct recvQ_slot {
            int     received;   /* is msg valid? */
            Msg     msg;
        }   recvQ[RWS];
    } SwpState;

The sending side of SWP is implemented by procedure sendSWP. This routine is rather simple. First, semWait causes this process to block on a semaphore until it is OK to send another frame. Once allowed to proceed, sendSWP sets the sequence number in the frame's header, saves a copy of the frame in the transmit queue (sendQ), schedules a timeout event to handle the case in which the frame is not acknowledged, and sends the frame to the next-lower-level protocol, which we denote as LINK.

One detail worth noting is the call to store_swp_hdr just before the call to msgAddHdr. This routine translates the C structure that holds the SWP header (state->hdr) into a byte string that can be safely attached to the front of the message (hbuf). This routine (not shown) must translate each integer field in the header into network byte order and remove any padding that the compiler has added to the C structure. Byte order is a non-trivial issue, but for now it is enough to assume that this routine places the most significant byte of a multibyte integer in the byte with the lowest address (that is, big-endian order).
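Since store_swp_hdr and load_swp_hdr are not shown, the following is only a guess at what a minimal version might look like, assuming a header length HLEN of 3 bytes and fixed-width stand-ins for the u_char fields. Because every field of SwpHdr is a single byte, no byte swapping is needed here; the routines simply copy each field to a fixed offset, which also sidesteps any compiler padding.

```c
#include <stdint.h>

#define HLEN 3  /* assumed header length: three one-byte fields */

typedef uint8_t SwpSeqno;

typedef struct {
    SwpSeqno SeqNum;  /* sequence number of this frame */
    SwpSeqno AckNum;  /* ack of received frame */
    uint8_t  Flags;   /* up to 8 bits worth of flags */
} SwpHdr;

/* Serialize the header into hbuf, one field per byte, in a fixed order. */
static void store_swp_hdr(SwpHdr hdr, char *hbuf)
{
    unsigned char *p = (unsigned char *)hbuf;
    p[0] = hdr.SeqNum;
    p[1] = hdr.AckNum;
    p[2] = hdr.Flags;
}

/* Inverse of store_swp_hdr: rebuild the C structure from the byte string. */
static void load_swp_hdr(SwpHdr *hdr, const char *hbuf)
{
    const unsigned char *p = (const unsigned char *)hbuf;
    hdr->SeqNum = p[0];
    hdr->AckNum = p[1];
    hdr->Flags  = p[2];
}
```

A header with wider integer fields would additionally need each field converted to network byte order before being copied out.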

Another piece of complexity in this routine is the use of semWait and the sendWindowNotFull semaphore. sendWindowNotFull is initialized to the size of the sender's sliding window, SWS (this initialization is not shown). Each time the sender transmits a frame, the semWait operation decrements this count and blocks the sender should the count go to 0. Each time an ACK is received, the semSignal operation invoked in deliverSWP (see below) increments this count, thus unblocking any waiting sender.

    static int
    sendSWP(SwpState *state, Msg *frame)
    {
        struct sendQ_slot *slot;
        char hbuf[HLEN];

        /* wait for send window to open */
        semWait(&state->sendWindowNotFull);
        state->hdr.SeqNum = ++state->LFS;
        slot = &state->sendQ[state->hdr.SeqNum % SWS];
        store_swp_hdr(state->hdr, hbuf);
        msgAddHdr(frame, hbuf, HLEN);
        msgSaveCopy(&slot->msg, frame);
        slot->timeout = evSchedule(swpTimeout, slot, SWP_SEND_TIMEOUT);
        return send(LINK, frame);
    }

Before continuing to the receive side of SWP, we need to reconcile a seeming inconsistency. On the one hand, we have been saying that a high-level protocol invokes the services of a low-level protocol by calling the send operation, so we would expect that a protocol that wants to send a message via SWP would call send(SWP, packet). On the other hand, the procedure that implements SWP's send operation is called sendSWP, and its first argument is a state variable (SwpState). What gives? The answer is that the operating system provides glue code that translates the generic call to send into a protocol-specific call to sendSWP. This glue code maps the first argument to send (the magic protocol variable SWP) into both a function pointer to sendSWP and a pointer to the protocol state that SWP needs to do its job. The reason we have the high-level protocol indirectly invoke the protocol-specific function through the generic function call is that we want to limit how much information the high-level protocol has coded in it about the low-level protocol. This makes it easier to change the protocol graph configuration at some time in the future.

Now we move on to SWP's protocol-specific implementation of the deliver operation, which is given in procedure deliverSWP. This routine actually handles two different kinds of incoming messages: ACKs for frames sent earlier from this node and data frames arriving at this node. In a sense, the ACK half of this routine is the counterpart to the sender side of the algorithm given in sendSWP. A decision as to whether the incoming message is an ACK or a data frame is made by checking the Flags field in the header. Note that this particular implementation does not support piggybacking ACKs on data frames.

When the incoming frame is an ACK, deliverSWP simply finds the slot in the transmit queue (sendQ) that corresponds to the ACK, cancels the timeout event, and frees the frame saved in that slot. This work is actually done in a loop since the ACK may be cumulative. The only other thing to notice about this case is the call to subroutine swpInWindow. This subroutine, which is given below, ensures that the sequence number for the frame being acknowledged is within the range of ACKs that the sender currently expects to receive.

When the incoming frame contains data, deliverSWP first calls msgStripHdr and load_swp_hdr to extract the header from the frame. Routine load_swp_hdr is the counterpart to store_swp_hdr discussed earlier; it translates a byte string into the C data structure that holds the SWP header. deliverSWP then calls swpInWindow to make sure the sequence number of the frame is within the range of sequence numbers that it expects. If it is, the routine loops over the set of consecutive frames it has received and passes them up to the higher-level protocol by invoking the deliverHLP routine. It also sends a cumulative ACK back to the sender, but does so by looping over the receive queue (it does not use the SeqNumToAck variable used in the prose description given earlier in this section).

    static int
    deliverSWP(SwpState *state, Msg *frame)
    {
        SwpHdr  hdr;
        char    *hbuf;

        hbuf = msgStripHdr(frame, HLEN);
        load_swp_hdr(&hdr, hbuf);
        if (hdr.Flags & FLAG_ACK_VALID)
        {
            /* received an acknowledgment: do SENDER side */
            if (swpInWindow(hdr.AckNum, state->LAR + 1, state->LFS))
            {
                do
                {
                    struct sendQ_slot *slot;

                    slot = &state->sendQ[++state->LAR % SWS];
                    evCancel(slot->timeout);
                    msgDestroy(&slot->msg);
                    semSignal(&state->sendWindowNotFull);
                } while (state->LAR != hdr.AckNum);
            }
        }

        if (hdr.Flags & FLAG_HAS_DATA)
        {
            struct recvQ_slot *slot;

            /* received data packet: do RECEIVER side */
            slot = &state->recvQ[hdr.SeqNum % RWS];
            if (!swpInWindow(hdr.SeqNum, state->NFE, state->NFE + RWS - 1))
            {
                /* drop the message */
                return SUCCESS;
            }
            msgSaveCopy(&slot->msg, frame);
            slot->received = TRUE;
            if (hdr.SeqNum == state->NFE)
            {
                Msg m;

                /* pass up every consecutive frame now available */
                while (slot->received)
                {
                    deliver(HLP, &slot->msg);
                    msgDestroy(&slot->msg);
                    slot->received = FALSE;
                    slot = &state->recvQ[++state->NFE % RWS];
                }
                /* send ACK: */
                prepare_ack(&m, state->NFE - 1);
                send(LINK, &m);
                msgDestroy(&m);
            }
        }
        return SUCCESS;
    }

Finally, swpInWindow is a simple subroutine that checks to see if a given sequence number falls between some minimum and maximum sequence number.

    static bool
    swpInWindow(SwpSeqno seqno, SwpSeqno min, SwpSeqno max)
    {
        SwpSeqno pos, maxpos;

        pos    = seqno - min;       /* pos *should* be in range [0..MAX) */
        maxpos = max - min + 1;     /* maxpos is in range [0..MAX] */
        return pos < maxpos;
    }
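The subtraction trick in swpInWindow relies on unsigned 8-bit arithmetic wrapping modulo 256, which is what lets the test work even when the window straddles the wraparound point. The standalone sketch below makes that explicit; it substitutes uint8_t for u_char and uses the hypothetical names SeqNo and in_window so it can be compiled on its own.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint8_t SeqNo;  /* stand-in for SwpSeqno (u_char) */

/* True if seqno lies in the window [min..max], where the window may wrap
 * around from 255 back to 0. The casts force the results back to 8 bits,
 * because C promotes uint8_t operands to int before subtracting. */
static bool in_window(SeqNo seqno, SeqNo min, SeqNo max)
{
    SeqNo pos    = (SeqNo)(seqno - min);    /* distance from window start */
    SeqNo maxpos = (SeqNo)(max - min + 1);  /* window length */
    return pos < maxpos;
}
```

For a window running from 254 to 2, sequence number 1 yields pos = 3 and maxpos = 5, so it is inside; sequence number 253 yields pos = 255 and is rejected.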

Frame Order and Flow Control

The sliding window protocol is perhaps the best known algorithm in computer networking. What is easily confusing about the algorithm, however, is that it can be used to serve three different roles. The first role is the one we have been concentrating on in this section: to reliably deliver frames across an unreliable link. (In general, the algorithm can be used to reliably deliver messages across an unreliable network.) This is the core function of the algorithm.

The second role that the sliding window algorithm can serve is to preserve the order in which frames are transmitted. This is easy to do at the receiver: since each frame has a sequence number, the receiver just makes sure that it does not pass a frame up to the next-higher-level protocol until it has already passed up all frames with a smaller sequence number. That is, the receiver buffers (i.e., does not pass along) out-of-order frames. The version of the sliding window algorithm described in this section does preserve frame order, although we could imagine a variation in which the receiver passes frames to the next protocol without waiting for all earlier frames to be delivered. A question we should ask ourselves is whether we really need the sliding window protocol to keep the frames in order at the link level, or whether, instead, this functionality should be implemented by a protocol higher in the stack.

The third role that the sliding window algorithm sometimes plays is to support flow control, a feedback mechanism by which the receiver is able to throttle the sender. Such a mechanism is used to keep the sender from over-running the receiver, that is, from transmitting more data than the receiver is able to process. This is usually accomplished by augmenting the sliding window protocol so that the receiver not only acknowledges frames it has received but also informs the sender of how many frames it has room to receive. The number of frames that the receiver is capable of receiving corresponds to how much free buffer space it has. As in the case of ordered delivery, we need to make sure that flow control is necessary at the link level before incorporating it into the sliding window protocol.

One important concept to take away from this discussion is the system design principle we call separation of concerns. That is, you must be careful to distinguish between different functions that are sometimes rolled together in one mechanism, and you must make sure that each function is necessary and being supported in the most effective way. In this particular case, reliable delivery, ordered delivery, and flow control are sometimes combined in a single sliding window protocol, and we should ask ourselves if this is the right thing to do at the link level.

Concurrent Logical Channels

The data link protocol used in the original ARPANET provides an interesting alternative to the sliding window protocol, in that it is able to keep the pipe full while still using the simple stop-and-wait algorithm. One important consequence of this approach is that the frames sent over a given link are not kept in any particular order. The protocol also implies nothing about flow control.

The idea underlying the ARPANET protocol, which we refer to as concurrent logical channels, is to multiplex several logical channels onto a single point-to-point link and to run the stop-and-wait algorithm on each of these logical channels. There is no relationship maintained among the frames sent on any of the logical channels, yet because a different frame can be outstanding on each of the several logical channels the sender can keep the link full.

More precisely, the sender keeps 3 bits of state for each channel: a boolean, saying whether the channel is currently busy; the 1-bit sequence number to use the next time a frame is sent on this logical channel; and the next sequence number to expect on a frame that arrives on this channel. When the node has a frame to send, it uses the lowest idle channel, and otherwise it behaves just like stop-and-wait.
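The per-channel state and the lowest-idle-channel rule can be sketched as follows. This is an illustration only, not the ARPANET implementation: the names Channel, clc_send, and clc_ack are hypothetical, the channel count of 8 is an assumption for the sketch, and timeouts and retransmission are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_CHANNELS 8  /* assumed number of logical channels */

/* The three bits of state kept for each logical channel. */
typedef struct {
    bool    busy;           /* is a frame outstanding on this channel? */
    uint8_t next_send_seq;  /* 1-bit sequence number for the next send */
    uint8_t next_recv_seq;  /* 1-bit sequence number expected on arrival */
} Channel;

/* Claim the lowest idle channel for a new frame. Returns the channel
 * index and stores the sequence bit to use in *seq_out, or returns -1
 * if every channel already has a frame outstanding. */
static int clc_send(Channel ch[NUM_CHANNELS], uint8_t *seq_out)
{
    for (int i = 0; i < NUM_CHANNELS; i++) {
        if (!ch[i].busy) {
            ch[i].busy = true;
            *seq_out = ch[i].next_send_seq;
            return i;
        }
    }
    return -1;
}

/* Handle an ACK on channel i: if it matches the outstanding frame, free
 * the channel and flip its sequence bit, exactly as in stop-and-wait. */
static void clc_ack(Channel ch[NUM_CHANNELS], int i, uint8_t seq)
{
    if (ch[i].busy && seq == ch[i].next_send_seq) {
        ch[i].busy = false;
        ch[i].next_send_seq ^= 1;
    }
}
```

Because each channel is acknowledged independently, a frame sent later on channel 1 may complete before an earlier frame on channel 0, which is why this scheme does not preserve frame order.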

In practice, the ARPANET supported 8 logical channels over each ground link and 16 over each satellite link. In the ground-link case, the header for each frame included a 3-bit channel number and a 1-bit sequence number, for a total of 4 bits. This is exactly the number of bits the sliding window protocol requires to support up to 8 outstanding frames on the link when RWS = SWS.

5.2 Reliable Byte Stream (TCP)

In contrast to a simple demultiplexing protocol like UDP, a more sophisticated transport protocol is one that offers a reliable, connection-oriented, byte-stream service. Such a service has proven useful to a wide assortment of applications because it frees the application from having to worry about missing or reordered data. The Internet's Transmission Control Protocol is probably the most widely used protocol of this type; it is also the most carefully tuned. It is for these two reasons that this section studies TCP in detail, although we identify and discuss alternative design choices at the end of the section.

In terms of the properties of transport protocols given in the problem statement at the start of this chapter, TCP guarantees the reliable, in-order delivery of a stream of bytes. It is a full-duplex protocol, meaning that each TCP connection supports a pair of byte streams, one flowing in each direction. It also includes a flow-control mechanism for each of these byte streams that allows the receiver to limit how much data the sender can transmit at a given time. Finally, like UDP, TCP supports a demultiplexing mechanism that allows multiple application programs on any given host to simultaneously carry on a conversation with their peers.

In addition to the above features, TCP also implements a highly tuned congestion-control mechanism. The idea of this mechanism is to throttle how fast TCP sends data, not for the sake of keeping the sender from over-running the receiver, but so as to keep the sender from overloading the network. A description of TCP's congestion-control mechanism is postponed until the next chapter, where we discuss it in the larger context of how network resources are fairly allocated.

Since many people confuse congestion control and flow control, we restate the difference. Flow control involves preventing senders from over-running the capacity of receivers. Congestion control involves preventing too much data from being injected into the network, thereby causing switches or links to become overloaded. Thus, flow control is an end-to-end issue, while congestion control is concerned with how hosts and networks interact.

End-to-End Issues

At the heart of TCP is the sliding window algorithm. Even though this is the same basic algorithm as is often used at the link level, because TCP runs over the Internet rather than a physical point-to-point link, there are many important differences. This subsection identifies these differences and explains how they complicate TCP. The following subsections then describe how TCP addresses these and other complications.

First, whereas the link-level sliding window algorithm presented runs over a single physical link that always connects the same two computers, TCP supports logical connections between processes that are running on any two computers in the Internet. This means that TCP needs an explicit connection establishment phase during which the two sides of the connection agree to exchange data with each other. This difference is analogous to having to dial up the other party, rather than having a dedicated phone line. TCP also has an explicit connection teardown phase. One of the things that happens during connection establishment is that the two parties establish some shared state to enable the sliding window algorithm to begin. Connection teardown is needed so each host knows it is OK to free this state.

Second, whereas a single physical link that always connects the same two computers has a fixed round-trip time (RTT), TCP connections are likely to have widely different round-trip times. For example, a TCP connection between a host in San Francisco and a host in Boston, which are separated by several thousand kilometers, might have an RTT of 100 ms, while a TCP connection between two hosts in the same room, only a few meters apart, might have an RTT of only 1 ms. The same TCP protocol must be able to support both of these connections. To make matters worse, the TCP connection between hosts in San Francisco and Boston might have an RTT of 100 ms at 3 a.m., but an RTT of 500 ms at 3 p.m. Variations in the RTT are even possible during a single TCP connection that lasts only a few minutes. What this means to the sliding window algorithm is that the timeout mechanism that triggers retransmissions must be adaptive. (Certainly, the timeout for a point-to-point link must be a settable parameter, but it is not necessary to adapt this timer for a particular pair of nodes.)

A third difference is that packets may be reordered as they cross the Internet, but this is not possible on a point-to-point link where the first packet put into one end of the link must be the first to appear at the other end. Packets that are slightly out of order do not cause a problem since the sliding window algorithm can reorder packets correctly using the sequence number. The real issue is how far out of order packets can get or, said another way, how late a packet can arrive at the destination. In the worst case, a packet can be delayed in the Internet until the IP time to live (TTL) field expires, at which time the packet is discarded (and hence there is no danger of it arriving late). Knowing that IP throws packets away after their TTL expires, TCP assumes that each packet has a maximum lifetime. The exact lifetime, known as the maximum segment lifetime (MSL), is an engineering choice. The current recommended setting is 120 seconds. Keep in mind that IP does not directly enforce this 120-second value; it is simply a conservative estimate that TCP makes of how long a packet might live in the Internet. The implication is significant: TCP has to be prepared for very old packets to suddenly show up at the receiver, potentially confusing the sliding window algorithm.

Fourth, the computers connected to a point-to-point link are generally engineered to support the link. For example, if a link's delay × bandwidth product is computed to be 8 KB (meaning that a window size is selected to allow up to 8 KB of data to be unacknowledged at a given time), then it is likely that the computers at either end of the link have the ability to buffer up to 8 KB of data. Designing the system otherwise would be silly. On the other hand, almost any kind of computer can be connected to the Internet, making the amount of resources dedicated to any one TCP connection highly variable, especially considering that any one host can potentially support hundreds of TCP connections at the same time. This means that TCP must include a mechanism that each side uses to "learn" what resources (e.g., how much buffer space) the other side is able to apply to the connection. This is the flow control issue.

Fifth, because the transmitting side of a directly connected link cannot send any faster than the bandwidth of the link allows, and only one host is pumping data into the link, it is not possible to unknowingly congest the link. Said another way, the load on the link is visible in the form of a queue of packets at the sender. In contrast, the sending side of a TCP connection has no idea what links will be traversed to reach the destination. For example, the sending machine might be directly connected to a relatively fast Ethernet, and capable of sending data at a rate of 10 Gbps, but somewhere out in the middle of the network, a 1.5-Mbps link must be traversed. And, to make matters worse, data being generated by many different sources might be trying to traverse this same slow link. This leads to the problem of network congestion. Discussion of this topic is delayed until the next chapter.

We conclude this discussion of end-to-end issues by comparing TCP's approach to providing a reliable/ordered delivery service with the approach used by virtual-circuit-based networks like the historically important X.25 network. In TCP, the underlying IP network is assumed to be unreliable and to deliver messages out of order; TCP uses the sliding window algorithm on an end-to-end basis to provide reliable/ordered delivery. In contrast, X.25 networks use the sliding window protocol within the network, on a hop-by-hop basis. The assumption behind this approach is that if messages are delivered reliably and in order between each pair of nodes along the path between the source host and the destination host, then the end-to-end service also guarantees reliable/ordered delivery.

The problem with this latter approach is that a sequence of hop-by-hop guarantees does not necessarily add up to an end-to-end guarantee. First, if a heterogeneous link (say, an Ethernet) is added to one end of the path, then there is no guarantee that this hop will preserve the same service as the other hops. Second, just because the sliding window protocol guarantees that messages are delivered correctly from node A to node B, and then from node B to node C, it does not guarantee that node B behaves perfectly. For example, network nodes have been known to introduce errors into messages while transferring them from an input buffer to an output buffer. They have also been known to accidentally reorder messages. As a consequence of these small windows of vulnerability, it is still necessary to provide true end-to-end checks to guarantee reliable/ordered service, even though the lower levels of the system also implement that functionality.

This discussion serves to illustrate one of the most important principles in system design: the end-to-end argument. In a nutshell, the end-to-end argument says that a function (in our example, providing reliable/ordered delivery) should not be provided in the lower levels of the system unless it can be completely and correctly implemented at that level. Therefore, this rule argues in favor of the TCP/IP approach. This rule is not absolute, however. It does allow for functions to be incompletely provided at a low level as a performance optimization. This is why it is perfectly consistent with the end-to-end argument to perform error detection (e.g., CRC) on a hop-by-hop basis; detecting and retransmitting a single corrupt packet across one hop is preferable to having to retransmit an entire file end-to-end.

Segment Format

TCP is a byte-oriented protocol, which means that the sender writes bytes into a TCP connection and the receiver reads bytes out of the TCP connection. Although "byte stream" describes the service TCP offers to application processes, TCP does not, itself, transmit individual bytes over the Internet. Instead, TCP on the source host buffers enough bytes from the sending process to fill a reasonably sized packet and then sends this packet to its peer on the destination host. TCP on the destination host then empties the contents of the packet into a receive buffer, and the receiving process reads from this buffer at its leisure. This situation is illustrated in Figure 1, which, for simplicity, shows data flowing in only one direction. Remember that, in general, a single TCP connection supports byte streams flowing in both directions.

Figure 1. How TCP manages a byte stream.

The packets exchanged between TCP peers in Figure 1 are called segments, since each one carries a segment of the byte stream. Each TCP segment contains the header schematically depicted in Figure 2. The relevance of most of these fields will become apparent throughout this section. For now, we simply introduce them.

Figure 2. TCP header format.

The SrcPort and DstPort fields identify the source and destination ports, respectively, just as in UDP. These two fields, plus the source and destination IP addresses, combine to uniquely identify each TCP connection. That is, TCP's demux key is given by the 4-tuple

(SrcPort, SrcIPAddr, DstPort, DstIPAddr)

Note that because TCP connections come and go, it is possible for a connection between a particular pair of ports to be established, used to send and receive data, and closed, and then at a later time for the same pair of ports to be involved in a second connection. We sometimes refer to this situation as two different incarnations of the same connection.
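
As a minimal sketch of how the 4-tuple might serve as a demux key, the following Python fragment keeps a table of connections indexed by that tuple. The names DemuxKey, connections, and deliver are hypothetical, and the addresses are documentation-only examples; a real TCP implementation would store per-connection state rather than a label.

```python
from typing import Dict, NamedTuple, Optional


class DemuxKey(NamedTuple):
    """The 4-tuple (SrcPort, SrcIPAddr, DstPort, DstIPAddr) from the text."""
    src_port: int
    src_ip: str
    dst_port: int
    dst_ip: str


# Hypothetical connection table: one entry per open connection.
connections: Dict[DemuxKey, str] = {}


def deliver(key: DemuxKey) -> Optional[str]:
    """Return the connection an arriving segment belongs to, if any."""
    return connections.get(key)


key = DemuxKey(49152, "192.0.2.1", 80, "198.51.100.7")
connections[key] = "connection A"
```

Two segments that differ in any one of the four fields map to different table entries, which is why all four are needed to identify a connection.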

The Acknowledgement, SequenceNum, and AdvertisedWindow fields are all involved in TCP's sliding window algorithm. Because TCP is a byte-oriented protocol, each byte of data has a sequence number. The SequenceNum field contains the sequence number for the first byte of data carried in that segment, and the Acknowledgement and AdvertisedWindow fields carry information about the flow of data going in the other direction. To simplify our discussion, we ignore the fact that data can flow in both directions, and we concentrate on data that has a particular SequenceNum flowing in one direction and Acknowledgement and AdvertisedWindow values flowing in the opposite direction, as illustrated in Figure 3. The use of these three fields is described more fully later in this chapter.

Figure 3. Simplified illustration (showing only one direction) of the TCP process, with data flow in one direction and ACKs in the other.

The 6-bit Flags field is used to relay control information between TCP peers. The possible flags include SYN, FIN, RESET, PUSH, URG, and ACK. The SYN and FIN flags are used when establishing and terminating a TCP connection, respectively. Their use is described in a later section. The ACK flag is set any time the Acknowledgement field is valid, implying that the receiver should pay attention to it. The URG flag signifies that this segment contains urgent data. When this flag is set, the UrgPtr field indicates where the nonurgent data contained in this segment begins. The urgent data is contained at the front of the segment body, up to and including a value of UrgPtr bytes into the segment. The PUSH flag signifies that the sender invoked the push operation, which indicates to the receiving side of TCP that it should notify the receiving process of this fact. We discuss these last two features more in a later section. Finally, the RESET flag signifies that the receiver has become confused (for example, because it received a segment it did not expect to receive) and so wants to abort the connection.

Finally, the Checksum field is used in exactly the same way as for UDP: it is computed over the TCP header, the TCP data, and the pseudoheader, which is made up of the source address, destination address, and length fields from the IP header. The checksum is required for TCP in both IPv4 and IPv6. Also, since the TCP header is of variable length (options can be attached after the mandatory fields), a HdrLen field is included that gives the length of the header in 32-bit words. This field is also known as the Offset field, since it measures the offset from the start of the packet to the start of the data.
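
To make the field layout concrete, here is a sketch that unpacks the 20 mandatory header bytes with Python's standard struct module. The function name parse_tcp_header and the dictionary keys are our own choices, and the bit positions of the flags follow the standard TCP header layout rather than anything stated explicitly above. Options and checksum verification are ignored.

```python
import struct

# Flag bit positions in the low 6 bits of the 16-bit HdrLen/Flags word.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RESET": 0x04,
             "PUSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_tcp_header(segment: bytes) -> dict:
    """Unpack the 20-byte mandatory part of a TCP header (network byte order)."""
    (src_port, dst_port, seq, ack, off_flags,
     window, checksum, urg_ptr) = struct.unpack("!HHIIHHHH", segment[:20])
    hdr_len = off_flags >> 12  # header length in 32-bit words
    return {
        "SrcPort": src_port,
        "DstPort": dst_port,
        "SequenceNum": seq,
        "Acknowledgement": ack,
        "HdrLen": hdr_len,
        "DataOffsetBytes": hdr_len * 4,  # where the data starts
        "Flags": {name for name, bit in FLAG_BITS.items() if off_flags & bit},
        "AdvertisedWindow": window,
        "Checksum": checksum,
        "UrgPtr": urg_ptr,
    }


# A hand-built SYN segment with no options (HdrLen = 5 words) and a zero checksum.
syn = struct.pack("!HHIIHHHH", 49152, 80, 1000, 0,
                  (5 << 12) | FLAG_BITS["SYN"], 65535, 0, 0)
hdr = parse_tcp_header(syn)
```

Because HdrLen counts 32-bit words, a header with no options reports 5, and the data begins 20 bytes into the segment.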

Connection Establishment and Termination

A TCP connection begins with a client (caller) doing an active open to a server (callee). Assuming that the server had earlier done a passive open, the two sides engage in an exchange of messages to establish the connection. (Recall from Chapter 1 that a party wanting to initiate a connection performs an active open, while a party willing to accept a connection does a passive open.) Only after this connection establishment phase is over do the two sides begin sending data. Likewise, as soon as a participant is done sending data, it closes one direction of the connection, which causes TCP to initiate a round of connection termination messages. Notice that, while connection setup is an asymmetric activity (one side does a passive open and the other side does an active open), connection teardown is symmetric (each side has to close the connection independently). Therefore, it is possible for one side to have done a close, meaning that it can no longer send data, but for the other side to keep the other half of the bidirectional connection open and to continue sending data.

To be more precise, connection setup can be symmetric, with both sides trying to open the connection at the same time, but the common case is for one side to do an active open and the other side to do a passive open.

Three-Way Handshake

The algorithm used by TCP to establish and terminate a connection is called a three-way handshake. We first describe the basic algorithm and then show how it is used by TCP. The three-way handshake involves the exchange of three messages between the client and the server, as illustrated by the timeline given in Figure 4.

Figure 4. Timeline for three-way handshake algorithm.

The idea is that two parties want to agree on a set of parameters, which, in the case of opening a TCP connection, are the starting sequence numbers the two sides plan to use for their respective byte streams. In general, the parameters might be any facts that each side wants the other to know about. First, the client (the active participant) sends a segment to the server (the passive participant) stating the initial sequence number it plans to use (Flags = SYN, SequenceNum = x). The server then responds with a single segment that both acknowledges the client's sequence number (Flags = ACK, Ack = x + 1) and states its own beginning sequence number (Flags = SYN, SequenceNum = y). That is, both the SYN and ACK bits are set in the Flags field of this second message. Finally, the client responds with a third segment that acknowledges the server's sequence number (Flags = ACK, Ack = y + 1). The reason why each side acknowledges a sequence number that is one larger than the one sent is that the Acknowledgement field actually identifies the "next sequence number expected," thereby implicitly acknowledging all earlier sequence numbers. Although not shown in this timeline, a timer is scheduled for each of the first two segments, and if the expected response is not received the segment is retransmitted.

You may be asking yourself why the client and server have to exchange starting sequence numbers with each other at connection setup time. It would be simpler if each side simply started at some "well-known" sequence number, such as 0. In fact, the TCP specification requires that each side of a connection select an initial starting sequence number at random. The reason for this is to protect against two incarnations of the same connection reusing the same sequence numbers too soon, that is, while there is still a chance that a segment from an earlier incarnation of a connection might interfere with a later incarnation of the connection.
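
The three messages and their sequence-number arithmetic can be sketched as follows. This is an illustration only: the function three_way_handshake and the dictionary representation of a segment are hypothetical, there are no timers or retransmissions, and random.randrange merely stands in for whatever method an implementation uses to pick its initial sequence numbers.

```python
import random

SEQ_SPACE = 2 ** 32  # SequenceNum is a 32-bit field, so arithmetic wraps


def three_way_handshake(x: int, y: int) -> list:
    """Return the three segments exchanged for client ISN x and server ISN y."""
    syn = {"Flags": {"SYN"}, "SequenceNum": x}
    syn_ack = {"Flags": {"SYN", "ACK"},
               "SequenceNum": y,
               "Ack": (syn["SequenceNum"] + 1) % SEQ_SPACE}
    ack = {"Flags": {"ACK"},
           "Ack": (syn_ack["SequenceNum"] + 1) % SEQ_SPACE}
    return [syn, syn_ack, ack]


# Each side picks its starting sequence number at random.
x = random.randrange(SEQ_SPACE)
y = random.randrange(SEQ_SPACE)
syn, syn_ack, ack = three_way_handshake(x, y)
```

Each Ack value is one larger than the sequence number it acknowledges because the field names the next sequence number expected.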

State-Transition Diagram

TCP is complex enough that its specification includes a state-transition diagram. A copy of this diagram is given in Figure 5. This diagram shows only the states involved in opening a connection (everything above ESTABLISHED) and in closing a connection (everything below ESTABLISHED). Everything that goes on while a connection is open (that is, the operation of the sliding window algorithm) is hidden in the ESTABLISHED state.

Figure 5. TCP state-transition diagram.

TCP's state-transition diagram is fairly easy to understand. Each box denotes a state that one end of a TCP connection can find itself in. All connections start in the CLOSED state. As the connection progresses, the connection moves from state to state according to the arcs. Each arc is labeled with a tag of the form event/action. Thus, if a connection is in the LISTEN state and a SYN segment arrives (i.e., a segment with the SYN flag set), the connection makes a transition to the SYN_RCVD state and takes the action of replying with an ACK+SYN segment.

Notice that two kinds of events trigger a state transition: (1) a segment arrives from the peer (e.g., the event on the arc from LISTEN to SYN_RCVD), or (2) the local application process invokes an operation on TCP (e.g., the active open event on the arc from CLOSED to SYN_SENT). In other words, TCP's state-transition diagram effectively defines the semantics of both its peer-to-peer interface and its service interface. The syntax of these two interfaces is given by the segment format (as illustrated in Figure 2) and by some application programming interface, such as the socket API, respectively.

Now let's trace the typical transitions taken through the diagram in Figure 5. Keep in mind that at each end of the connection, TCP makes different transitions from state to state. When opening a connection, the server first invokes a passive open operation on TCP, which causes TCP to move to the LISTEN state. At some later time, the client does an active open, which causes its end of the connection to send a SYN segment to the server and to move to the SYN_SENT state. When the SYN segment arrives at the server, it moves to the SYN_RCVD state and responds with a SYN+ACK segment. The arrival of this segment causes the client to move to the ESTABLISHED state and to send an ACK back to the server. When this ACK arrives, the server finally moves to the ESTABLISHED state. In other words, we have just traced the three-way handshake.

There are three things to notice about the connection establishment half of the state-transition diagram. First, if the client's ACK to the server is lost, corresponding to the third leg of the three-way handshake, then the connection still functions correctly. This is because the client side is already in the ESTABLISHED state, so the local application process can start sending data to the other end. Each of these data segments will have the ACK flag set, and the correct value in the Acknowledgement field, so the server will move to the ESTABLISHED state when the first data segment arrives. This is actually an important point about TCP: every segment reports what sequence number the sender is expecting to see next, even if this repeats the same sequence number contained in one or more previous segments.

The second thing to notice about the state-transition diagram is that there is a funny transition out of the LISTEN state whenever the local process invokes a send operation on TCP. That is, it is possible for a passive participant to identify both ends of the connection (i.e., itself and the remote participant that it is willing to have connect to it), and then for it to change its mind about waiting for the other side and instead actively establish the connection. To the best of our knowledge, this is a feature of TCP that no application process actually takes advantage of.

The final thing to notice about the diagram is the arcs that are not shown. Specifically, most of the states that involve sending a segment to the other side also schedule a timeout that eventually causes the segment to be resent if the expected response does not happen. These retransmissions are not depicted in the state-transition diagram. If after several tries the expected response does not arrive, TCP gives up and returns to the CLOSED state.

Turning our attention now to the process of terminating a connection, the important thing to keep in mind is that the application process on both sides of the connection must independently close its half of the connection. If only one side closes the connection, then this means it has no more data to send, but it is still available to receive data from the other side. This complicates the state-transition diagram because it must account for the possibility that the two sides invoke the close operator at the same time, as well as the possibility that first one side invokes close and then, at some later time, the other side invokes close. Thus, on any one side there are three combinations of transitions that get a connection from the ESTABLISHED state to the CLOSED state:

This side closes first: ESTABLISHED → FIN_WAIT_1 → FIN_WAIT_2 → TIME_WAIT → CLOSED.

The other side closes first: ESTABLISHED → CLOSE_WAIT → LAST_ACK → CLOSED.

Both sides close at the same time: ESTABLISHED → FIN_WAIT_1 → CLOSING → TIME_WAIT → CLOSED.

There is actually a fourth, although rare, sequence of transitions that leads to the CLOSED state; it follows the arc from FIN_WAIT_1 to TIME_WAIT. We leave it as an exercise for you to figure out what combination of circumstances leads to this fourth possibility.
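
One way to experiment with these transitions is to encode the diagram as a lookup table keyed by (state, event). The sketch below is a simplification under stated assumptions: the event labels such as recv_FIN and timeout_2MSL are our own, retransmission timeouts are omitted, and only the arcs traced in the text are included (the LISTEN send arc and the fourth closing path are left out).

```python
# (current state, event) -> (next state, segment sent as the action, if any)
TRANSITIONS = {
    ("CLOSED", "passive_open"): ("LISTEN", None),
    ("CLOSED", "active_open"): ("SYN_SENT", "SYN"),
    ("LISTEN", "recv_SYN"): ("SYN_RCVD", "SYN+ACK"),
    ("SYN_SENT", "recv_SYN+ACK"): ("ESTABLISHED", "ACK"),
    ("SYN_RCVD", "recv_ACK"): ("ESTABLISHED", None),
    ("ESTABLISHED", "close"): ("FIN_WAIT_1", "FIN"),
    ("ESTABLISHED", "recv_FIN"): ("CLOSE_WAIT", "ACK"),
    ("FIN_WAIT_1", "recv_ACK"): ("FIN_WAIT_2", None),
    ("FIN_WAIT_1", "recv_FIN"): ("CLOSING", "ACK"),
    ("FIN_WAIT_2", "recv_FIN"): ("TIME_WAIT", "ACK"),
    ("CLOSING", "recv_ACK"): ("TIME_WAIT", None),
    ("TIME_WAIT", "timeout_2MSL"): ("CLOSED", None),
    ("CLOSE_WAIT", "close"): ("LAST_ACK", "FIN"),
    ("LAST_ACK", "recv_ACK"): ("CLOSED", None),
}


def trace(start: str, events: list) -> list:
    """Return the sequence of states visited when events occur in order."""
    states = [start]
    for event in events:
        next_state, _action = TRANSITIONS[(states[-1], event)]
        states.append(next_state)
    return states


# "This side closes first":
path = trace("ESTABLISHED", ["close", "recv_ACK", "recv_FIN", "timeout_2MSL"])
```

Feeding the events for each of the three closing scenarios into trace reproduces the three state sequences listed above; an event that has no arc from the current state raises a KeyError.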

The main thing to recognize about connection teardown is that a connection in the TIME_WAIT state cannot move to the CLOSED state until it has waited for two times the maximum amount of time an IP datagram might live in the Internet (i.e., twice the 120-second MSL). The reason for this is that, while the local side of the connection has sent an ACK in response to the other side's FIN segment, it does not know that the ACK was successfully delivered. As a consequence, the other side might retransmit its FIN segment, and this second FIN segment might be delayed in the network. If the connection were allowed to move directly to the CLOSED state, then another pair of application processes might come along and open the same connection (i.e., use the same pair of port numbers), and the delayed FIN segment from the earlier incarnation of the connection would immediately initiate the termination of the later incarnation of that connection.

Sliding Window Revisited

We are now ready to discuss TCP's variant of the sliding window algorithm, which serves several purposes: (1) it guarantees the reliable delivery of data, (2) it ensures that data is delivered in order, and (3) it enforces flow control between the sender and the receiver. TCP's use of the sliding window algorithm is the same as at the link level in the case of the first two of these three functions. Where TCP differs from the link-level algorithm is that it folds the flow-control function in as well. In particular, rather than having a fixed-size sliding window, the receiver advertises a window size to the sender. This is done using the AdvertisedWindow field in the TCP header. The sender is then limited to having no more than a value of AdvertisedWindow bytes of unacknowledged data at any given time. The receiver selects a suitable value for AdvertisedWindow based on the amount of memory allocated to the connection for the purpose of buffering data. The idea is to keep the sender from over-running the receiver's buffer. We discuss this at greater length below.

Reliable and Ordered Delivery

To see how the sending and receiving sides of TCP interact with each other to implement reliable and ordered delivery, consider the situation illustrated in Figure 6. TCP on the sending side maintains a send buffer. This buffer is used to store data that has been sent but not yet acknowledged, as well as data that has been written by the sending application but not transmitted. On the receiving side, TCP maintains a receive buffer. This buffer holds data that arrives out of order, as well as data that is in the correct order (i.e., there are no missing bytes earlier in the stream) but that the application process has not yet had the chance to read.

Figure 6. Relationship between TCP send buffer (a) and receive buffer (b).

To make the following discussion simpler to follow, we initially ignore the fact that both the buffers and the sequence numbers are of some finite size and hence will eventually wrap around. Also, we do not distinguish between a pointer into a buffer where a particular byte of data is stored and the sequence number for that byte.

Looking first at the sending side, three pointers are maintained into the send buffer, each with an obvious meaning: LastByteAcked, LastByteSent, and LastByteWritten. Clearly,

LastByteAcked <= LastByteSent

since the receiver cannot have acknowledged a byte that has not yet been sent, and

LastByteSent <= LastByteWritten

since TCP cannot send a byte that the application process has not yet written. Also note that none of the bytes to the left of LastByteAcked need to be saved in the buffer because they have already been acknowledged, and none of the bytes to the right of LastByteWritten need to be buffered because they have not yet been generated.

A similar set of pointers (sequence numbers) are maintained on the receiving side: LastByteRead, NextByteExpected, and LastByteRcvd. The inequalities are a little less intuitive, however, because of the problem of out-of-order delivery. The first relationship

LastByteRead < NextByteExpected

is true because a byte cannot be read by the application until it is received and all preceding bytes have also been received. NextByteExpected points to the byte immediately after the latest byte to meet this criterion. Second,

NextByteExpected <= LastByteRcvd + 1

since, if data has arrived in order, NextByteExpected points to the byte after LastByteRcvd, whereas if data has arrived out of order, then NextByteExpected points to the start of the first gap in the data, as in Figure 6. Note that bytes to the left of LastByteRead need not be buffered because they have already been read by the local application process, and bytes to the right of LastByteRcvd need not be buffered because they have not yet arrived.
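
The four inequalities can be written down directly as checks. In this sketch the function names sender_invariants_hold and receiver_invariants_hold are hypothetical, pointers are plain integers, and wraparound is ignored, as in the discussion above.

```python
def sender_invariants_hold(last_byte_acked: int,
                           last_byte_sent: int,
                           last_byte_written: int) -> bool:
    """LastByteAcked <= LastByteSent <= LastByteWritten."""
    return last_byte_acked <= last_byte_sent <= last_byte_written


def receiver_invariants_hold(last_byte_read: int,
                             next_byte_expected: int,
                             last_byte_rcvd: int) -> bool:
    """LastByteRead < NextByteExpected <= LastByteRcvd + 1."""
    return last_byte_read < next_byte_expected <= last_byte_rcvd + 1


# In-order arrival: bytes 1..500 received, so byte 501 is expected next.
in_order = receiver_invariants_hold(200, 501, 500)
# Out-of-order arrival: a gap starts at byte 301, yet byte 500 has arrived.
with_gap = receiver_invariants_hold(200, 301, 500)
```

Both receiver cases satisfy the inequalities; they differ only in whether NextByteExpected sits just past LastByteRcvd or at the start of a gap.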

Flow Control

Most of the above discussion is similar to that found in the standard sliding window algorithm; the only real difference is that this time we elaborated on the fact that the sending and receiving application processes are filling and emptying their local buffer, respectively. (The earlier discussion glossed over the fact that data arriving from an upstream node was filling the send buffer and data being transmitted to a downstream node was emptying the receive buffer.)

You should make sure you understand this much before proceeding because now comes the point where the two algorithms differ more significantly. In what follows, we reintroduce the fact that both buffers are of some finite size, denoted MaxSendBuffer and MaxRcvBuffer, although we don't worry about the details of how they are implemented. In other words, we are only interested in the number of bytes being buffered, not in where those bytes are actually stored.

Recall that in a sliding window protocol, the size of the window sets the amount of data that can be sent without waiting for acknowledgment from the receiver. Thus, the receiver throttles the sender by advertising a window that is no larger than the amount of data that it can buffer. Observe that TCP on the receive side must keep

LastByteRcvd - LastByteRead <= MaxRcvBuffer

to avoid overflowing its buffer. It therefore advertises a window size of

AdvertisedWindow = MaxRcvBuffer - ((NextByteExpected - 1) - LastByteRead)

which represents the amount of free space remaining in its buffer. As data arrives, the receiver acknowledges it as long as all the preceding bytes have also arrived. In addition, LastByteRcvd moves to the right (is incremented), meaning that the advertised window potentially shrinks. Whether or not it shrinks depends on how fast the local application process is consuming data. If the local process is reading data just as fast as it arrives (causing LastByteRead to be incremented at the same rate as LastByteRcvd), then the advertised window stays open (i.e., AdvertisedWindow = MaxRcvBuffer). If, however, the receiving process falls behind, perhaps because it performs a very expensive operation on each byte of data that it reads, then the advertised window grows smaller with every segment that arrives, until it eventually goes to 0.
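
A small calculation shows the advertised window shrinking and reopening. The function advertised_window simply evaluates the formula above; the 4096-byte buffer and the byte counts are made-up numbers chosen for illustration.

```python
def advertised_window(max_rcv_buffer: int,
                      next_byte_expected: int,
                      last_byte_read: int) -> int:
    """AdvertisedWindow = MaxRcvBuffer - ((NextByteExpected - 1) - LastByteRead)."""
    return max_rcv_buffer - ((next_byte_expected - 1) - last_byte_read)


MAX_RCV_BUFFER = 4096  # illustrative buffer size in bytes

# 1024 bytes have arrived in order and the application has read none of them.
slow_reader = advertised_window(MAX_RCV_BUFFER, 1025, 0)
# The application has read everything that arrived: the window stays fully open.
fast_reader = advertised_window(MAX_RCV_BUFFER, 1025, 1024)
# The buffer is full of unread data: the window closes.
stalled_reader = advertised_window(MAX_RCV_BUFFER, 4097, 0)
```

The three cases correspond to a receiver that lags, one that keeps up, and one that has stopped reading altogether.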

TCP on the send side must then adhere to the advertised window it gets from the receiver. This means that at any given time, it must ensure that

LastByteSent - LastByteAcked <= AdvertisedWindow

Said another way, the sender computes an effective window that limits how much data it can send:

EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked)

Clearly, EffectiveWindow must be greater than 0 before the source can send more data. It is possible, therefore, that a segment arrives acknowledging x bytes, thereby allowing the sender to increment LastByteAcked by x, but because the receiving process was not reading any data, the advertised window is now x bytes smaller than the time before. In such a situation, the sender would be able to free buffer space, but not to send any more data.

All the while this is going on, the send side must also make sure that the local application process does not overflow the send buffer, that is,

LastByteWritten - LastByteAcked <= MaxSendBuffer

If the sending process tries to write y bytes to TCP, but

(LastByteWritten - LastByteAcked) + y > MaxSendBuffer

then TCP blocks the sending process and does not allow it to generate more data.
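
The sender-side rules can be sketched in the same style. The functions effective_window and write_would_block are hypothetical names for the two formulas above, and the numbers are illustrative only.

```python
def effective_window(advertised_window: int,
                     last_byte_sent: int,
                     last_byte_acked: int) -> int:
    """EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked)."""
    return advertised_window - (last_byte_sent - last_byte_acked)


def write_would_block(last_byte_written: int,
                      last_byte_acked: int,
                      y: int,
                      max_send_buffer: int) -> bool:
    """True if writing y more bytes would overflow the send buffer."""
    return (last_byte_written - last_byte_acked) + y > max_send_buffer


# 3000 bytes are outstanding against a 3000-byte advertised window.
before_ack = effective_window(3000, last_byte_sent=3000, last_byte_acked=0)
# An ACK for x = 1000 bytes arrives, but the receiver read nothing,
# so the advertised window is also 1000 bytes smaller.
after_ack = effective_window(2000, last_byte_sent=3000, last_byte_acked=1000)
```

In this scenario the effective window is 0 both before and after the ACK: the sender can free 1000 bytes of buffer space but still cannot transmit.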

It is now possible to understand how a slow receiving process ultimately stops a fast sending process. First, the receive buffer fills up, which means the advertised window shrinks to 0. An advertised window of 0 means that the sending side cannot transmit any data, even though data it has previously sent has been successfully acknowledged. Finally, not being able to transmit any data means that the send buffer fills up, which ultimately causes TCP to block the sending process. As soon as the receiving process starts to read data again, the receive-side TCP is able to open its window back up, which allows the send-side TCP to transmit data out of its buffer. When this data is eventually acknowledged, LastByteAcked is incremented, the buffer space holding this acknowledged data becomes free, and the sending process is unblocked and allowed to proceed.

There is only one remaining detail that must be resolved: how does the sending side know that the advertised window is no longer 0? As mentioned above, TCP always sends a segment in response to a received data segment, and this response contains the latest values for the Acknowledgement and AdvertisedWindow fields, even if these values have not changed since the last time they were sent. The problem is this. Once the receive side has advertised a window size of 0, the sender is not permitted to send any more data, which means it has no way to discover that the advertised window is no longer 0 at some time in the future. TCP on the receive side does not spontaneously send nondata segments; it only sends them in response to an arriving data segment.

TCP deals with this situation as follows. Whenever the other side advertises a window size of 0, the sending side persists in sending a segment with 1 byte of data every so often. It knows that this data will probably not be accepted, but it tries anyway, because each of these 1-byte segments triggers a response that contains the current advertised window. Eventually, one of these 1-byte probes triggers a response that reports a nonzero advertised window.
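
The probing behavior can be modeled very roughly as a loop over the responses the receiver would give. This is a toy sketch: probe_until_open is a hypothetical helper, the list of window values stands in for real responses, and the spacing between probes is not modeled at all.

```python
from typing import Iterable, Optional, Tuple


def probe_until_open(responses: Iterable[int]) -> Optional[Tuple[int, int]]:
    """Send 1-byte probes until a response reports a nonzero window.

    responses yields the AdvertisedWindow carried by the reply to each
    successive probe. Returns (probes sent, window reported), or None if
    the window never opened within the responses given.
    """
    for probes_sent, window in enumerate(responses, start=1):
        if window > 0:
            return probes_sent, window
    return None


# The receiving process resumes reading just before the fourth probe.
result = probe_until_open([0, 0, 0, 2048])
```

The receiver never has to volunteer a window update; it only answers probes, which is the division of labor the text describes.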

Note that the reason the sending side periodically sends this probe segment is that TCP is designed to make the receive side as simple as possible: it simply responds to segments from the sender, and it never initiates any activity on its own. This is an example of a well-recognized (although not universally applied) protocol design rule, which, for lack of a better name, we call the smart sender/dumb receiver rule. Recall that we saw another example of this rule when we discussed the use of NAKs in the sliding window algorithm.

Protecting against Wraparound

This subsection and the next consider the size of the SequenceNum and AdvertisedWindow fields and the implications of their sizes on TCP's correctness and performance. TCP's SequenceNum field is 32 bits long, and its AdvertisedWindow field is 16 bits long, meaning that TCP has easily satisfied the requirement of the sliding window algorithm that the sequence number space be twice as big as the window size: $2^{32} \gg 2 \times 2^{16}$. However, this requirement is not the interesting thing about these two fields. Consider each field in turn.

The relevance of the 32-bit sequence number space is that the sequence number used on a given connection might wrap around: a byte with sequence number S could be sent at one time, and then at a later time a second byte with the same sequence number S might be sent. Once again, we assume that packets cannot survive in the Internet for longer than the recommended MSL. Thus, we currently need to make sure that the sequence number does not wrap around within a 120-second period of time. Whether or not this happens depends on how fast data can be transmitted over the Internet, that is, how fast the 32-bit sequence number space can be consumed. (This discussion assumes that we are trying to consume the sequence number space as fast as possible, but of course we will be if we are doing our job of keeping the pipe full.) Table 1 shows how long it takes for the sequence number to wrap around on networks with various bandwidths.

Bandwidth                   Time until Wraparound
T1 (1.5 Mbps)               6.4 hours
Ethernet (10 Mbps)          57 minutes
T3 (45 Mbps)                13 minutes
Fast Ethernet (100 Mbps)    6 minutes
OC-3 (155 Mbps)             4 minutes
OC-48 (2.5 Gbps)            14 seconds
OC-192 (10 Gbps)            3 seconds
10GigE (10 Gbps)            3 seconds

Table 1. Time Until 32-Bit Sequence Number Space Wraps Around.

As you can see, the 32-bit sequence number space is adequate at modest bandwidths, but given that OC-192 links are now common in the Internet backbone, and that most servers now come with 10Gig Ethernet (or 10 Gbps) interfaces, we're now well past the point where 32 bits is too small. Fortunately, the IETF has worked out an extension to TCP that effectively extends the sequence number space to protect against the sequence number wrapping around. This and related extensions are described in a later section.
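
The entries in Table 1 follow from dividing the size of the sequence number space, in bits, by the link bandwidth. The sketch below assumes nominal link rates (for example, exactly 10 Mbps for Ethernet) and one sequence number per byte; wraparound_seconds is a hypothetical helper name.

```python
MSL_SECONDS = 120  # recommended maximum segment lifetime from the text


def wraparound_seconds(bandwidth_bps: float, seq_bits: int = 32) -> float:
    """Time to send 2**seq_bits bytes (one sequence number per byte)."""
    return (2 ** seq_bits) * 8 / bandwidth_bps


t1_hours = wraparound_seconds(1.5e6) / 3600
ethernet_minutes = wraparound_seconds(10e6) / 60
ten_gig_seconds = wraparound_seconds(10e9)

# Wraparound is a concern once it can happen within one MSL.
oc48_unsafe = wraparound_seconds(2.5e9) < MSL_SECONDS
oc3_unsafe = wraparound_seconds(155e6) < MSL_SECONDS
```

With these inputs the T1 and Ethernet results round to the 6.4 hours and 57 minutes shown in the table, and the comparison against the MSL shows that an OC-48 link can wrap within 120 seconds while an OC-3 link cannot.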

Keeping the Pipe Full

The relevance of the 16-bit AdvertisedWindow field is that it must be big enough to allow the sender to keep the pipe full. Clearly, the receiver is free to not open the window as large as the AdvertisedWindow field allows; we are interested in the situation in which the receiver has enough buffer space to handle as much data as the largest possible AdvertisedWindow allows.

In this case, it is not just the network bandwidth but the delay × bandwidth product that dictates how big the AdvertisedWindow field needs to be: the window needs to be opened far enough to allow a full delay × bandwidth product's worth of data to be transmitted. Assuming an RTT of 100 ms (a typical number for a cross-country connection in the United States), Table 2 gives the delay × bandwidth product for several network technologies.

Bandwidth                   Delay × Bandwidth Product
T1 (1.5 Mbps)               18 KB
Ethernet (10 Mbps)          122 KB
T3 (45 Mbps)                549 KB
Fast Ethernet (100 Mbps)    1.2 MB
OC-3 (155 Mbps)             1.8 MB
OC-48 (2.5 Gbps)            29.6 MB
OC-192 (10 Gbps)            118.4 MB
10GigE (10 Gbps)            118.4 MB

Table 2. Required Window Size for 100-ms RTT.

As you can see, TCP's AdvertisedWindow field is in even worse shape than its SequenceNum field: it is not big enough to handle even a T3 connection across the continental United States, since a 16-bit field allows us to advertise a window of only 64 KB. The very same TCP extension mentioned above provides a mechanism for effectively increasing the size of the advertised window.
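
The same kind of arithmetic reproduces the smaller entries in Table 2 and shows where the 16-bit field runs out. The helper required_window_bytes is hypothetical, link rates are taken as nominal, and KB is taken to mean 1024 bytes.

```python
MAX_ADVERTISED_WINDOW = 2 ** 16 - 1  # largest value a 16-bit field can carry


def required_window_bytes(bandwidth_bps: float, rtt_seconds: float = 0.1) -> float:
    """Delay x bandwidth product, in bytes, for the given RTT."""
    return bandwidth_bps * rtt_seconds / 8


t1_kb = required_window_bytes(1.5e6) / 1024
ethernet_kb = required_window_bytes(10e6) / 1024
t3_kb = required_window_bytes(45e6) / 1024

t1_fits = required_window_bytes(1.5e6) <= MAX_ADVERTISED_WINDOW
t3_fits = required_window_bytes(45e6) <= MAX_ADVERTISED_WINDOW
```

A T1 link's 18 KB fits comfortably within the 64 KB limit, whereas a T3 link needs roughly 549 KB and does not.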

Triggering Transmission

We next consider a surprisingly subtle issue: how TCP decides to transmit a segment. As described earlier, TCP supports a byte-stream abstraction; that is, application programs write bytes into the stream, and it is up to TCP to decide that it has enough bytes to send a segment. What factors govern this decision?

If we ignore the possibility of flow control (that is, we assume the window is wide open, as would be the case when a connection first starts), then TCP has three mechanisms to trigger the transmission of a segment. First, TCP maintains a variable, typically called the maximum segment size (MSS), and it sends a segment as soon as it has collected MSS bytes from the sending process. MSS is usually set to the size of the largest segment TCP can send without causing the local IP to fragment. That is, MSS is set to the maximum transmission unit (MTU) of the directly connected network, minus the size of the TCP and IP headers. The second thing that triggers TCP to transmit a segment is that the sending process has explicitly asked it to do so. Specifically, TCP supports a push operation, and the sending process invokes this operation to effectively flush the buffer of unsent bytes. The final trigger for transmitting a segment is that a timer fires; the resulting segment contains as many bytes as are currently buffered for transmission. However, as we will soon see, this "timer" isn't exactly what you expect.
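
The MSS calculation and the three triggers can be summarized in a few lines. The sketch assumes 20-byte IP and TCP headers with no options and a 1500-byte Ethernet MTU; mss_for_mtu and should_transmit are hypothetical names, and flow control is ignored, as in the paragraph above.

```python
def mss_for_mtu(mtu: int, ip_header: int = 20, tcp_header: int = 20) -> int:
    """MSS = MTU minus the IP and TCP header sizes (no options assumed)."""
    return mtu - ip_header - tcp_header


def should_transmit(buffered_bytes: int, mss: int,
                    push_requested: bool = False,
                    timer_fired: bool = False) -> bool:
    """Apply the three triggers: a full MSS, an explicit push, or a timer."""
    if buffered_bytes == 0:
        return False  # nothing to send, whatever the trigger
    return buffered_bytes >= mss or push_requested or timer_fired


ETHERNET_MSS = mss_for_mtu(1500)
```

With these assumptions a sender on Ethernet transmits as soon as 1460 bytes are buffered, and sends a smaller segment only when pushed or when the timer fires.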

Silly Window Syndrome

Of course, we can't just ignore flow control, which plays an obvious role in throttling the sender. If the sender has MSS bytes of data to send and the window is open at least that much, then the sender transmits a full segment. Suppose, however, that the sender is accumulating bytes to send, but the window is currently closed. Now suppose an ACK arrives that effectively opens the window enough for the sender to transmit, say, MSS/2 bytes. Should the sender transmit a half-full segment or wait for the window to open to a full MSS? The original specification was silent on this point, and early implementations of TCP decided to go ahead and transmit a half-full segment. After all, there is no telling how long it will be before the window opens further.

It turns out that the strategy of aggressively taking advantage of any available window leads to a situation now known as the silly window syndrome. Figure 7 helps visualize what happens. If you think of a TCP stream as a conveyor belt with "full" containers (data segments) going in one direction and empty containers (ACKs) going in the reverse direction, then MSS-sized segments correspond to large containers and 1-byte segments correspond to very small containers. As long as the sender is sending MSS-sized segments and the receiver ACKs at least one MSS of data at a time, everything is good (Figure 7(a)). But, what if the receiver has to reduce the window, so that at some time the sender can't send a full MSS of data? If the sender aggressively fills a smaller-than-MSS empty container as soon as it arrives, then the receiver will ACK that smaller number of bytes, and hence the small container introduced into the system remains in the system indefinitely. That is, it is immediately filled and emptied at each end and is never coalesced with adjacent containers to create larger containers, as in Figure 7(b). This scenario was discovered when early implementations of TCP regularly found themselves filling the network with tiny segments.


Figure 7. Silly window syndrome. (a) As long as the sender sends MSS-sized segments and the receiver ACKs one MSS at a time, the system works smoothly. (b) As soon as the sender sends less than one MSS, or the receiver ACKs less than one MSS, a small "container" enters the system and continues to circulate.

Note that the silly window syndrome is only a problem when either the sender transmits a small segment or the receiver opens the window a small amount. If neither of these happens, then the small container is never introduced into the stream. It's not possible to outlaw sending small segments; for example, the application might do a push after sending a single byte. It is possible, however, to keep the receiver from introducing a small container (i.e., a small open window). The rule is that after advertising a zero window the receiver must wait for space equal to an MSS before it advertises an open window.
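The receiver-side rule can be sketched in C as follows. This is only a minimal illustration of the one rule stated above; the function name window_to_advertise and its parameters are hypothetical, and a real receiver applies further constraints that are not modeled here.

```c
#include <assert.h>

/* Hypothetical receiver-side helper: decide what window to advertise.
 * free_space  - bytes currently free in the receive buffer
 * mss         - maximum segment size for the connection
 * last_window - window advertised in the previous ACK
 */
int window_to_advertise(int free_space, int mss, int last_window)
{
    assert(free_space >= 0 && mss > 0 && last_window >= 0);

    /* After a zero window, keep advertising zero until a full MSS is free. */
    if (last_window == 0 && free_space < mss)
        return 0;

    return free_space;
}
```

With an MSS of 1460 bytes, a receiver that last advertised zero and now has 100 bytes free would still advertise zero; once 1460 bytes are free it advertises 1460.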

Since we can't eliminate the possibility of a small container being introduced into the stream, we also need mechanisms to coalesce them. The receiver can do this by delaying ACKs (sending one combined ACK rather than multiple smaller ones), but this is only a partial solution because the receiver has no way of knowing how long it is safe to delay waiting either for another segment to arrive or for the application to read more data (thus opening the window). The ultimate solution falls to the sender, which brings us back to our original issue: When does the TCP sender decide to transmit a segment?

Nagle's Algorithm

Returning to the TCP sender, if there is data to send but the window is open less than MSS, then we may want to wait some amount of time before sending the available data, but the question is how long? If we wait too long, then we hurt interactive applications like Telnet. If we don't wait long enough, then we risk sending a bunch of tiny packets and falling into the silly window syndrome. The answer is to introduce a timer and to transmit when the timer expires.

While we could use a clock-based timer (for example, one that fires every 100 ms), Nagle introduced an elegant self-clocking solution. The idea is that as long as TCP has any data in flight, the sender will eventually receive an ACK. This ACK can be treated like a timer firing, triggering the transmission of more data. Nagle's algorithm provides a simple, unified rule for deciding when to transmit:

    When the application produces data to send
        if both the available data and the window >= MSS
            send a full segment
        else
            if there is unACKed data in flight
                buffer the new data until an ACK arrives
            else
                send all the new data now
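The rule can be expressed as a small C decision function. This is a sketch of the decision only, not of a TCP send path; the enum values, the function name nagle_decide, and the parameter names are all illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Possible outcomes of the decision (names are illustrative). */
enum nagle_action { SEND_FULL_SEGMENT, BUFFER_UNTIL_ACK, SEND_ALL_NOW };

/* Decide what to do when the application produces data to send.
 * available - bytes buffered and not yet sent
 * window    - bytes the advertised window currently allows
 * in_flight - bytes sent but not yet acknowledged
 * mss       - maximum segment size
 */
enum nagle_action nagle_decide(size_t available, size_t window,
                               size_t in_flight, size_t mss)
{
    assert(mss > 0);

    if (available >= mss && window >= mss)
        return SEND_FULL_SEGMENT;
    if (in_flight > 0)
        return BUFFER_UNTIL_ACK;   /* the next ACK acts as the timer */
    return SEND_ALL_NOW;
}
```

For example, a single typed byte is sent at once when nothing is in flight, but is buffered if an earlier segment has not yet been acknowledged.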


In other words, it's always OK to send a full segment if the window allows. It's also all right to immediately send a small amount of data if there are currently no segments in transit, but if there is anything in flight the sender must wait for an ACK before transmitting the next segment. Thus, an interactive application like Telnet that continually writes one byte at a time will send data at a rate of one segment per RTT. Some segments will contain a single byte, while others will contain as many bytes as the user was able to type in one round-trip time. Because some applications cannot afford such a delay for each write they do to a TCP connection, the socket interface allows the application to turn off Nagle's algorithm by setting the TCP_NODELAY option. Setting this option means that data is transmitted as soon as possible.
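On a POSIX system, the option is set with setsockopt at the IPPROTO_TCP level. The wrapper below is a minimal sketch; the name disable_nagle is an assumption for this example, and the caller is assumed to already hold a TCP socket descriptor.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Turn off Nagle's algorithm on a TCP socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int disable_nagle(int sockfd)
{
    int one = 1;
    return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```

An application would typically call this once, right after creating or accepting the connection.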

Adaptive Retransmission

Because TCP guarantees the reliable delivery of data, it retransmits each segment if an ACK is not received in a certain period of time. TCP sets this timeout as a function of the RTT it expects between the two ends of the connection. Unfortunately, given the range of possible RTTs between any pair of hosts in the Internet, as well as the variation in RTT between the same two hosts over time, choosing an appropriate timeout value is not that easy. To address this problem, TCP uses an adaptive retransmission mechanism. We now describe this mechanism and how it has evolved over time as the Internet community has gained more experience using TCP.

Original Algorithm

We begin with a simple algorithm for computing a timeout value between a pair of hosts. This is the algorithm that was originally described in the TCP specification (and the following description presents it in those terms), but it could be used by any end-to-end protocol.

The idea is to keep a running average of the RTT and then to compute the timeout as a function of this RTT. Specifically, every time TCP sends a data segment, it records the time. When an ACK for that segment arrives, TCP reads the time again, and then takes the difference between these two times as a SampleRTT. TCP then computes an EstimatedRTT as a weighted average between the previous estimate and this new sample. That is,

    EstimatedRTT = alpha x EstimatedRTT + (1 - alpha) x SampleRTT

The parameter alpha is selected to smooth the EstimatedRTT. A small alpha tracks changes in the RTT but is perhaps too heavily influenced by temporary fluctuations. On the other hand, a large alpha is more stable but perhaps not quick enough to adapt to real changes. The original TCP specification recommended a setting of alpha between 0.8 and 0.9. TCP then uses EstimatedRTT to compute the timeout in a rather conservative way:

    TimeOut = 2 x EstimatedRTT
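A minimal C sketch of these two formulas follows. The struct, the function names, and the choice of milliseconds as the unit are assumptions for illustration.

```c
#include <assert.h>

/* State for the original estimator (names are illustrative). */
struct rtt_estimator {
    double estimated_rtt;  /* running average, in ms */
    double alpha;          /* smoothing weight, e.g. 0.8 to 0.9 */
};

/* Fold a new SampleRTT into the running average. */
void rtt_update(struct rtt_estimator *e, double sample_rtt)
{
    assert(e->alpha >= 0.0 && e->alpha <= 1.0);
    e->estimated_rtt = e->alpha * e->estimated_rtt
                     + (1.0 - e->alpha) * sample_rtt;
}

/* TimeOut = 2 x EstimatedRTT */
double rtt_timeout(const struct rtt_estimator *e)
{
    return 2.0 * e->estimated_rtt;
}
```

With alpha = 0.875 and an estimate of 100 ms, a sample of 180 ms moves the estimate to 0.875 x 100 + 0.125 x 180 = 110 ms, for a timeout of 220 ms.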

Karn/Partridge Algorithm

After several years of use on the Internet, a rather obvious flaw was discovered in this simple algorithm. The problem was that an ACK does not really acknowledge a transmission; it actually acknowledges the receipt of data. In other words, whenever a segment is retransmitted and then an ACK arrives at the sender, it is impossible to determine if this ACK should be associated with the first or the second transmission of the segment for the purpose of measuring the sample RTT. It is necessary to know which transmission to associate it with so as to compute an accurate SampleRTT. As illustrated in Figure 8, if you assume that the ACK is for the original transmission but it was really for the second, then the SampleRTT is too large (a); if you assume that the ACK is for the second transmission but it was actually for the first, then the SampleRTT is too small (b).


Figure 8. Associating the ACK with (a) original transmission versus (b) retransmission.

The solution, which was proposed in 1987, is surprisingly simple. Whenever TCP retransmits a segment, it stops taking samples of the RTT; it only measures SampleRTT for segments that have been sent only once. This solution is known as the Karn/Partridge algorithm, after its inventors. Their proposed fix also includes a second small change to TCP's timeout mechanism. Each time TCP retransmits, it sets the next timeout to be twice the last timeout, rather than basing it on the last EstimatedRTT. That is, Karn and Partridge proposed that TCP use exponential backoff, similar to what the Ethernet does. The motivation for using exponential backoff is simple: Congestion is the most likely cause of lost segments, meaning that the TCP source should not react too aggressively to a timeout. In fact, the more times the connection times out, the more cautious the source should become. We will see this idea again, embodied in a much more sophisticated mechanism, in the next chapter.
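Both changes can be sketched in C on top of the original estimator. The names below are hypothetical. Recomputing the timeout from EstimatedRTT once an unambiguous sample arrives is an assumption of this sketch; the text above only specifies the doubling on retransmission.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative per-connection timer state (names are assumptions). */
struct kp_timer {
    double estimated_rtt;  /* running average, in ms */
    double timeout;        /* current retransmission timeout, in ms */
    double alpha;          /* smoothing weight from the original algorithm */
};

/* Called when the retransmission timer expires: exponential backoff. */
void kp_on_timeout(struct kp_timer *t)
{
    t->timeout *= 2.0;
}

/* Called when an ACK arrives for a segment.
 * was_retransmitted - true if the segment was sent more than once
 * sample_rtt        - time from sending the segment to this ACK, in ms
 */
void kp_on_ack(struct kp_timer *t, bool was_retransmitted, double sample_rtt)
{
    assert(sample_rtt >= 0.0);

    /* An ACK for a retransmitted segment is ambiguous: take no sample. */
    if (was_retransmitted)
        return;

    t->estimated_rtt = t->alpha * t->estimated_rtt
                     + (1.0 - t->alpha) * sample_rtt;
    t->timeout = 2.0 * t->estimated_rtt;  /* assumption: reset on a clean sample */
}
```

Starting from a 200 ms timeout, two consecutive timer expirations raise it to 400 ms and then 800 ms, and an ACK for the retransmitted segment leaves both the estimate and the timeout unchanged.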

Jacobson/Karels Algorithm

The Karn/Partridge algorithm was introduced at a time when the Internet was suffering from high levels of network congestion. Their approach was designed to fix some of the causes of that congestion, but, although it was an improvement, the congestion was not eliminated. The following year (1988), two other researchers, Jacobson and Karels, proposed a more drastic change to TCP to battle congestion. The bulk of that proposed change is described in the next chapter. Here, we focus on the aspect of that proposal that is related to deciding when to time out and retransmit a segment.

As an aside, it should be clear how the timeout mechanism is related to congestion: if you time out too soon, you may unnecessarily retransmit a segment, which only adds to the load on the network. The other reason for needing an accurate timeout value is that a timeout is taken to imply congestion, which triggers a congestion-control mechanism. Finally, note that there is nothing about the Jacobson/Karels timeout computation that is specific to TCP. It could be used by any end-to-end protocol.

The main problem with the original computation is that it does not take the variance of the sample RTTs into account. Intuitively, if the variation among samples is small, then the EstimatedRTT can be better trusted and there is no reason for multiplying this estimate by 2 to compute the timeout. On the other hand, a large variance in the samples suggests that the timeout value should not be too tightly coupled to the EstimatedRTT.

In the new approach, the sender measures a new SampleRTT as before. It then folds this new sample into the timeout calculation as follows:

    Difference = SampleRTT - EstimatedRTT
    EstimatedRTT = EstimatedRTT + (delta x Difference)
    Deviation = Deviation + delta (|Difference| - Deviation)

where delta is a fraction between 0 and 1. That is, we calculate both the mean RTT and the variation in that mean.


TCP then computes the timeout value as a function of both EstimatedRTT and Deviation as follows:

    TimeOut = mu x EstimatedRTT + phi x Deviation

where, based on experience, mu is typically set to 1 and phi is set to 4. Thus, when the variance is small, TimeOut is close to EstimatedRTT; a large variance causes the Deviation term to dominate the calculation.
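The equations translate directly into floating-point C. This is a sketch for checking the arithmetic by hand; the struct and function names are assumptions, and times are taken to be in milliseconds.

```c
#include <assert.h>

/* Illustrative state: mean RTT and its mean deviation, in ms. */
struct jk_estimator {
    double estimated_rtt;
    double deviation;
};

/* Fold a new SampleRTT into both running values. */
void jk_update(struct jk_estimator *e, double sample_rtt, double delta)
{
    assert(delta > 0.0 && delta < 1.0);

    double difference = sample_rtt - e->estimated_rtt;
    double magnitude = difference < 0.0 ? -difference : difference;

    e->estimated_rtt += delta * difference;
    e->deviation += delta * (magnitude - e->deviation);
}

/* TimeOut = mu x EstimatedRTT + phi x Deviation, with mu = 1 and phi = 4. */
double jk_timeout(const struct jk_estimator *e)
{
    const double mu = 1.0, phi = 4.0;
    return mu * e->estimated_rtt + phi * e->deviation;
}
```

With delta = 1/8, an estimate of 100 ms, and a deviation of 10 ms, a sample of 120 ms gives Difference = 20, EstimatedRTT = 102.5, Deviation = 11.25, and TimeOut = 102.5 + 4 x 11.25 = 147.5 ms.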

Implementation

There are two items of note regarding the implementation of timeouts in TCP. The first is that it is possible to implement the calculation for EstimatedRTT and Deviation without using floating-point arithmetic. Instead, the whole calculation is scaled by 2^n, with delta selected to be 1/2^n. This allows us to do integer arithmetic, implementing multiplication and division using shifts, thereby achieving higher performance. The resulting calculation is given by the following code fragment, where n=3 (i.e., delta = 1/8). Note that EstimatedRTT and Deviation are stored in their scaled-up forms, while the value of SampleRTT at the start of the code and of TimeOut at the end are real, unscaled values. If you find the code hard to follow, you might want to try plugging some real numbers into it and verifying that it gives the same results as the equations above.

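    {
        /* On entry, SampleRTT holds the new unscaled sample. */
        SampleRTT -= (EstimatedRTT >> 3);   /* Difference */
        EstimatedRTT += SampleRTT;          /* scaled EstimatedRTT */
        if (SampleRTT < 0)
            SampleRTT = -SampleRTT;         /* |Difference| */
        SampleRTT -= (Deviation >> 3);
        Deviation += SampleRTT;             /* scaled Deviation */
        TimeOut = (EstimatedRTT >> 3) + (Deviation >> 1);
    }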
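To plug in real numbers, the fragment can be wrapped in a function. The wrapper below is a sketch: the struct scaled_rtt and the function scaled_rtt_update are hypothetical names, and the statements inside follow the fragment above.

```c
#include <assert.h>

/* EstimatedRTT and Deviation are stored scaled by 2^3 = 8. */
struct scaled_rtt {
    int estimated_rtt;  /* 8 x EstimatedRTT */
    int deviation;      /* 8 x Deviation */
};

/* Fold an unscaled sample into the scaled state; return the unscaled TimeOut. */
int scaled_rtt_update(struct scaled_rtt *s, int sample_rtt)
{
    assert(s->estimated_rtt >= 0 && s->deviation >= 0);

    sample_rtt -= (s->estimated_rtt >> 3);   /* Difference */
    s->estimated_rtt += sample_rtt;
    if (sample_rtt < 0)
        sample_rtt = -sample_rtt;            /* |Difference| */
    sample_rtt -= (s->deviation >> 3);
    s->deviation += sample_rtt;
    return (s->estimated_rtt >> 3) + (s->deviation >> 1);
}
```

Starting from EstimatedRTT = 100 ms and Deviation = 10 ms (stored as 800 and 80), a sample of 120 ms leaves 820 and 90 in the state, which are 102.5 ms and 11.25 ms unscaled, and returns 147. The equations with delta = 1/8, mu = 1, and phi = 4 give 147.5 ms; the difference is the truncation introduced by the integer shifts.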

The second point of note is that the Jacobson/Karels algorithm is only as good as the clock used to read the current time. On typical Unix implementations at the time, the clock granularity was as large as 500 ms, which is significantly larger than the average cross-country RTT of somewhere between 100 and 200 ms. To make matters worse, the Unix implementation of TCP only checked to see if a timeout should happen every time this 500-ms clock ticked and would only take a sample of the round-trip time once per RTT. The combination of these two factors could mean that a timeout would happen 1 second after the segment was transmitted. Once again, the extensions to TCP include a mechanism that makes this RTT calculation a bit more precise.

All of the retransmission algorithms we have discussed are based on acknowledgment timeouts, which indicate that a segment has probably been lost. Note that a timeout does not, however, tell the sender whether any segments it sent after the lost segment were successfully received. This is because TCP acknowledgments are cumulative; they identify only the last segment that was received without any preceding gaps. The reception of segments that occur after a gap grows more frequent as faster networks lead to larger windows. If ACKs also told the sender which subsequent segments, if any, had been received, then the sender could be more intelligent about which segments it retransmits, draw better conclusions about the state of congestion, and make better RTT estimates. A TCP extension supporting this is described in a later section.

Record Boundaries

Since TCP is a byte-stream protocol, the number of bytes written by the sender is not necessarily the same as the number of bytes read by the receiver. For example, the application might write 8 bytes, then 2 bytes, then 20 bytes to a TCP connection, while on the receiving side the application reads 5 bytes at a time inside a loop that iterates 6
