How the TCP/IP Protocol Works
Les Cottrell – SLAC
Lecture #1 presented at the 26th International Nathiagali Summer College on Physics and Contemporary Needs, 25th June – 14th July, Nathiagali, Pakistan
Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), also supported by IUPAP



  • *Overview
    This is not a lecture on how to program TCP/IP, but rather an introduction to how its major portions work:
        IP
        Addressing: IP addresses, ARP, routing
        ICMP
        UDP
        TCP: flow control, error recovery, establishment, disconnect
    References:
        Internetworking with TCP/IP, Volume I: Principles, Protocols & Architecture, by Douglas Comer
        TCP/IP Illustrated: The Protocols, by W. Richard Stevens
    Most information is also available free via Web searches

  • *Internet Protocol (IP, RFC 791)
    TCP/IP Internet provides 3 layers of service (top to bottom):
        Application services
        Transport services
        Connectionless packet delivery service
    Layering allows one to replace one service without affecting others
    IP layer (basic unit of transfer in TCP/IP) provides:
        Best-effort (does not discard capriciously), unreliable (no guarantees) delivery
        Packet may be lost, duplicated or delivered out of order, with no notification
        Connectionless (each packet treated independently)
    IP software provides routing

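  • *Internet datagram
    Basic transfer unit
    Format of Internet datagram: datagram header, followed by datagram data area
    Datagram layout (32-bit rows; bit positions 0-31):
        Row 1: Vers (0-3) | Hlen (4-7) | Type of serv. (8-15) | Total length (16-31)
        Row 2: Identification (0-15) | Flags (16-18) | Fragment offset (19-31)
        Row 3: TTL (0-7) | Protocol (8-15) | Header checksum (16-31)
        Row 4: Source IP address
        Row 5: Destination IP address
        Row 6: IP options (if any) | Padding
        Then:  Data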

  • *IP datagram format (cont.)
    Vers (4 bits): version of IP protocol (IPv4 = 4)
    Hlen (4 bits): header length in 32 bit words; without options (the usual case) = 5, i.e. 20 bytes
    Type of Service, TOS (8 bits): little used in the past, now being used for QoS
    Total length (16 bits): length of datagram in bytes, includes header and data
    Time to live, TTL (8 bits): specifies how long the datagram is allowed to remain in the internet
        Routers decrement by 1
        When TTL = 0 the router discards the datagram
        Prevents infinite loops
    Protocol (8 bits): specifies the format of the data area
        Protocol numbers are administered by a central authority to guarantee agreement, e.g. TCP=6, UDP=17

  • *IP Datagram format (cont.)
    Source & destination IP address (32 bits each): contain the IP addresses of the sender and the intended recipient
    Options (variable length): mainly used to record a route, or timestamps, or specify routing
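    The header fields described on the preceding slides can be decoded with a few lines of code. The following is a minimal sketch in Python, not part of the original lecture; the function name parse_ipv4_header and the sample header values are illustrative assumptions, and IP options are ignored.

```python
import socket
import struct

def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header (options are ignored)."""
    (vers_hlen, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": vers_hlen >> 4,            # top 4 bits
        "hlen_words": vers_hlen & 0x0F,       # header length in 32-bit words
        "tos": tos,
        "total_length": total_len,            # header + data, in bytes
        "identification": ident,
        "dont_fragment": bool(flags_frag & 0x4000),
        "more_fragments": bool(flags_frag & 0x2000),
        "fragment_offset_bytes": (flags_frag & 0x1FFF) * 8,  # field counts 8-byte units
        "ttl": ttl,
        "protocol": proto,                    # e.g. TCP=6, UDP=17
        "header_checksum": checksum,
        "source": socket.inet_ntoa(src),
        "destination": socket.inet_ntoa(dst),
    }

# Hypothetical header: IPv4, no options, DF set, TTL 64, UDP payload.
# The checksum is left as 0 because this sketch does not verify it.
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 84, 0x1234, 0x4000, 64, 17, 0,
                     socket.inet_aton("134.79.16.1"),
                     socket.inet_aton("127.0.0.1"))
fields = parse_ipv4_header(sample)
print(fields["version"], fields["hlen_words"], fields["protocol"], fields["source"])
```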

  • *IP Fragmentation
    How do we send a datagram of, say, 1400 bytes through a link that has a Maximum Transmission Unit (MTU) of, say, 620 bytes?
    Answer: the datagram is broken into fragments
    The router fragments 1400 byte datagrams into 600 bytes, 600 bytes and 200 bytes (note each fragment also needs 20 bytes for the IP header)
    Routers do NOT reassemble; that is left to the end host
    [Figure: three networks, Net 1 (MTU=1500), Net 2 (MTU=620), Net 3 (MTU=1500)]

  • *Fragmentation Control
    Identification: copied into each fragment, allows the destination to know which fragments belong to which datagram
    Fragment Offset (13 bits): specifies the offset in the original datagram of the data being carried in the fragment
        Measured in units of 8 bytes, starting at 0
    Flags (3 bits): control fragmentation
        Reserved (0-th bit)
        Don't Fragment, DF (1st bit):
            useful for a simple (computer bootstrap) application that can't handle fragmentation
            also used for MTU discovery (see later)
            if a router needs to fragment and can't, it discards the datagram & sends an error to the source
        More Fragments (least significant bit): when clear, tells the receiver it has got the last fragment
    TCP traffic is hardly ever fragmented (due to use of MTU discovery): about 0.1% - 0.5% of TCP packets are fragmented.

  • *Fragment series composition
    NB. If the data segment contains its own header, that header is not replicated in each fragment
    [Figure: a datagram split into four fragments: Offset=0 (more frags), Offset=1480 (more frags), Offset=2960 (more frags), Offset=3440 (last frag)]
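    The fragmentation rules on the preceding slides can be sketched as a small calculation. This is an illustrative Python sketch, not taken from the lecture; the function name fragment is hypothetical, and it assumes a 20-byte header with no options and an MTU large enough to carry at least 8 data bytes.

```python
def fragment(data_len: int, mtu: int, header_len: int = 20):
    """Return a list of (offset_bytes, data_bytes, more_fragments) tuples.

    Every fragment except the last must carry a multiple of 8 data bytes,
    because the fragment offset field counts 8-byte units.
    """
    per_fragment = (mtu - header_len) // 8 * 8
    if per_fragment <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    offset = 0
    while offset < data_len:
        size = min(per_fragment, data_len - offset)
        more = offset + size < data_len   # More Fragments flag
        fragments.append((offset, size, more))
        offset += size
    return fragments

# The 1400-byte example crossing a link with MTU=620:
print(fragment(1400, 620))
# [(0, 600, True), (600, 600, True), (1200, 200, False)]
```

    On the wire the offsets would be stored divided by 8 (0, 75 and 150 in this example).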

  • *Internet Addressing
    IP address is a 32 bit integer
    Refers to an interface rather than a host
    Consists of network and host portions
        Enables routers to keep 1 entry/network instead of 1/host
    Class A, B, C for unicast
    Class D for multicast
    Class E reserved
    Classless addresses
    Written as 4 octets/bytes in decimal format
        E.g. 134.79.16.1, 127.0.0.1

  • *Internet Class-based addresses
    Class A: large number of hosts, few networks
        0nnnnnnn hhhhhhhh hhhhhhhh hhhhhhhh
        7 network bits (0 and 127 reserved, so 126 networks), 24 host bits (> 16M hosts/net)
        Initial byte 1-126 (decimal)
    Class B: medium number of hosts and networks
        10nnnnnn nnnnnnnn hhhhhhhh hhhhhhhh
        16,384 class B networks, 65,534 hosts/network
        Initial byte 128-191 (decimal)
    Class C: large number of small networks
        110nnnnn nnnnnnnn nnnnnnnn hhhhhhhh
        2,097,152 networks, 254 hosts/network
        Initial byte 192-223 (decimal)
    Class D: 224-239 (decimal), multicast [RFC 1112]
    Class E: 240-255 (decimal), reserved
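    Because the class is fixed by the leading bits, it can be read off the first octet. A minimal Python sketch follows (the function name address_class is an illustrative assumption; reserved values such as 0 and 127 are simply reported by their class range):

```python
def address_class(addr: str) -> str:
    """Classify a dotted-decimal IPv4 address by its first octet."""
    first = int(addr.split(".")[0])
    if not 0 <= first <= 255:
        raise ValueError("first octet out of range")
    if first < 128:      # leading bit 0
        return "A"
    if first < 192:      # leading bits 10
        return "B"
    if first < 224:      # leading bits 110
        return "C"
    if first < 240:      # leading bits 1110, multicast
        return "D"
    return "E"           # leading bits 1111, reserved

for example in ("10.1.2.3", "134.79.16.1", "192.168.0.1", "224.0.0.1", "250.0.0.1"):
    print(example, address_class(example))
```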

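  • *Subnets
    A subnet mask is applied to the address to determine how the network is subnetted: bits set in the mask belong to the network/subnet part, bits clear belong to the host part. E.g. if the host is 137.138.28.228 and the subnet mask is 255.255.255.0, then the right hand 8 bits are for the host (255 is decimal for all bits set in an octet)
    A host part with all bits set indicates a broadcast, i.e. the packet is sent to all hosts on the subnet. A host part with no bits set identifies the subnet itself (some older implementations also treated it as a broadcast).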

  • *Subnet Mask Conversions
    Prefix length   Subnet mask        Prefix length   Subnet mask
    /1              128.0.0.0          /17             255.255.128.0
    /2              192.0.0.0          /18             255.255.192.0
    /3              224.0.0.0          /19             255.255.224.0
    /4              240.0.0.0          /20             255.255.240.0
    /5              248.0.0.0          /21             255.255.248.0
    /6              252.0.0.0          /22             255.255.252.0
    /7              254.0.0.0          /23             255.255.254.0
    /8              255.0.0.0          /24             255.255.255.0
    /9              255.128.0.0        /25             255.255.255.128
    /10             255.192.0.0        /26             255.255.255.192
    /11             255.224.0.0        /27             255.255.255.224
    /12             255.240.0.0        /28             255.255.255.240
    /13             255.248.0.0        /29             255.255.255.248
    /14             255.252.0.0        /30             255.255.255.252
    /15             255.254.0.0        /31             255.255.255.254
    /16             255.255.0.0        /32             255.255.255.255

    Decimal octet   Binary number
    128             1000 0000
    192             1100 0000
    224             1110 0000
    240             1111 0000
    248             1111 1000
    252             1111 1100
    254             1111 1110
    255             1111 1111
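    The conversion table can be regenerated rather than memorised. The sketch below uses Python's standard ipaddress module; the helper names prefix_to_mask and split_address are illustrative assumptions, not part of the lecture.

```python
import ipaddress

def prefix_to_mask(prefix_len: int) -> str:
    """Dotted-decimal mask with the top prefix_len bits set."""
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(mask))

def split_address(addr: str, mask: str):
    """Return (subnet address, host number) for addr under the given mask."""
    iface = ipaddress.ip_interface(f"{addr}/{mask}")
    host = int(iface.ip) & int(iface.network.hostmask)
    return str(iface.network.network_address), host

print(prefix_to_mask(21))                                # 255.255.248.0
print(split_address("137.138.28.228", "255.255.255.0"))  # ('137.138.28.0', 228)
```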

  • *Address depletion
    In 1991 the IAB identified 3 dangers:
        Running out of class B addresses
        Increase in nets has resulted in routing table explosion
        Increase in nets/hosts exhausting the 32 bit address space
    Four strategies to address these:
        Creative address space allocation {RFC 2050}
        Private addresses {RFC 1918}, Network Address Translation (NAT) {RFC 1631}
        Classless InterDomain Routing (CIDR) {RFC 1519}
        IP version 6 (IPv6) {RFC 1883}

  • *Creative IP address allocation
    Class A addresses 64-127 reserved
        Handled on an individual basis
    Class B only assigned given a demonstrated need
    Class C divided up into 8 blocks allocated to regional authorities
        208-223 remains unassigned and unallocated
    Three main registries handle assignments:
        APNIC: Asia & Pacific, www.apnic.net
        ARIN: N. & S. America, Caribbean & sub-Saharan Africa, www.arin.net
        RIPE: Europe and surrounding areas, www.ripe.net

  • *Private IP Addresses
    IP addresses that are not globally unique, but used exclusively within an organization
    Three ranges:
        10.0.0.0 - 10.255.255.255: a single class A net
        172.16.0.0 - 172.31.255.255: 16 contiguous class Bs
        192.168.0.0 - 192.168.255.255: 256 contiguous class Cs
    Connectivity provided by a Network Address Translator (NAT)
        Translates an outgoing private IP address to an Internet IP address, and a returning Internet IP address to a private address
        Only for TCP/UDP packets

  • *Classless InterDomain Routing (CIDR)
    Many organizations have > 256 computers but few have more than several thousand
    Instead of giving a class B (there are only 16,384 class B nets), give sufficient contiguous class C addresses to satisfy needs
        < 256 addresses: assign 1 class C
        < 8192 addresses: assign 32 contiguous class C nets

  • *Since assigned contiguously, a class C CIDR block has the same most significant bits & so only needs one routing table entry
    CIDR block represented by a prefix and prefix length
    Prefix = single address representing block of nets, e.g.
        192.32.136.0 = 11000000 00100000 10001000 00000000 while
        192.32.143.0 = 11000000 00100000 10001111 00000000
    Prefix length indicates number of routing bits, e.g. 192.32.136.0/21 means 21 bits are used for routing
    CIDR collects all nets in the range 192.32.136.0 through 192.32.143.0 into a single router entry, which reduces router table entries
    Removes address class A, B & C boundaries
    For more details see RFC 1519
    [Figure: CIDR & Supernetting, 21 bit prefix (2048 host addresses)]
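    The aggregation in this example can be checked with Python's standard ipaddress module. This is only a sketch of the arithmetic, not of how a router implements CIDR; the variable names are illustrative.

```python
import ipaddress

# The eight contiguous class C nets 192.32.136.0 through 192.32.143.0
class_c_nets = [ipaddress.ip_network(f"192.32.{third}.0/24")
                for third in range(136, 144)]

# collapse_addresses merges adjacent networks into the fewest possible prefixes
blocks = list(ipaddress.collapse_addresses(class_c_nets))
print(blocks)                   # [IPv4Network('192.32.136.0/21')]
print(blocks[0].num_addresses)  # 2048
print(blocks[0].netmask)        # 255.255.248.0
```

    The collapse only yields a single entry because the block starts on a multiple of 8 in the third octet; eight class C nets starting at, say, 192.32.137.0 would not share a 21 bit prefix.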

  • *Address Resolution Protocol (ARP)
    IP address is at the network layer; need to map it to the link layer MAC (Ethernet) address
    Use ARP to map a 32 bit IP address to a 48 bit Ethernet address
    IP requests the MAC address for an IP address from the local ARP table
    If not there, then an ARP request packet for the IP address is sent using the physical broadcast address (all FFs)
    Host with the requested IP address responds with its MAC address as a unicast packet
    On return, host updates its ARP table and returns the MAC address
    ARP cache times out
    ARP packets are carried directly on top of Ethernet

  • *ARP cont.
    ARP requests are local only, do not cross routers
    Compare local IP and subnet mask => local subnet
    Compare local subnet to destination IP
        if local, ARP for MAC address
        else remote, so
            if ROUTE entry, ARP for router to subnet
            if default route, ARP for default gateway
            otherwise, drop packet & return error
    [Figure: User A and User B on Subnet 1 and Subnet 2, connected through a router; addresses shown: 134.79.10.17, 134.79.15.3, 134.79.15.1, 134.79.10.1]
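    The decision steps above can be sketched in Python. This is an illustration only: the function name arp_target, the /24 subnet mask and the route table contents are assumptions made for the example, not values given in the lecture.

```python
import ipaddress

def arp_target(local_interface: str, destination: str,
               routes: dict, default_gateway=None):
    """Return the IP address to ARP for, or None if the packet must be dropped.

    local_interface is 'address/prefix'; routes maps a subnet string to the
    address of the router that reaches it.
    """
    local = ipaddress.ip_interface(local_interface)
    dest = ipaddress.ip_address(destination)
    if dest in local.network:                 # local: ARP for the host itself
        return destination
    for subnet, router in routes.items():     # remote with an explicit route
        if dest in ipaddress.ip_network(subnet):
            return router
    return default_gateway                    # may be None: drop & return error

# Hypothetical setup: a host on 134.79.10.0/24 with one explicit route
routes = {"134.79.15.0/24": "134.79.10.1"}
print(arp_target("134.79.10.17/24", "134.79.10.99", routes))
print(arp_target("134.79.10.17/24", "134.79.15.3", routes))
print(arp_target("134.79.10.17/24", "210.56.16.10", routes))
```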

  • *Routing
    Routers must select next hop for packet
    Get route information from other routers via a routing protocol (RIP, OSPF, EIGRP etc.)
    Note the following are non-routable:
        Private networks: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
        Loopback: 127.0.0.0/8

  • *ICMP Purpose (RFC 792)
    Communicates control & error information
        Between routers and hosts
        Only reports to original source, suggests corrections
        Error messages about error messages are not generated
        Never generated due to multicasts
    Packet format (one 32-bit row, then data):
        Type (bits 0-7) | Code (bits 8-15) | Checksum (bits 16-31)
        ICMP data (depends on type/code)

  • *Main ICMP request types
    Type    ICMP message
    0       Echo reply (ping)
    3       Destination unreachable (code 1 host, code 3 port, code 4 DF set and must fragment)
    4       Source quench
    5       Redirect (change a route)
    8       Echo request
    11      Time exceeded (code 0 ttl=0, code 1 reassembly)
    12      Parameter problems

  • *ICMP Echo/Ping
    Very commonly used diagnostic tool
    Implementations vary between OSs
    Build echo request:
        Type=8 (bits 0-7) | Code=0 (bits 8-15) | Checksum (bits 16-31)
        Identifier (bits 0-15) | Sequence number (bits 16-31)
        Optional data
    Identifier: used to match requests to replies (e.g. pid)
    Sequence number: starts at 0, increments by 1 for each ping packet
        Used to detect loss, reordering, duplicates
    Optional data: sent by requester, returned by replier
        Usually contains a timestamp of when the request was sent plus pad data
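    Building the echo request laid out above takes only a few lines. The following Python sketch constructs the packet bytes and fills in the standard Internet (one's complement) checksum; it does not send anything, and the function names and the identifier and payload values are illustrative assumptions.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                       # pad to a whole number of words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                        # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_echo_request(identifier: int, sequence: int, payload: bytes = b"") -> bytes:
    """Type=8, Code=0, checksum, identifier, sequence number, optional data."""
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload

packet = build_echo_request(identifier=1234, sequence=0, payload=b"abcdefgh")
print(len(packet))                 # 16: 8-byte ICMP header + 8 bytes of data
print(internet_checksum(packet))   # 0: a correctly checksummed packet verifies to zero
```

    Actually sending such a packet normally requires a raw socket and elevated privileges, which is why this sketch stops at building the bytes.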

  • *What do we learn from Ping
    Host reachable
        Host may respond to ping but not be running services
    Round trip timing
    Lost packets
    Packet reordering, duplicate packets
    Example:

    13cottrell@noric05:~>ping -c 4 lhr.comsats.net.pk
    PING lhr.comsats.net.pk (210.56.16.10) from 134.79.125.205 : 56(84) bytes of data.
    64 bytes from lhr.comsats.net.pk (210.56.16.10): icmp_seq=0 ttl=242 time=716.962 msec
    64 bytes from lhr.comsats.net.pk (210.56.16.10): icmp_seq=1 ttl=242 time=720.375 msec
    64 bytes from lhr.comsats.net.pk (210.56.16.10): icmp_seq=2 ttl=242 time=725.907 msec
    64 bytes from lhr.comsats.net.pk (210.56.16.10): icmp_seq=3 ttl=242 time=710.734 msec

    --- lhr.comsats.net.pk ping statistics ---
    4 packets transmitted, 4 packets received, 0% packet loss
    round-trip min/avg/max/mdev = 710.734/718.494/725.907/5.566 ms

  • *Unreachable
    76cottrell@flora06:~>ping islamabad-server2.comsats.net.pk
    ICMP 13 Unreachable from gateway 207.45.205.18 for icmp from FLORA06.SLAC.Stanford.EDU (134.79.16.101) to islamabad-server2.comsats.net.pk (210.56.8.8)

    What does this mean? See exercise.

  • *Time Exceeded
    Time-to-live has expired at a router (code=0)
        ttl sets a bound on the number of routers a datagram can transit
        Prevents infinite routing loops
        Initialized by sender, decremented by 1 each time it passes a router
        When ttl = 0 the datagram is thrown away & the sender is notified by an ICMP message
    Fragment reassembly timer (code=1)
    Packet format:
        Type 11 (bits 0-7) | Code (bits 8-15) | Checksum (bits 16-31)
        Unused
        Internet header & 8 bytes of data

  • *MTU Discovery
    Path MTUs vary
    Fragmentation is bad
    Small transmission units are bad
    SO need to discover the optimum MTU (largest without fragmentation)
    Host sends a packet with the Don't Fragment bit set
        Length is the lesser of the local MTU and the MSS announced by the remote system
    If the MTU between the hosts requires fragmentation (e.g. at an intermediate router), then an ICMP "DF bit set & must fragment" message is sent back to the source, saying "I can't fragment, try again with a smaller size."

  • *User Datagram Protocol - UDP
    RFC 768, Protocol 17
    Provides unreliable, connectionless delivery on top of IP
    Minimal overhead, high performance
        No setup/teardown, 1 datagram at a time
    Application responsible for reliability
        Includes datagram loss, duplication, delay, out-of-sequence, multiplexing, loss of connectivity
    [Figure: protocol stack. Network layer: IP, demux on IP protocol. Transport layer: TCP and UDP, demux on port number. App. layer: Port 1 and Port 2 above each of TCP and UDP]

  • *UDP Datagram format
    Source/destination port: port numbers identify sending & receiving processes
        Port number & IP address allow any application in any computer on the Internet to be uniquely identified
        Used to demultiplex datagrams to processes
    Ports can be static or dynamic
        Static (< 1024): assigned centrally, known as well-known ports
        Dynamic
    Message length: in bytes, includes the UDP header and data

  • *UDP applications
    Message oriented, e.g. SNMP, DNS, time
    File system, e.g. NFS, AFS
    Lightweight file transfer, e.g. tftp, bootp

  • *Transmission Control Protocol - TCP
    RFC 793 & host requirements RFC 1122
    Reliable stream transport
    Connection oriented (full duplex virtual circuit)
        Conceptually place a call; the two ends communicate to agree on details
        After agreeing, application notified of connection
        During transfer, ends communicate continuously to verify data received correctly
        When done, ends tear down the connection
        If UDP is like regular mail, TCP is like a phone call
    Provides buffering and flow control
    Takes care of lost packets, out of order, duplicates, long delays
    Isolates application program from network details
    Jargon
        Segment = TCP packet
        Socket = source (address + port) + destination (address + port)

  • *TCP layering
    To ID a connection need:
        Source: (address, port) AND Destination: (address, port)
        Only need one port on a host to allow multiple connections, since each connection will have a different (host, port) at the other end
            E.g. a single host can serve multiple telnet connections
    Passive open: application contacts OS & indicates it will accept incoming connections; OS assigns port and listens
    Active open: application requests OS to connect to a (host, port)
    [Figure: protocol stack. Network layer: IP, demux on IP protocol (TCP is IP protocol 6). Transport layer: TCP and UDP, demux on port number. App. layer: Port 1 and Port 2 above each of TCP and UDP]

  • *TCP providing reliability
    Positive acknowledgement (ACK) with retransmission
        Sender keeps record of each packet sent
        Sender awaits an ACK
        Sender starts timer when it sends a packet
    [Figure: network messages between sender site and receiver site, time running downwards:
        Sender: send pkt 1  ->  Receiver: rcv pkt 1, send ACK 1  ->  Sender: rcv ACK 1
        Sender: send pkt 2  ->  Receiver: rcv pkt 2, send ACK 2  ->  Sender: rcv ACK 2]

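  • *TCP simple lost packet recovery
    [Figure: network messages between sender site and receiver site, time running downwards:
        Sender: send pkt 1, start timer
        Loss in the network: pkt should arrive at receiver, ACK should be sent
        Sender: ACK normally arrives about now; instead the timer expires
        Sender: retransmit pkt 1, start timer
        Receiver: rcv pkt 1, send ACK 1
        Sender: rcv ACK 1]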

  • *TCP improving performance
    BUT a simple ACK protocol wastes bandwidth, since it must delay sending the next packet until it gets the ACK
    Use a sliding window
        Sender can send 4 packets of data without an ACK
        When sender gets an ACK it can send another packet
        Window = unacknowledged packets/bytes
        Keeps a timer for each packet
    [Figure: a row of numbered packets with an initial window of 4 packets; the window slides as ACKs arrive. Legend: packets successfully sent; packets sent, awaiting ACK; packets to be sent]

  • *Tuning to fill pipe
    Optimal window size depends on:
        Bandwidth end to end, i.e. min(BWlinks), AKA bottleneck bandwidth
        Round Trip Time (RTT)
    For TCP to keep the pipe full:
        Window (sometimes called pipe) ~ RTT*BW
        Can increase throughput by orders of magnitude
    Windows also used for flow control
    [Figure: pipe between Src and Rcv]
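    The Window ~ RTT*BW rule is easy to turn into numbers. In the Python sketch below the function names and the example figures (a 10 Mbit/s bottleneck, a 0.718 s RTT and a 64 kbyte window) are illustrative assumptions, chosen only to show the size of the effect.

```python
def window_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Window needed to keep the pipe full: bandwidth * RTT, in bytes."""
    return bandwidth_bits_per_s * rtt_s / 8

def max_throughput_bits_per_s(window_size_bytes: float, rtt_s: float) -> float:
    """Best case throughput when at most one window is sent per round trip."""
    return window_size_bytes * 8 / rtt_s

# Hypothetical long path: 10 Mbit/s bottleneck, RTT of 0.718 s
needed = window_bytes(10e6, 0.718)
print(f"window needed: {needed:.0f} bytes")

# With only a 64 kbyte window on the same path the pipe is mostly empty
limited = max_throughput_bits_per_s(64 * 1024, 0.718)
print(f"throughput with 64 kbyte window: {limited / 1e6:.2f} Mbit/s")
```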

  • *Implementation
    Sliding window operates at byte level, NOT packet
    [Figure: byte stream with the current window marked by 3 pointers: bytes sent and acknowledged; highest byte sent; highest byte that can be sent]
    Receiver keeps a similar window to put the stream back together
    Since full duplex, altogether 4 windows & pointer sets

  • *TCP flow control
    Windows vary over time
    Receiver advertises (in ACKs) how many bytes it can receive
        Based on buffers etc. available
    Sender adjusts its window to match the advertisement
    If receiver buffers fill, it sends smaller adverts
    Used to match buffer requirements of receiver
    Also used to address congestion control (e.g. in intermediate routers)

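  • *TCP Segment format
    Segment layout (32-bit rows; bit positions 0-31):
        Row 1: Source port (0-15) | Destination port (16-31)
        Row 2: Sequence number
        Row 3: Acknowledgement number
        Row 4: Hlen (0-3) | Resv (4-9) | Code (10-15) | Window (16-31)
        Row 5: Checksum (0-15) | Urgent ptr (16-31)
        Row 6: Options (if any) | Padding
        Then:  Data (if any)
    Source/Dest port: TCP port numbers to ID applications at both ends of connection
    Sequence number: ID position in sender's byte stream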

  • *TCP segment format cont.
    Acknowledgement: identifies the number of the byte the sender of this segment expects to receive next
    Hlen: specifies the length of the segment header in 32 bit multiples. If there are no options, Hlen = 5 (20 bytes)
    Reserved: for future use, set to 0
    Code: used to determine segment purpose, e.g. SYN, ACK, FIN, URG

  • *TCP Segment format - cont.
    Window: advertises how much data this station is willing to accept. Can depend on buffer space remaining.
    Checksum: verifies the integrity of the TCP header and data. It is mandatory.
    Urgent pointer: used with the URG flag to indicate where the urgent data starts in the data stream. Typically used with a file transfer abort during FTP or when pressing an interrupt key in telnet.
    Options: used for window scaling, SACK, timestamps, maximum segment size etc.

  • *TCP timeout
    Need a timeout estimate that will work from LANs (RTT < msec.) to satellite WANs (hundreds of msec. to secs). RTT can vary a lot with time of day, day of week, or from one second to the next.
    TCP records time segment sent and time ACK received
    Then calculates RTT sample
    Smooth (RTTs) & use to estimate timeout, e.g.
        Timeout = beta * RTTs
        Timeout = RTTs + eta * f(dev(RTTs)), with eta = 4
    Needs to take account of losses, e.g.
        New_timeout = gamma * timeout, with gamma = 2
    [Figure: RTT (ms) vs. time of day, May 12th]
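    One common way to realise the second formula is an exponentially weighted average of the RTT samples and of their deviation. The Python sketch below is an illustration under stated assumptions, not the lecture's own algorithm: the class name RttEstimator is hypothetical, and the smoothing gains of 1/8 and 1/4 and the initial deviation of half the first sample are borrowed from standard TCP practice (RFC 6298). Only eta = 4 and gamma = 2 come from the slide.

```python
class RttEstimator:
    """Smoothed RTT plus eta times the smoothed deviation, with backoff on loss."""

    def __init__(self, gain=0.125, dev_gain=0.25, eta=4.0, gamma=2.0):
        self.gain = gain          # assumed weight of a new sample in the smoothed RTT
        self.dev_gain = dev_gain  # assumed weight of a new sample in the deviation
        self.eta = eta            # eta = 4 on the slide
        self.gamma = gamma        # gamma = 2 on the slide
        self.srtt = None          # smoothed RTT (RTTs)
        self.rttvar = None        # smoothed deviation, f(dev(RTTs))
        self.timeout = None

    def sample(self, rtt: float) -> float:
        """Feed one RTT measurement (seconds) and return the new timeout."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = ((1 - self.dev_gain) * self.rttvar
                           + self.dev_gain * abs(self.srtt - rtt))
            self.srtt = (1 - self.gain) * self.srtt + self.gain * rtt
        self.timeout = self.srtt + self.eta * self.rttvar
        return self.timeout

    def on_loss(self) -> float:
        """Back off after a retransmission timeout: New_timeout = gamma * timeout."""
        self.timeout *= self.gamma
        return self.timeout

estimator = RttEstimator()
for rtt in (0.717, 0.720, 0.726, 0.711):   # RTTs similar to the ping example, in seconds
    print(f"sample {rtt:.3f} s -> timeout {estimator.sample(rtt):.3f} s")
print(f"after a loss -> timeout {estimator.on_loss():.3f} s")
```

    With steady samples the deviation term decays, so the timeout converges towards the smoothed RTT; a burst of variable samples widens it again.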

  • *TCP connection establishment
    3 way handshake
        Initial sequence numbers (x, y) are chosen randomly
        Guarantees both sides are ready & know it, and sets initial sequence numbers; also sets window & mss
        Once the connection is established, data can flow in both directions equally well; there is no master or slave
    [Figure: Site 1 (active; win 4096, mss 1024) and Site 2 (passive; win 4096, mss 1024), time running downwards:
        Site 1: send SYN seq=x          ->  Site 2: rcv SYN segment
        Site 1: rcv SYN/ACK             <-  Site 2: send SYN seq=y, ACK x+1
        Site 1: send ACK y+1            ->  Site 2: rcv ACK segment]

  • *TCP close connection
    Modified 3 way handshake (or 4 way termination)
        App tells TCP to close; TCP sends remaining data & waits for ACK, then sends FIN
        Site 2 TCP ACKs the FIN, tells its application end of data
        Site 2 sends FIN when its app closes the connection (there may be a long delay, e.g. if it requires human interaction)
    [Figure: Site 1 and Site 2, time running downwards:
        Site 1: (app closes) send FIN seq=x     ->  Site 2: rcv FIN segment
        Site 1: rcv ACK segment                 <-  Site 2: send ACK x+1 (inform app)
        Site 1: rcv FIN + ACK seg               <-  Site 2: (app closes connection) send FIN seq=y, ACK x+1
        Site 1: send ACK y+1                    ->  Site 2: receive ACK segment]

  • *More Information
    Lectures, tutorials etc:
        www.nv.cc.va.us/home/joney/tcp_ip.htm
        www.cs.pdx.edu/~jrb/tcpip.lectures.html
        www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
        www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
        www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
        www.jbmelectronics.com/tcp.htm
    Encyclopaedia
        http://www.freesoft.org/CIE/index.htm
    TCP/IP Resources
        www.private.org.il/tcpip_rl.html
    Understanding IP addresses
        http://www.3com.com/solutions/en_US/ncs/501302.html
    Configuring TCP (RFC 1122)
        ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
    Assigned protocols, ports etc (RFC 1010)
        http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols

  • *Example: 3 way handshake
    atlas> telnet sunstats.cern.ch
    atlas is a WNT PC, sunstats is a Sun Solaris 5.6 host
    MSS is set in a TCP option in a SYN segment; it communicates the MSS the sender wants to receive
    len=ip_hlen/tcp_hlen:ip_total_len
    Initial Sequence Numbers are randomly selected
    Telnet = port 23
    W = receive window size, advertises how much data this host will accept

  • *Example: 3 way handshake - cont.
    TCP from atlas:1174 to sunstats:23 seq=180839, A=0, W=8192, SYN [len=5/6:44, opt=020405B4]
    TCP from sunstats:23 to atlas:1174 seq=1383568304, A=180840, W=64240, SYN/ACK [len=5/6:44, opt=020405B4]
    TCP from atlas:1174 to sunstats:23 seq=180840, A=1383568305, W=8760 [len=5/5:40, opt=nul]
    Notice window size can vary from segment to segment depending on buffer space available
    Notice smaller PC window advertisement
    Notice ephemeral port selected by telnet client
    Notice acknowledge next expected byte (=seq+1)
    0x020405B4: 02 = option type, 04 = len, 0x5B4 = 1460
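    The option bytes in the trace can be decoded mechanically. The following Python sketch walks a TCP options field as a list of kind/length/value entries; the function name parse_tcp_options is an illustrative assumption, and only the end-of-list (0), no-operation (1) and generic kind/length cases are handled.

```python
def parse_tcp_options(raw: bytes) -> list:
    """Return a list of (kind, length, value_bytes) for each TCP option."""
    options = []
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:              # end of option list
            break
        if kind == 1:              # no-operation, a single padding byte
            i += 1
            continue
        if i + 1 >= len(raw):      # truncated option, stop parsing
            break
        length = raw[i + 1]        # length covers the kind and length bytes too
        if length < 2 or i + length > len(raw):
            break                  # malformed option, stop parsing
        options.append((kind, length, raw[i + 2:i + length]))
        i += length
    return options

# The option field from the SYN segments in the example above
options = parse_tcp_options(bytes.fromhex("020405B4"))
kind, length, value = options[0]
mss = int.from_bytes(value, "big")
print(kind, length, mss)   # 2 4 1460
```

    An MSS of 1460 is what remains of a 1500 byte MTU after a 20 byte IP header and a 20 byte TCP header.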

  • *Session start
    SLAC>CERN: 256kbyte window, 1 stream, full speed > 30msec, 13MBytes in 20s, 5.1MBytes/s
    [Figure legend: Rcvr advertised window; ACKs returned by Rcvr; segments sent; congestion window]

    **How do we measure the QoS
        Introduction to PingER and active end-to-end measurement methodology
        Problem areas illustrated by results from PingER:
            Generally, e.g. S. America, Spain, China, Germany to .edu & .ca
            How do E. Europe & Russia look?
    How does performance affect applications
        Validating ping measurements and impact on FTP & Web performance
        Overview of impact of performance on applications including email, web, FTP, interactive apps
        Detailed look at bulk data transfer expectations for HENP sites
        Detailed look at critical performance metrics (RTT, loss, jitter, availability) and impact on VoIP
    What can be done to improve QoS:
        More bandwidth
        Reserved bandwidth
        Differentiated services

    **Need routing to get message back to origin

    *The address range from 0.0.0.0 through 0.255.255.255 should not be considered part of the normal Class A range. 0.x.x.x addresses serve no particular function in IP, but nodes attempting to use them will be unable to communicate properly on the Internet.
    127.0.0.1 is the loopback test mechanism of network adapters. Messages sent to 127.0.0.1 do not get delivered to the network. Instead, the adapter intercepts all loopback messages and returns them to the sending application. IP applications often use this feature to test the behavior of their network interface.
    127.0.0.0 through 127.255.255.255 are reserved for loopback; 224-239 are used for multicast (see http://www.firewall.cx/multicast-intro.php, also Google IGMP & PIM).
    The range of addresses between 224.0.0.0 and 224.0.0.255, inclusive, is reserved for the use of routing protocols and other low-level topology discovery or maintenance protocols, such as gateway discovery and group membership reporting. Multicast routers should not forward any multicast datagram with destination addresses in this range, regardless of its TTL.
    255.0.0.0 through 255.255.255.255 reserved for IP broadcast

    *Class B addresses require demonstrated need: subnetting plan for > 32 subnets, > 4096 hosts
        192-193 Multiregional
        194-195 Europe
        196-197 Others
        198-199 N. America
        200-201 Central/South America
        202-203 Pacific Rim
        204-207 Reserved

    APNIC = Asia Pacific Network Information Centre
    ARIN = American Registry for Internet Numbers
    RIPE NCC = Réseaux IP Européens Network Coordination Centre

    *Must adhere to:
        Cannot be referenced by hosts in another organization
        Cannot be defined to any external router
        Cannot be advertised addresses, and cannot forward IP datagrams containing those addresses to external routers
        External routers will quietly discard all routing information regarding these addresses.
    Multicast is in the class D range 224.0.0.0 to 239.255.255.255 or 224.0.0.0/4

    *Removes the address class A, B, C boundaries. These are called classful networks.

    *Why restrict communication to the original source: the datagram only contains the original source & ultimate destination; it does not contain the complete itinerary of the route taken. Since routing is dynamic, one cannot know which path a datagram has used or will use.

    *Use the Internet to find out what PING stands for.

    **Low overhead since: no set up or tear down, deals with only one datagram at a time