Post on 19-Jan-2016
Chapter 5: Transport Layer
Computer Networks: An Open Source Approach
Content
5.1 General Issues: Port-Multiplexing, Reliability, Flow/Congestion Control
5.2 UDP - Unreliable Connectionless Transfer: Port-Multiplexing
5.3 TCP - Reliable Connection-Oriented Transfer: Connection Management, Reliability, Flow Control, Performance Enhancements
5.4 Socket Programming Interface
5.5 Real-time Transport (RTP & RTCP)
5.6 Summary
5.1 General Issues
End-to-End Communication Channel
Data Integrity
Flow Control
Socket Programming Interface
5.1 General Issues: End-to-End Communication Channel and Port-Multiplexing
Port: communication end point
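[Figure: node-to-node channel vs. end-to-end channel. Left: LAN host 1 and LAN host 2 share a multi-access channel at the MAC layer (condensed delay distribution). Right: IP host 1 and IP host 2, each running application processes AP1 and AP2 over TCP/UDP and IP, communicate across an IP network (loose delay distribution).]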
General Issues: Direct-Linked vs. End-to-End
Item | Direct-Linked Protocol Layer | End-to-End Protocol Layer
based on what services? | physical layer | internetworking layer
addressing | node-to-node channel within a link (MAC address) | process-to-process channel between hosts (port number)
error checking | per-frame | per-segment
data reliability | per-link | per-flow
flow control | per-link | per-flow
channel delay | condensed distribution | loose distribution
Note: per-frame integrity, such as in Ethernet
  Collision: can be detected, and the frame is retransmitted
  CRC/alignment error: recovery can only rely on upper-layer protocols
Open Source Implementation 5.1: an incoming packet in the transport layer
Copyright Reserved 2009
Network layer
raw_v4_input(skb)
raw_rcv(sk,skb)
raw_rcv_skb(sk,skb)
__skb_queue_tail(sk->sk_receive_queue, skb)
sk->sk_data_ready
udp_rcv(skb)
sk=udp_v4_lookup(skb)
udp_queue_rcv_skb(sk,skb)
socket_queue_rcv_skb(sk,skb)
udp_recvmsg (sk,buf)
skb_recvdatagram
skb=skb_dequeue
tcp_v4_do_rcv(sk,skb)
tcp_rcv_established
tcp_data_queue(sk,skb)
tcp_v4_rcv(skb)
tcp_recvmsg(sk,buf)
skb_copy_datagram_iovec
raw_recvmsg (sk, buf)
sk=__raw_v4_lookup(skb) sk=inet_lookup(skb)
sk_receive_queue
RAW UDP TCP
Transport layer
ip_local_deliver_finish
read
sys_read
do_sock_read
sock_recvmsg
Application layer
vfs_read
do_sync_read
sock_aio_read
recvfrom
sys_socketcall
sys_recvfrom
sock_recvmsg
__sock_recvmsg
sock_common_recvmsg
Open Source Implementation 5.1: an outgoing packet in the transport layer
write
sys_write
do_sock_write
sock_sendmsg
udp_sendmsg(sk,buf)
raw_sendmsg(sk,buf)
tcp_sendmsg(sk,buf)
skb_queue_tail(&sk->sk_write_queue,skb)
ip_append_data(sk, buf)
skb=sock_alloc_send_skb(sk)
ip_generic_getfrag
sk_write_queue
ip_push_pending_frames
ip_queue_xmit
dst_output
skb->dst->output
ip_output
Transport layer
Network layer
Application layer
__tcp_push_pending_frames
tcp_transmit_skb
vfs_write
do_sync_write
sock_aio_write
inet_sendmsg
tcp_push
tcp_write_xmit
sendto
sys_socketcall
sys_sendto
sock_sendmsg
inet_sendmsg
skb=sock_wmalloc(sk)
udp_push_pending_frames
5.2 UDP – Unreliable Connectionless Transfers
Objectives
Header Format
Unicast Real-time Applications Using UDP
5.2 UDP – For Unreliable Connectionless Transfers
Objectives
  Port-Multiplexing
  Per-Segment Error Checking: Checksum
Header Format
Carrying Unicast/Multicast Real-Time Traffic
  Retransmission is meaningless: no per-flow integrity needed
  Bit rate is determined by the codec used: no flow control needed
IP Networks TCPTCP
AP1 AP2 AP1 AP2
IP host 1 IP host 2
UDP header (8 bytes), two 32-bit rows with bit positions 0-15 and 16-31:
  source port number (bits 0-15) | destination port number (bits 16-31)
  UDP length (bits 0-15) | UDP checksum, optional (bits 16-31)
  data (if any)
Open Source Implementation 5.2: UDP and TCP Checksum
Checksum in TCP/IP
In Linux 2.6:
IP Header TCP/UDP Header Application Data
csum=csum_partial(D,lenD,0)
tcp_v4_check(T, lenT, SA, DA, csum)
csum=csum_partial(T,lenT,csum)
ip_send_check(iph)
Pseudo Header
th->check = tcp_v4_check(len, inet->saddr, inet->daddr,csum_partial((char *)th,th->doff << 2,skb->csum));
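The chained csum_partial() calls above accumulate a ones' complement sum over the application data, the TCP/UDP header, and the pseudo header before the result is folded into 16 bits. A minimal user-space sketch of that idea follows. The helper names csum_add and csum_fold are hypothetical and are not the kernel's functions, and chaining is only valid when every chunk except the last has an even length.

```c
#include <stddef.h>
#include <stdint.h>

/* Add the 16-bit big-endian words of data[0..len) to a running sum.
 * Hypothetical helper: it mirrors the role of csum_partial(), not its code. */
uint32_t csum_add(const uint8_t *data, size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)                      /* pad a trailing odd byte with zero */
        sum += (uint32_t)data[len - 1] << 8;
    return sum;
}

/* Fold the carries back into 16 bits and take the ones' complement. */
uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Calling csum_add() first on the payload and then on the header, passing the previous result as the starting sum, gives the same checksum as summing the whole buffer in one pass.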
5.3 TCP – Reliable Connection-Oriented Transfers
Objectives
Connection Management
Per-Flow Data Integrity
Per-Flow Flow/Congestion Control
Performance Problems and Enhancements
5.3 TCP – For Reliable Connection-Oriented Transfers
Objectives
  Port-Multiplexing: same as UDP
  Per-Flow Reliability
  Per-Flow Flow Control
Connection Management
  Connection Establishment/Disconnection & State Transitions
Per-Flow Data Integrity
  Per-Segment Checksum & Per-Flow ACKs
Per-Flow Flow/Congestion Control
Performance
  Interactive vs. Bulk-Data Transfers
Stateful (Ch1): requires connection management
TCP Connection Management
Establishment/Termination – 3-Way Handshake Protocol
Establishment:
  1. client -> server: SYN (seq=x)
  2. server -> client: SYN (seq=y), ACK of SYN (ack=x+1)
  3. client -> server: ACK of SYN (seq=x+1, ack=y+1)
Termination:
  1. client -> server: FIN
  2. server -> client: ACK of FIN
  3. server -> client: FIN
  4. client -> server: ACK of FIN
TCP State Transition Diagram
[Figure: TCP state transition diagram. The labels recovered from the figure are listed below.]
States: CLOSED, LISTEN, SYN_RCVD, SYN_SENT, ESTABLISHED (data transfer state), CLOSE_WAIT, LAST_ACK, FIN_WAIT_1, FIN_WAIT_2, CLOSING, TIME_WAIT
Paths: active open (client), passive open (server), simultaneous open, active close, passive close, simultaneous close
Transition labels (event / action):
  app: passive open / send: nothing
  app: active open / send: SYN
  app: send data / send: SYN
  recv: SYN / send: SYN, ACK
  recv: SYN, ACK / send: ACK
  recv: RST
  app: close or timeout
  app: close / send: FIN
  recv: ACK / send: nothing
  recv: FIN / send: ACK
  recv: FIN, ACK
  timeout / send: RST
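State Transitions: Establishment
  client: CLOSED -> SYN_SENT -> ESTABLISHED
  server: LISTEN -> SYN_RCVD -> ESTABLISHED
  1. client -> server: SYN (seq=x); client enters SYN_SENT
  2. server -> client: SYN (seq=y), ACK of SYN (ack=x+1); server enters SYN_RCVD
  3. client -> server: ACK of SYN (seq=x+1, ack=y+1); both ends are then ESTABLISHED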
State Transitions: Termination
  client: ESTABLISHED -> FIN_WAIT_1 -> FIN_WAIT_2 -> TIME_WAIT -> (2MSL timeout) -> CLOSED
  server: ESTABLISHED -> CLOSE_WAIT -> LAST_ACK -> CLOSED
  1. client -> server: FIN
  2. server -> client: ACK of FIN
  3. server -> client: FIN
  4. client -> server: ACK of FIN
State Transitions: Simultaneous Open/Close
(a) state transitions in simultaneous open
  Both ends send a SYN (seq=x and seq=y) before receiving the other's SYN, then acknowledge it (ack=y+1 and ack=x+1).
  each end: SYN_SENT -> SYN_RCVD -> ESTABLISHED
(b) state transitions in simultaneous close
  Both ends send a FIN before receiving the other's FIN, then send an ACK of FIN.
  each end: ESTABLISHED -> FIN_WAIT_1 -> CLOSING -> TIME_WAIT -> (2MSL timeout) -> CLOSED
State Transitions : Loss in Establishment
SYN (seq=x)
client server
CLOSED LISTEN
CLOSED
SYN_SENT
timeout
SYN (seq=x)
ACK of SYN (ack=x+1)SYN (seq=y)
client server
CLOSED LISTEN
CLOSED
SYN_SENT
SYN_RCVD
CLOSED
timeout
timeout
CLOSED
SYN (seq=x)
ACK of SYN (ack=x+1)SYN (seq=y)
(seq=x+1)ACK of SYN (ack=y+1)
client server
CLOSED LISTEN
SYN_SENT
SYN_RCVD
ESTABLISHED
CLOSED
CLOSED
timeout
LISTEN
LISTEN
(a) SYN sent by the client is lost
(b) SYN sent by the server is lost
(c) ACK of SYN sent by the client is lost
RST
RST
TCP State Transition Implementation
In “sock” structure
State names
Reliability of Data Transfers
Definition: Data Reliability vs. Data Integrity
  Data Integrity: successfully received packets are exactly the same as they were transmitted.
  Data Reliability: every transmitted packet is successfully received and is exactly the same as the original transmitted one.
TCP
  Per-Segment Integrity: Checksum
  Per-Flow Reliability: Sequence Number & ACK
Per-Flow Data Reliability: Sequence Number & Acknowledgement
ACK every successfully received data segment
Segment sent but not ACKed?
  Dropped by some intermediate router: insufficient buffer, forced drop
  Dropped by the receiver: wrong checksum
Retransmitting Lost Packets
  When to retransmit? Which packet?
Data Reliability: Cumulative ACK (1/2)
[Figure: two client/server timelines, (a) packet loss and (b) delay. Segments shown: DATA(Seq=100, Len=50), DATA(Seq=150, Len=30), ACK(Ack=100), ACK(Ack=150), ACK(Ack=180). In both cases the client retransmits after a timeout; duplicate data arriving at the server is dropped.]
Data Reliability: Cumulative ACK (2/2)
[Figure: two client/server timelines, (c) ACK loss and (d) out-of-sequence. Segments shown: DATA(Seq=100, Len=50), DATA(Seq=150, Len=30), ACK(Ack=100), ACK(Ack=150), ACK(Ack=180). In (c) a lost ACK(Ack=150) is covered by the later cumulative ACK(Ack=180).]
Pseudo Code of Sliding Window in the Sender
SWS: send window size.
n: current sequence number, i.e., the next packet to be transmitted.
LAR: last acknowledgment received.

if the sender has data to send
    Transmit up to SWS packets ahead of the latest acknowledgment LAR,
    i.e., it may transmit packet number n as long as n < LAR+SWS.
endif
if an ACK arrives
    Set LAR as ack num if its ack num > LAR.
endif
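The pseudo code above can be sketched in C as follows. This is a minimal illustration under simplifying assumptions: packets are numbered rather than bytes, sequence numbers never wrap around, and the struct and function names (sw_sender, sw_send_burst, sw_on_ack) are hypothetical.

```c
/* Sender-side sliding window state, using the names from the pseudo code. */
struct sw_sender {
    unsigned sws;   /* SWS: send window size, in packets */
    unsigned lar;   /* LAR: last acknowledgment received */
    unsigned n;     /* n: next packet number to be transmitted */
};

/* Packet n may be transmitted as long as n < LAR + SWS. */
int sw_can_send(const struct sw_sender *s)
{
    return s->n < s->lar + s->sws;
}

/* Transmit as many of the `pending` packets as the window allows.
 * Returns the number of packets actually sent. */
unsigned sw_send_burst(struct sw_sender *s, unsigned pending)
{
    unsigned sent = 0;

    while (pending > 0 && sw_can_send(s)) {
        s->n++;          /* a real sender would put packet n on the wire here */
        pending--;
        sent++;
    }
    return sent;
}

/* An ACK only moves the window forward; stale ACKs are ignored. */
void sw_on_ack(struct sw_sender *s, unsigned ack_num)
{
    if (ack_num > s->lar)
        s->lar = ack_num;
}
```

With SWS=4 and LAR=0 the sender can emit packets 0 to 3 and then stalls until an ACK raises LAR.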
Per-Flow Flow/Congestion Control: Sliding Window
[Figure: the sender's sending stream and the receiver's receiving byte stream on either side of a network pipe (size = DATA 4~8). DATA 6, 7, 8 travel towards the receiver while ACKs (Ack=5, Ack=6) travel back. Segments behind the window are sent & ACKed; segments ahead of it are to be sent when the window moves (slides). SND_UNA and SND_NXT are marked on the sending stream.]
TCP Window Size = min(RWND, CWND)
Sender: should maintain a retransmission queue in case of packet loss
Receiver: should maintain an out-of-order queue to re-sequence data before returning it to the application
Sliding Window: Normal Case (1/2)
[Figure: sender, network pipe, and receiver; TCP window size = min(RWND, CWND). (a) Original condition: DATA 6, 7, 8 and ACKs (Ack=5, Ack=6) are in the pipe. (b) Sender receives ACK(Ack=5): the window slides; DATA 7, 8, 9 and ACKs (Ack=6, Ack=7) are in the pipe.]
Sliding Window: Normal Case (2/2)
[Figure: (c) Sender receives ACK(Ack=6): the window slides; DATA 8, 9, 10 and ACKs (Ack=7, Ack=8) are in the pipe. (d) Sender receives ACK(Ack=7): the window slides; DATA 9, 10, 11 and ACKs (Ack=8, Ack=9) are in the pipe.]
Sliding Window: Out-of-Sequence (1/2)
[Figure: sender, network pipe, and receiver; TCP window size = min(RWND, CWND). (a) Original condition: DATA 4 is still outstanding while later segments (DATA 7, DATA 8) are in the pipe, so the receiver keeps returning ACK(Ack=4). (b) Sender receives ACK(Ack=4) of DATA 5: the window does not slide.]
Sliding Window: Out-of-Sequence (2/2)
[Figure: (c) Sender receives ACK(Ack=4) of DATA 6: the window still does not slide; ACKs (Ack=7, Ack=8) are on their way back. (d) Sender receives ACK(Ack=7): the window slides past segments 4-6 at once; DATA 10, 11 and ACKs (Ack=8, Ack=9) are in the pipe.]
Per-Flow Flow/Congestion Control: Opening & Shrinking of Window Size
[Figure: the TCP window, of size min(RWND, CWND), laid over the byte stream, with its edge movements labelled Open, Shrink, and Close.]
Retransmitting Lost Packets
Retransmit which packet?
  Fast Retransmit
  Towards better accuracy: TCP SACK option
  Further refinement: FACK (based on SACK)
When to retransmit?
  Fast Retransmit: same as above
  Retransmission Timeout (RTO)
    Round-Trip Time (RTT) measurement
    Tradeoff: RTT vs. RTO
    Karn's Algorithm
    Towards better RTO: TCP Timestamp option
Retransmit Which Packet?
Fast Retransmit
  Causes of duplicate ACKs: packet reordering, packet loss, Internet route change
  TCP receiver: ACKs the first "hole"
  Triple Duplicate ACKs (TDA): 4 identical ACKs (ACK field = X)
  TCP sender: infers TDA as congestion, retransmits the packet with SeqNum = X, and halves its sending rate
[Figure: timeline at the receiver. DATA 2, 3, 6, 7, 8 arrive; the receiver returns ACKs 3, 4, 4, 4, 4.]
When to Retransmit?
Retransmission TimeOut (RTO)
  Round-Trip Time (RTT) measurement vs. RTO
    RTT: varies dramatically
    Smoothed RTT (SRTT): exponential weighted moving average
    Mdev: mean deviation of RTT
    RTO = SRTT + 4*Mdev
  Karn's Algorithm
    Don't update the RTT estimate (and hence the RTO) from segments that were retransmitted
TCP RTT Estimator
Fast estimator by Van Jacobson ('88 & '90)
  srtt (smoothed RTT) is kept as 8 times the RTT
  mdev is kept as 4 times the real mean deviation
In tcp_input.c from Linux 2.6:
exponential weighted moving average
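The scaled bookkeeping described above (srtt stored as 8 times the smoothed RTT, mdev as 4 times the mean deviation) can be sketched as follows. This follows the structure of Jacobson's published fixed-point estimator and is not a copy of the Linux routine; the names rtt_est, rtt_update, and rtt_rto are hypothetical, and the first-sample initialization (mdev set to half the sample) is an assumption.

```c
/* Scaled RTT estimator state. Time unit is arbitrary (e.g. milliseconds). */
struct rtt_est {
    long srtt;  /* 8 x smoothed RTT */
    long mdev;  /* 4 x mean deviation */
};

/* Feed one RTT measurement m into the estimator. */
void rtt_update(struct rtt_est *e, long m)
{
    if (e->srtt == 0) {          /* first sample (assumed initialization) */
        e->srtt = m << 3;        /* SRTT = m */
        e->mdev = m << 1;        /* Mdev = m/2, stored as 4 x Mdev */
        return;
    }
    m -= e->srtt >> 3;           /* m is now the error: sample - SRTT */
    e->srtt += m;                /* SRTT = 7/8 SRTT + 1/8 sample */
    if (m < 0)
        m = -m;
    m -= e->mdev >> 2;           /* |error| - Mdev */
    e->mdev += m;                /* Mdev = 3/4 Mdev + 1/4 |error| */
}

/* RTO = SRTT + 4*Mdev; mdev already carries the factor of 4. */
long rtt_rto(const struct rtt_est *e)
{
    return (e->srtt >> 3) + e->mdev;
}
```

Keeping the scaled values means both averages are updated with one subtraction, one shift, and one addition, with no multiplication or division.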
Per-Flow Flow/Congestion Control: How Fast to Send?
Fast sender vs. slow receiver
  How to know? Feedback: RWND (Receiver Advertised Window) carried in ACKs by the receiver
Fast sender vs. congested network
  How to know? Feedback: loss events in the network; re-adjust CWND (Congestion Window)
How fast? Satisfy both: min(RWND, CWND)
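TCP Tahoe Congestion Control
[Figure: state transition diagram with the states Slow start, Congestion avoidance, Fast retransmit, and Retransmission timeout]
  start -> Slow start
  Slow start, on ACK: cwnd = cwnd + 1; send packet
  Slow start -> Congestion avoidance: cwnd ≥ ssth
  Congestion avoidance, on ACK: cwnd = cwnd + 1/cwnd; send data packet
  Slow start or Congestion avoidance -> Fast retransmit: ≥ 3 duplicate ACKs; send missing packet, ssth = cwnd/2, cwnd = 1
  any state -> Retransmission timeout: timeout; cwnd = 1
  Retransmission timeout -> Slow start: all ACKed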
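The Tahoe window rules can be sketched as a small event handler. This is an illustration only: cwnd is counted in segments, the type and function names are hypothetical, and halving ssth on a timeout as well as on triple duplicate ACKs is an assumption (the diagram shows only cwnd=1 for the timeout).

```c
enum tahoe_event { TAHOE_NEW_ACK, TAHOE_TRIPLE_DUP_ACK, TAHOE_TIMEOUT };

struct tahoe {
    double cwnd;      /* congestion window, in segments */
    double ssthresh;  /* slow start threshold (ssth), in segments */
};

void tahoe_on_event(struct tahoe *t, enum tahoe_event ev)
{
    switch (ev) {
    case TAHOE_NEW_ACK:
        if (t->cwnd < t->ssthresh)
            t->cwnd += 1.0;            /* slow start: +1 per ACK */
        else
            t->cwnd += 1.0 / t->cwnd;  /* congestion avoidance: +1 per RTT */
        break;
    case TAHOE_TRIPLE_DUP_ACK:         /* fast retransmit, then slow start */
    case TAHOE_TIMEOUT:                /* assumption: same window reaction */
        t->ssthresh = t->cwnd / 2.0;
        t->cwnd = 1.0;
        break;
    }
}
```

Starting from cwnd=8, a triple duplicate ACK leaves ssth=4 and cwnd=1; three further ACKs bring cwnd back to 4, after which growth becomes linear.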
Slow Start & Congestion Avoidance
[Figure: packet exchanges between source and destination during Slow Start (left) and Congestion Avoidance (right)]
An example: TCP Tahoe (1/2)
(S = sender, D = destination)
(1) cwnd=8, awnd=8. Sender sent segments 31-38.
(2) cwnd=8, awnd=8. Receiver replied with seven duplicate ACKs of segment 30.
An example: TCP Tahoe (2/2)
(3) cwnd=1, awnd=8. Sender received three duplicate ACKs and cwnd was changed to 1 packet. The lost segment 31 was retransmitted. Sender exited fast retransmit and entered the slow start state.
(4) cwnd=1, awnd=8. Receiver replied with the ACK of segment 38 when it received the retransmitted segment 31.
(5) cwnd=1, awnd=1. Sender sent segment 39.
TCP Reno Congestion Control (RFC 2581)
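[Figure: state transition diagram with the states Slow start, Congestion avoidance, Fast retransmit, Fast recovery, and Retransmission timeout]
  start -> Slow start
  Slow start, on ACK: cwnd = cwnd + 1; send packet
  Slow start -> Congestion avoidance: cwnd ≥ ssth
  Congestion avoidance, on ACK: cwnd = cwnd + 1/cwnd; send data packet
  Slow start or Congestion avoidance -> Fast retransmit: ≥ 3 duplicate ACKs (ACK = x); ssth = cwnd/2, cwnd = ssth, send missing packet
  Fast retransmit -> Fast recovery
  Fast recovery, on duplicate ACK: cwnd = cwnd + 1; send data packet
  Fast recovery -> Congestion avoidance: non-duplicate ACK > x; cwnd = ssth
  any state -> Retransmission timeout: timeout; cwnd = 1
  Retransmission timeout -> Slow start: all ACKed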
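A compact sketch of the Reno window rules, including fast recovery, is given below. The names are hypothetical and cwnd is counted in segments. On the third duplicate ACK the sketch sets cwnd to ssth+3, which is an assumption taken from the (8/2)+3 arithmetic of the worked Reno example rather than from the cwnd=ssth label in the diagram.

```c
enum reno_state { RENO_SLOW_START, RENO_CONG_AVOID, RENO_FAST_RECOVERY };
enum reno_event { RENO_NEW_ACK, RENO_DUP_ACK, RENO_TRIPLE_DUP_ACK, RENO_TIMEOUT };

struct reno {
    enum reno_state state;
    double cwnd;      /* congestion window, in segments */
    double ssthresh;  /* slow start threshold (ssth), in segments */
};

void reno_on_event(struct reno *r, enum reno_event ev)
{
    switch (ev) {
    case RENO_NEW_ACK:
        if (r->state == RENO_FAST_RECOVERY) {
            r->cwnd = r->ssthresh;         /* deflate the window */
            r->state = RENO_CONG_AVOID;
        } else if (r->state == RENO_SLOW_START) {
            r->cwnd += 1.0;
            if (r->cwnd >= r->ssthresh)
                r->state = RENO_CONG_AVOID;
        } else {
            r->cwnd += 1.0 / r->cwnd;
        }
        break;
    case RENO_DUP_ACK:                     /* further duplicates in recovery */
        if (r->state == RENO_FAST_RECOVERY)
            r->cwnd += 1.0;
        break;
    case RENO_TRIPLE_DUP_ACK:              /* fast retransmit */
        if (r->state != RENO_FAST_RECOVERY) {
            r->ssthresh = r->cwnd / 2.0;
            r->cwnd = r->ssthresh + 3.0;   /* assumption: ssth + 3 dup ACKs */
            r->state = RENO_FAST_RECOVERY;
        }
        break;
    case RENO_TIMEOUT:
        r->ssthresh = r->cwnd / 2.0;
        r->cwnd = 1.0;
        r->state = RENO_SLOW_START;
        break;
    }
}
```

Starting from cwnd=8, the sequence triple duplicate ACK, four more duplicate ACKs, one new ACK yields cwnd 7, then 11, then 4.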
An example: TCP Reno
(3) cwnd=7, awnd=8. Sender received three duplicate ACKs and cwnd was changed to (8/2)+3 packets. The lost segment 31 was retransmitted. Sender exited fast retransmit and entered the fast recovery state.
(4) cwnd=11, awnd=8->11. Receiver replied with the ACK of segment 38 when it received the retransmitted segment 31. Meanwhile the remaining duplicate ACKs of segment 30 let the sender send segments 39-41.
(5) cwnd=4, awnd=3->4. Sender exited fast recovery and entered the congestion avoidance state. cwnd was changed to 4 segments, and segment 42 was sent.
Open Source Implementation 5.4: TCP Slow Start and Congestion Avoidance
    if (tp->snd_cwnd <= tp->snd_ssthresh) {
        /* Slow start */
        if (tp->snd_cwnd < tp->snd_cwnd_clamp)
            tp->snd_cwnd++;
    } else {
        if (tp->snd_cwnd_cnt >= tp->snd_cwnd) {
            /* Congestion Avoidance */
            if (tp->snd_cwnd < tp->snd_cwnd_clamp)
                tp->snd_cwnd++;
            tp->snd_cwnd_cnt = 0;
        } else
            tp->snd_cwnd_cnt++;
    }
Principle in Action: TCP Congestion Control Behaviors
[Figure: cwnd over time, annotated with slow-start, congestion avoidance, triple-duplicate ACKs, fast retransmit, fast recovery, pipe limit, and ssth reset]
TCP Header Format
Each row is 32 bits wide (bit positions 0-31); the fixed part of the header is 20 bytes.
  source port number (16 bits) | destination port number (16 bits)
  32-bit sequence number
  32-bit acknowledgement number
  header length (4 bits) | 6-bit reserved | flags U A P R S F | window size (16 bits)
  TCP checksum (16 bits) | urgent pointer (16 bits)
  options (if any)
  data
TCP Options
  End of option list: kind=0
  No operation: kind=1
  Maximum segment size: kind=2, len=4, maximum segment size (MSS)
  Window scale factor: kind=3, len=3, shift count
  Timestamp: kind=8, len=10, timestamp value, timestamp echo reply
TCP Options
End of Option List: kind=0 (1 byte)
  As the name suggests
No Operation: kind=1 (1 byte)
  Pads fields to a multiple of 4 bytes
Maximum Segment Size: kind=2 (1 byte), len=4 (1 byte), maximum segment size (MSS, 2 bytes)
  Announces the maximum segment size during the 3-way handshake
TCP Options (Window Scale Factor, RFC 1323)
Issue: the window is too small in Gigabit networks, which limits throughput
Solution: negotiate a shifting factor for the window
  Negotiated during the 3-way handshake: SYN with the window scale option, then SYN+ACK with the window scale option
  Shift up to 14 bits (the maximum window grows from 2^16 to 2^16 x 2^14)
When this option is not used:
  Linux does not advertise a window over 2^15, to avoid problems with other stacks that treat the window as a signed value (include/net/tcp.h)
Window scale factor: kind=3 (1 byte), len=3 (1 byte), shift count (1 byte)
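As a small worked example of the shift count, the function below computes the window a peer may use from the 16-bit window field and the negotiated shift. The function name is hypothetical; the cap of 14 is the limit stated above.

```c
#include <stdint.h>

/* Effective window = 16-bit advertised window shifted left by the
 * negotiated shift count, which is capped at 14. */
uint32_t scaled_window(uint16_t advertised, unsigned shift)
{
    if (shift > 14)
        shift = 14;
    return (uint32_t)advertised << shift;
}
```

With the largest field value 65535 and the largest shift 14, the window reaches 1,073,725,440 bytes, just under 2^30.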
TCP Options – Timestamp
Mission 1 – Improving RTT measurement
  Receiver: copies and echoes the timestamp
    Delayed ACK
  Sender: always updates RTT when seeing a timestamp
Mission 2 – Protecting against wrapped sequence numbers
  Avoids accepting old segments in high-speed networks
How to enable the timestamp option?
  3-way handshake: timestamp in SYN, timestamp in its ACK
Timestamp: kind=8 (1 byte), len=10 (1 byte), timestamp value (4 bytes), timestamp echo reply (4 bytes)
TCP Timer Management in Linux
Retransmit Timer: to start retransmitting
Persist Timer: to prevent deadlocks
Keepalive Timer (non-standard): to clean up redundant TCP states
Functions of All TCP Timers
Name | Function
connection timer | To establish a new TCP connection, a SYN segment is sent. If no response to the SYN segment is received within the connection timeout, the connection is aborted.
retransmission timer | TCP retransmits the data if the data is not acknowledged and this timer expires.
delayed ACK timer | The receiver must wait until the delayed ACK timeout to send the ACK. If there is data to send during this period, it sends the ACK piggybacked on the data.
persist timer | A deadlock problem is solved by the sender sending periodic probes after the persist timer expires.
keepalive timer | If the connection is idle for a few hours, the keepalive timeout expires and TCP sends probes. If no response is received, TCP assumes that the other end has crashed.
FIN_WAIT_2 timer | This timer avoids leaving a connection in the FIN_WAIT_2 state forever if the other end has crashed.
TIME_WAIT timer | The timer is used in the TIME_WAIT state to enter the CLOSED state.
Open Source Implementation 5.5: TCP Retransmit Timer
Approximating RTT
Linux provides good retransmit timer granularity
  Just like other timers
BSD-derived UNIXes have coarse granularity
  To minimize timer overhead, they check whether data has been ACKed every 500 ms
  RTT is over-estimated
  RTO is then over-estimated
  Slow packet retransmission when a loss is not recovered by Fast Retransmit
Open Source Implementation 5.6: TCP Persistent (or Probe) Timer
When RWND=0 and the next window update (RWND>0) is lost
  Deadlock occurs
    Sender waits for RWND>0 (window update)
    Receiver waits for new data
Solution
  Sender emits one byte of data to probe
Persist timer
  Use RTO with binary exponential backoff until 120 seconds
  tcp_output.c (Linux 2.6)
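The backoff rule for the persist timer can be illustrated with a few lines of C. This is a sketch of the stated rule only (RTO doubled per unanswered probe, capped at 120 seconds); the function name and the use of whole seconds are assumptions, and it is not the code in tcp_output.c.

```c
#define PERSIST_MAX_SEC 120u

/* Interval before the next window probe: the RTO doubled once per
 * unanswered probe (backoff), never exceeding 120 seconds. */
unsigned persist_interval(unsigned rto_sec, unsigned backoff)
{
    unsigned interval = rto_sec;

    while (backoff > 0 && interval < PERSIST_MAX_SEC) {
        interval *= 2;
        backoff--;
    }
    return interval < PERSIST_MAX_SEC ? interval : PERSIST_MAX_SEC;
}
```

For an RTO of 3 seconds the probe intervals are 3, 6, 12, 24, 48, 96 and then 120 seconds for every later probe.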
Open Source Implementation 5.6 (cont): TCP Keepalive Timer (Non-standard)
When no data is exchanged for a long time
  Connection timeout? Belongs to the application's preference
  The other end is dead?
Linux 2.6 implementation (tcp_timer.c)
  Call tcp_keepalive_timer() every 75 seconds
    Initialized by the af_inet init routine; searches every established TCP connection
  If dead & not rebooted => state cleared after 5 probes
  If dead & rebooted => state cleared after getting RST
Data Structures of TCP Connections in Linux
Important variables: include/net/sock.h
Summary: Properties of TCP
Per-Flow Reliability Through ACKs
Window-based Flow Control
Self-clocking using ACKs
TCP Performance
Interactive Connections
  Silly Window Syndrome
Bulk-Data Transfers
  ACK Compression Phenomenon
  Reno's Multiple Packet Loss Problem
TCP Performance Problems & Enhancement
Interactive Connections
  Silly Window Syndrome (SWS)
    Solution: Clark & Nagle
Bulk Data Transfers
  ACK Compression Phenomenon
    Possible solution: Paced TCP Sender
  Reno's Multiple Packet Loss Problem
    Solution: NewReno, SACK, FACK
TCP Performance Problems and Solutions
Transmission Style | Problem | Solution
Interactive connection | Silly window syndrome | Nagle, Clark
Bulk-data transfer | ACK compression | Zhang
Bulk-data transfer | Reno's MPL* problem | NewReno, SACK, FACK
*MPL stands for Multiple-Packet-Loss
Performance of Interactive Connections – Problems & Solutions
Problem: Silly Window Syndrome (SWS)
  Sender transmits small packets
  Receiver advertises small window
Solution
  Sender sends whenever any of the following holds:
    Data accumulated to a full-sized segment
    Data accumulated to ½ RWND
    Nagle's Algorithm disabled/not applied
  Receiver advertises window whenever either of the following holds:
    Buffer available for a full-sized segment
    Buffer available up to ½ of its buffer space
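The sender and receiver conditions above translate directly into two predicates. The sketch below is illustrative: the function names and parameters are hypothetical, sizes are in bytes, and "½ RWND" is taken to mean that the queued data is at least half of the receiver's advertised window.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sender side: send when a full-sized segment is queued, when the queued
 * data reaches half of RWND, or when Nagle's algorithm is not applied. */
bool sws_sender_may_send(size_t queued, size_t mss, size_t rwnd,
                         bool nagle_enabled)
{
    return queued >= mss ||
           (rwnd > 0 && 2 * queued >= rwnd) ||
           !nagle_enabled;
}

/* Receiver side: advertise a window only when the free buffer can hold a
 * full-sized segment or at least half of the total buffer space. */
bool sws_receiver_may_advertise(size_t free_buf, size_t mss, size_t buf_size)
{
    return free_buf >= mss || 2 * free_buf >= buf_size;
}
```

For a 320-byte receive buffer and a 1460-byte MSS, the receiver stays silent with 80 bytes free and advertises again once 160 bytes are free.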
SWS: Receiver Advertises Small Window
Initial RWND = 320; receiver buffer occupancy shown in the figure: 240/320, 220/320, 200/320
  1. Client -> Server: Data(Seq=1, Len=320)
  2. Server receives segment; sends ACK(Ack=321, RWND=80), reducing the window to 80
  3. Client -> Server: Data(Seq=321, Len=80)
  4. Server receives segment; sends ACK(Ack=401, RWND=40), reducing the window to 40
  5. Client -> Server: Data(Seq=401, Len=40)
  6. Server receives segment; sends ACK(Ack=441, RWND=30), reducing the window to 30
  • • •
Performance Enhancement of Interactive Connections – Nagle's Algorithm
To efficiently utilize the bandwidth resource
  TELNET: typing speed vs. available bandwidth
  When RTT is short (bandwidth may be sufficient)
    Inter-character spacing > RTT
    Only one outstanding packet per RTT => efficient
  When RTT is large (bandwidth may be insufficient)
    Inter-character spacing < RTT
    Multiple single-character packets per RTT => inefficient
    Nagle: don't send a small packet until the pipe is clean (keep only one packet in the pipe) => efficient
The beauty of Nagle
  When RTT is short, Nagle's Algorithm is rarely used
  When RTT is large, Nagle's Algorithm is often used
Performance of Bulk Data Transfers
Computing the performance through the Bandwidth Delay Product (BDP)
  Horizontal dimension: delay
  Vertical dimension: bandwidth
  Shaded area: packet size
  BDP = pipe size = Bandwidth x RTT
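A short worked example of the BDP formula follows. The function names and the choice of units (bits per second, microseconds, bytes) are assumptions made for illustration.

```c
#include <stdint.h>

/* BDP = bandwidth x RTT, returned in bytes.
 * bandwidth_bps is in bits per second, rtt_us in microseconds. */
uint64_t bdp_bytes(uint64_t bandwidth_bps, uint64_t rtt_us)
{
    return bandwidth_bps * rtt_us / 8 / 1000000;
}

/* Number of whole segments of mss bytes needed to fill the pipe. */
uint64_t bdp_segments(uint64_t bandwidth_bps, uint64_t rtt_us, uint64_t mss)
{
    return bdp_bytes(bandwidth_bps, rtt_us) / mss;
}
```

A 100 Mbps path with a 50 ms RTT has a pipe size of 625,000 bytes, or 428 full segments of 1460 bytes, so a window smaller than that cannot keep the pipe full.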
Performance of Bulk Data Transfers: Filling the Network Pipe
Highest performance: the pipe is full
[Figure: a WAN pipe between the TCP sender and the TCP receiver, with one pipe for sending data packets and one pipe for replying ACKs]
Performance of Bulk Data Transfers: Steps of Filling the Pipe Using Congestion Avoidance
[Figure: 36 snapshots, numbered (1) to (36), of the pipe filling up as cwnd grows from 1 to 6]
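Performance of Bulk Data Transfers: Modeling TCP Throughput
Given RTT $t_{RTT}$, segment size $s$, loss rate $p$:
$$T(t_{RTT}, s, p) = \frac{c \cdot s}{t_{RTT}\sqrt{p}}$$
where $c$ is a constant value
Given additional information: max window size $W_m$, number of packets acknowledged per delayed ACK $b$, and RTO $t_{RTO}$:
$$T(t_{RTT}, t_{RTO}, s, p) = \min\left(\frac{W_m \cdot s}{t_{RTT}},\ \frac{s}{t_{RTT}\sqrt{\frac{2bp}{3}} + t_{RTO}\,\min\left(1,\ 3\sqrt{\frac{3bp}{8}}\right) p\,(1+32p^2)}\right)$$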
Problem of TCP Bulk Data Transfers: ACK-Compression Phenomenon
Bursty traffic when
  Simultaneous 2-way traffic
  Asymmetric path
No general solution
  Distributing a window of packets across an RTT may alleviate the phenomenon
[Figure: sender and receiver connected through a slow link; packets leave the slow link with proper spacing, so the returning ACKs have proper spacing]
Historical Evolution: Multiple-Packet-Loss Recovery in NewReno, SACK, FACK and Vegas
Solution (I) to TCP Reno's Problem: TCP NewReno
Solution (II) to TCP Reno's Problem: TCP SACK
Solution (III) to TCP Reno's Problem: TCP FACK
Solution (IV) to TCP Reno's Problem: TCP Vegas
Problem of TCP Bulk Data Transfers: Reno's Multiple Packet Loss Problem (1/2)
(S = sender, D = destination)
(1) cwnd=8, awnd=8. Sender sent segments 31-38.
(2) cwnd=8, awnd=8. Receiver replied with five duplicate ACKs of segment 30.
(3) cwnd=7, awnd=8. Sender received three duplicate ACKs and cwnd was changed to (8/2)+3 packets. The lost segment 31 was retransmitted.
Problem of TCP Bulk Data Transfers: Reno's Multiple Packet Loss Problem (2/2)
(4) cwnd=9, awnd=8->9. Receiver replied with the ACK of segment 32 when it received the retransmitted segment 31. This is a partial ACK. Sender sent segment 39.
(5) cwnd=4, awnd=7. Sender exited fast recovery and entered the congestion avoidance state when receiving the partial ACK. cwnd was changed to 4 segments.
(6) cwnd=4, awnd=7. Sender waited until timeout.
Eliminating MPL Problem (I): TCP NewReno (1/3)
RFC 2582: Extending the Fast-Recovery Phase
  Remain in Fast-Recovery until all data that was in the pipe before detecting 3-Dup ACK is ACKed
(1) cwnd=8, awnd=8. Sender sent segments 31-38.
(2) cwnd=8, awnd=8. Receiver replied with five duplicate ACKs of segment 30.
(3) cwnd=7, awnd=8. Sender received three duplicate ACKs and cwnd was changed to (8/2)+3 packets. The lost segment 31 was retransmitted.
Eliminating MPL Problem (I): TCP NewReno (2/3)
(4) cwnd=9, awnd=8->9. Receiver replied with the ACK of segment 32 when it received the retransmitted segment 31. This is a partial ACK.
(5) cwnd=8, awnd=7->8. Sender received a partial ACK of segment 32 and immediately retransmitted the lost segment 33. cwnd was changed to 9-2+1.
(6) cwnd=9, awnd=8->9. Sender received a duplicate ACK and increased cwnd by 1, so segment 41 was sent out. Receiver replied with a partial ACK and one duplicate ACK of segment 33.
(7) cwnd=9, awnd=9->8. The partial ACK triggered the sender to retransmit segment 34 and shrink awnd to 8 (41-33). Receiver replied with an ACK of segment 33 upon receiving segment 34.
Eliminating MPL Problem (I): TCP NewReno (3/3)
(8) cwnd=10, awnd=9->10. Upon receiving the duplicate ACK of segment 33, cwnd was advanced by one. Since awnd was smaller than cwnd, two new segments (42 and 43) were sent.
(9) cwnd=11, awnd=10->11. On receiving the duplicate ACK of segment 33, cwnd was advanced by one and thus segment 44 was sent out.
(10) cwnd=11, awnd=10->11. Receiver replied with ACKs of segments 40, 42, 43, and 44.
(11) cwnd=4, awnd=4. Sender exited fast recovery upon receiving the ACK of segment 40. cwnd and awnd were reset to 4.
Eliminating MPL Problem (II): TCP SACK (1/2)
Reporting non-contiguous blocks of data
(1) cwnd=8, awnd=8. Sender received the ACK of segment 30 and sent segments 31-38.
(2) cwnd=8, awnd=8. Receiver sent five duplicate ACKs of segment 30 with SACK options.
(3) cwnd=4, awnd=6. Sender received duplicate ACKs and began retransmitting the lost segments reported in the SACK options. awnd was set to 8-3+1 (three duplicate ACKs and one retransmitted segment).
SACK options carried by duplicate ACKs 1-5:
  1: (32,32; 0, 0; 0, 0)
  2: (35,35;32,32; 0, 0)
  3: (35,36;32,32; 0, 0)
  4: (35,37;32,32; 0, 0)
  5: (35,38;32,32; 0, 0)
Eliminating MPL Problem (II): TCP SACK (2/2)
(4) cwnd=4, awnd=4. Receiver replied with partial ACKs for the received retransmitted segments.
(5) cwnd=4, awnd=2->4. Sender received partial ACKs, reduced awnd by two, and thus retransmitted two lost segments.
(6) cwnd=4, awnd=4. Receiver replied with ACKs for the received retransmitted segments.
(7) cwnd=4, awnd=4. Sender exited fast recovery after receiving the ACK of segment 38.
Eliminating MPL Problem (III): TCP FACK (1/2)
Extension of SACK, with a better estimation of awnd
(1) cwnd=8, awnd=8. Sender received the ACK of segment 30 and sent segments 31-38.
(2) cwnd=8, awnd=8. Receiver sent five duplicate ACKs of segment 30 with SACK options.
(3) cwnd=4, awnd=4. Sender received two duplicate ACKs and began retransmitting the lost segments reported in the SACK options.
SACK options carried by duplicate ACKs 1-5:
  1: (32,32; 0, 0; 0, 0)
  2: (35,35;32,32; 0, 0)
  3: (35,36;32,32; 0, 0)
  4: (35,37;32,32; 0, 0)
  5: (35,38;32,32; 0, 0)
Eliminating MPL Problem (III): TCP FACK (2/2)
(4) cwnd=4, awnd=4. Sender calculated awnd from the received duplicate ACKs and kept sending the packets allowed.
(5) cwnd=4, awnd=4. Receiver replied with ACKs.
(6) cwnd=4, awnd=4. Sender exited fast recovery after receiving the ACK of segment 38.
Performance of Bulk Data Transfers
What have you observed?
When RTTs are heterogeneous……
5.4 Socket Programming Interface
Programming Interface to Protocol Layers in Linux
  Accessing End-to-End Protocol Layer
  Accessing Internetworking Protocol Layer
  Accessing Direct-Linked Protocol Layer
Packet Capturing & Filtering
5.4 Socket Programming Interface
Issue: programming interface to protocol layers
Socket interface in Linux 2.2.17 kernel
  User-space: Application, Socket Library
  Socket interface
  Kernel-space:
    BSD Socket: net/socket.c
    INET Socket: net/ipv4/af_inet.c
    TCP/UDP: net/ipv4/{tcp*,udp*}
    IP (with ARP, ICMP, …): net/ipv4/{ip*,icmp*}
    ethernet-header builder: net/ethernet/eth.c
    ethernet NIC Driver: drivers/net/*.{c,h}
Bridging Applications & End-to-End Protocols
socket(domain, type, protocol)
  domain: INET domain, AF_INET
  type: UDP: SOCK_DGRAM; TCP: SOCK_STREAM
  protocol: NULL
Typical applications: telnet, ftp, HTTP
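The parameter choices above can be tried with a few lines of C. This is a minimal sketch with hypothetical helper names; it passes 0 as the protocol argument (the integer counterpart of the NULL shown above), which lets the kernel pick TCP for SOCK_STREAM and UDP for SOCK_DGRAM.

```c
#include <sys/socket.h>
#include <netinet/in.h>

/* Open an end-to-end (transport layer) socket in the INET domain.
 * use_tcp != 0 selects SOCK_STREAM (TCP); otherwise SOCK_DGRAM (UDP).
 * Returns a descriptor, or -1 on failure. */
int open_transport_socket(int use_tcp)
{
    return socket(AF_INET, use_tcp ? SOCK_STREAM : SOCK_DGRAM, 0);
}

/* Ask the kernel which type a socket has; returns -1 on error. */
int socket_type(int fd)
{
    int type = 0;
    socklen_t len = sizeof(type);

    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) == -1)
        return -1;
    return type;
}
```

A caller would follow open_transport_socket(1) with connect() on the client side, or with bind(), listen(), and accept() on the server side.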
Elementary Socket: TCP Client/Server
TCP Client:
  socket(): obtain a descriptor
  connect(): initiate 3-way handshake (connection establishment)
  write(): data (request)
  read(): data (reply)
  close(): end-of-life notification
TCP Server:
  socket(): obtain a descriptor
  bind(): assign IP & port to the socket
  listen(): 1. switch to passive socket 2. create connection queue
  accept(): blocks until connection from client; enter ESTABLISHED state
  read(): data (request)
  process request
  write(): data (reply)
  read(): end-of-life notification
  close()
Elementary Socket: UDP Client/Server
UDP Client:
  socket(): obtain a descriptor
  sendto(): data (request)
  recvfrom(): data (reply)
  close()
UDP Server:
  socket(): obtain a descriptor
  bind(): assign IP & port to the socket
  recvfrom(): blocks until a datagram arrives from a client
  process request
  sendto(): data (reply)
Open Source Implementation 5.7: Socket Read/Write Inside out
User space
  Server (socket creation, then send data): socket(), bind(), listen(), accept(), write()
  Client (socket creation, then read the data): socket(), connect(), read()
Kernel space (socket calls on both sides enter through sys_socketcall)
  Server:
    socket():  sys_socket -> sock_create -> inet_create
    bind():    sys_bind -> inet_bind
    listen():  sys_listen -> inet_listen
    accept():  sys_accept -> inet_accept -> tcp_accept -> wait_for_connection
    write():   sys_write -> do_sock_write -> sock_sendmsg -> inet_sendmsg -> tcp_sendmsg -> tcp_write_xmit
  Client:
    socket():  sys_socket -> sock_create -> inet_create
    connect(): sys_connect -> inet_stream_connect -> tcp_v4_getport -> tcp_v4_connect -> inet_wait_connect
    read():    sys_read -> do_sock_read -> sock_recvmsg -> sock_common_recvmsg -> tcp_recvmsg -> memcpy_toiovec
Server and client are connected through the Internet.
Open Source Implementation 5.7: Socket Read/Write Inside out
Kernel data structures behind an opened Linux socket:
  struct files_struct (linux/sched.h): count, file_lock, max_fds, max_fdset, next_fd, fd[0], fd[1], ……, fd[255]
  struct file (linux/fs.h): f_list, f_dentry, f_vfsmnt, f_op, f_count, f_flags, f_mode, f_pos, ……
  struct dentry (linux/dentry.h): d_count, d_flags, d_inode, d_parent, ……
  struct inode (linux/fs.h): ……, union u, ……
  struct socket: ……, inode, file, sk, ……
  struct sock (net/sock.h): s_addr, d_addr, sport, dport, bound_dev_if, ……, receive_queue, write_queue, ……, proto, ……, union tp_pinfo (struct tcp_opt: ……, snd_cwnd, ……), ……, sk_filter, ……, socket, ……
  struct proto (net/sock.h): close, connect, disconnect, accept, ioctl, init, destroy, shutdown, setsockopt, getsockopt, sendmsg, recvmsg, ……
  struct tcp_func (ipv4/tcp_ipv4.c): tcp_close, tcp_v4_connect, tcp_disconnect, tcp_accept, tcp_ioctl, tcp_v4_init_sock, tcp_v4_destroy_sock, tcp_shutdown, tcp_setsockopt, tcp_getsockopt, tcp_sendmsg, tcp_recvmsg, ……
Pointer chain: fd[] -> struct file -> struct dentry -> struct inode (union u) -> struct socket -> struct sock -> struct proto
Performance Matters: Interrupt and Memory Copy at Socket
Latency in transmitting TCP segments in the TCP layer
Latency in receiving TCP segments in the TCP layer
Bridging Applications to Internetworking Protocols in Linux 2.6
socket(domain, type, protocol)
Parameters:
  domain: PF_PACKET (PACKET domain)
  type: SOCK_DGRAM
  protocol: NULL
Kernel functions: net/packet/af_packet.c
Typical applications: ping, traceroute
Bridging Applications to Node-to-Node Protocols in Linux 2.6
socket(domain, type, protocol)
Parameters:
  domain: PF_PACKET (PACKET domain)
  type: SOCK_RAW
  protocol: ETH_P_IP for Ethernet-encapsulated IP packets, or others in "/usr/include/linux/if_ether.h"
Complete access to the Ethernet header
Kernel functions: net/packet/af_packet.c
Typical applications:
  Packet sniffers => performance problem!!!
  Hacking tools
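Open Source Implementation 5.8: Bypassing the End-to-End Layer
#include <stdio.h>
#include <arpa/inet.h>        /* htons() */
#include <sys/socket.h>       /* socket(), recvfrom() */
#include <linux/if_ether.h>   /* ETH_P_ALL */

int main() {
    int n;
    int fd;
    char buf[2048];
    /* PF_PACKET + SOCK_RAW delivers whole frames, Ethernet header included.
       Opening such a socket requires root privileges. */
    if ((fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL))) == -1) {
        printf("fail to open socket\n");
        return 1;
    }
    while (1) {
        n = recvfrom(fd, buf, sizeof(buf), 0, 0, 0);
        if (n > 0)
            printf("recv %d bytes\n", n);
    }
    return 0;
}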
Open Source Implementation 5.9: Making Myself Promiscuous
strncpy(ethreq.ifr_name,"eth0",IFNAMSIZ);
ioctl(sock, SIOCGIFFLAGS, &ethreq);
ethreq.ifr_flags |= IFF_PROMISC;
ioctl(sock, SIOCSIFFLAGS, &ethreq);
Packet Sniffers: Packet Capturing & Filtering
Capture until what header?
Towards Efficient Packet Filtering: Layered Model
  User-Space Tool: tcpdump
  User-Space Packet Filter: libpcap (portable)
  Kernel-Space Packet Filter: Linux Socket Filter
Open Source Implementation 5.10: Linux Socket Filter
Linux Socket Filter (net/core/filter.c)
Similar to BPF (Berkeley Packet Filter)
[Figure: BPF-style architecture. User space: two network monitors and rarpd. Kernel: one filter and one buffer per consumer, alongside the protocol stack, all fed by the link-level drivers attached to the network.]
5.5 Transport Protocols for Streaming
Issues
Real-Time Transport Protocol (RTP)
RTP Control Protocol (RTCP)
Example: VoIP Gateway Using RTP/RTCP
Issue 1: Multi-homing & Multi-streaming
Stream Control Transmission Protocol (SCTP)
  Multi-homing
    an SCTP session can be built concurrently over multiple connections through different network adapters
    a heartbeat for each connection
  Multi-streaming
    supports ordered reception within each stream
    avoids the HOL (head-of-line) blocking of TCP
  a 4-way handshake mechanism for security
Issue 2: Smooth Rate Control and TCP-friendliness
AIMD is not suitable for streaming
TCP-friendliness: a flow should
  respond to congestion in the transient state
  use no more bandwidth than a TCP flow in the steady state when both experience the same network conditions, such as packet loss ratio and RTT
Datagram Congestion Control Protocol (DCCP): free selection of a congestion control scheme
Principle in Action: Streaming: TCP or UDP?
Why not TCP?
  loss retransmission mechanism
  continuous rate fluctuation
Why not UDP?
  too simple
  dropped by network devices for security
They are the only two mature protocols, so..
  UDP is used to carry pure audio streaming, such as VoIP.
  TCP is used for streaming with a large buffer -> delay
    OK for one-way applications, e.g. watching clips from YouTube
    Not OK for interactive applications, like video conferencing
Issue 3: Playback Reconstruction and Path Quality Report
Issues: Codec Encapsulation & Path Quality Report
  Data-Plane: Video/Voice Codecs
    Video: H.263…
    Voice: G.729…
  Control-Plane: Delay/Jitter/Loss Report
RFC Standards: RTP & RTCP
  RTP: Data-Plane, Encapsulating the Chosen Codec
  RTCP: Control-Plane, Reporting Delay/Jitter/Loss to Senders
RTP (Real-time Transport Protocol)
Objectives
  Eliminating Packet Reorder & Loss Detection:
    Sequence #
    Timestamp
    Synchronization Source Identifier
    Contributing Source Identifier
Header Format
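The fields listed above occupy a 12-byte fixed header, laid out in RFC 3550 as: version (2 bits), padding (1), extension (1), CSRC count (4), marker (1), payload type (7), sequence number (16), timestamp (32), SSRC (32), followed by zero to fifteen 32-bit CSRC identifiers. The sketch below packs and parses that fixed part byte by byte; the struct and function names (rtp_header, rtp_pack, rtp_parse) are hypothetical and not taken from any RTP library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RTP_FIXED_HEADER_LEN 12

struct rtp_header {
    uint8_t  version;       /* 2 bits; 2 for RFC 3550 */
    uint8_t  padding;       /* 1 bit */
    uint8_t  extension;     /* 1 bit */
    uint8_t  csrc_count;    /* 4 bits; number of CSRC entries that follow */
    uint8_t  marker;        /* 1 bit */
    uint8_t  payload_type;  /* 7 bits; identifies the codec */
    uint16_t sequence;      /* reordering and loss detection */
    uint32_t timestamp;     /* sampling instant, for playback timing */
    uint32_t ssrc;          /* synchronization source identifier */
};

static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Writes the fixed header in network byte order.
 * Returns the number of bytes written, or 0 if buf is too small. */
size_t rtp_pack(const struct rtp_header *h, uint8_t *buf, size_t len)
{
    assert(h != NULL && buf != NULL);
    if (len < RTP_FIXED_HEADER_LEN)
        return 0;
    buf[0] = (uint8_t)(((h->version & 0x3) << 6) | ((h->padding & 0x1) << 5) |
                       ((h->extension & 0x1) << 4) | (h->csrc_count & 0xF));
    buf[1] = (uint8_t)(((h->marker & 0x1) << 7) | (h->payload_type & 0x7F));
    buf[2] = (uint8_t)(h->sequence >> 8);
    buf[3] = (uint8_t)(h->sequence & 0xFF);
    put_be32(buf + 4, h->timestamp);
    put_be32(buf + 8, h->ssrc);
    return RTP_FIXED_HEADER_LEN;
}

/* Parses the fixed header. Returns 0 on success, -1 if the buffer is
 * shorter than the fixed header plus the announced CSRC list. */
int rtp_parse(const uint8_t *buf, size_t len, struct rtp_header *h)
{
    assert(buf != NULL && h != NULL);
    if (len < RTP_FIXED_HEADER_LEN)
        return -1;
    h->version      = buf[0] >> 6;
    h->padding      = (buf[0] >> 5) & 0x1;
    h->extension    = (buf[0] >> 4) & 0x1;
    h->csrc_count   = buf[0] & 0xF;
    h->marker       = buf[1] >> 7;
    h->payload_type = buf[1] & 0x7F;
    h->sequence     = (uint16_t)((buf[2] << 8) | buf[3]);
    h->timestamp    = get_be32(buf + 4);
    h->ssrc         = get_be32(buf + 8);
    if (len < RTP_FIXED_HEADER_LEN + 4u * h->csrc_count)
        return -1;
    return 0;
}
```

A receiver would sort packets by sequence, detect gaps in it as losses, and schedule playback from timestamp.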
RTCP (RTP Control Protocol)
Objectives
  Reporting End-to-End Delay
  Reporting Delay Jitter
  Reporting Loss Rate
Report to sender for what?
  Switch to lower-bitrate codec
User may get smoother real-time
VoIP using RTP: Multiplexing using SSRC
One RTP session between VoIP gateways
Many phone calls between branch offices
Multiplexing using different SSRC IDs within the RTP session
[Figure: Phone, Public Telephone Network, VoIP Gateway, IP Cloud (Internet or private IP network, with a Gatekeeper), VoIP Gateway, Public Telephone Network, Phone]
VoIP using RTP: Codec Encapsulation
Compress/Decompress: Inside a VoIP Gateway Codec
  Analog signal source
  -> Analog to Digital Converter: 16 bits, 8 kHz = 128 kbps
  -> Compander (A-Law, u-Law): 8 bits, 8 kHz = 64 kbps
  -> Digital output signal: 64 kbps
The converter assigns 16 bits evenly distributed across x,y coordinates of the sine
The compander compresses the data
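The compander step, from 16-bit linear samples to 8-bit codes, can be sketched with the widely used segment-based u-law encoding associated with G.711. The slides do not give the algorithm, so the constants (bias 0x84, clip 32635) and the function names below are assumptions taken from that common formulation, not from the book. Each 8-bit code holds a sign bit, a 3-bit exponent (segment), and a 4-bit mantissa, stored inverted.

```c
#include <assert.h>
#include <stdint.h>

#define ULAW_BIAS 0x84   /* 132: added so every magnitude has bit 7 or higher set */
#define ULAW_CLIP 32635  /* largest magnitude that still fits after the bias */

/* Compress one 16-bit linear PCM sample into an 8-bit u-law code. */
uint8_t linear_to_ulaw(int16_t pcm)
{
    int sample = pcm;
    int sign = 0, exponent = 7, mantissa, mask;

    if (sample < 0) {
        sign = 0x80;
        sample = -sample;
    }
    if (sample > ULAW_CLIP)
        sample = ULAW_CLIP;
    sample += ULAW_BIAS;
    assert(sample >= ULAW_BIAS && sample <= 0x7FFF);

    /* Exponent = position of the highest set bit, counted from bit 7. */
    for (mask = 0x4000; (sample & mask) == 0 && exponent > 0; mask >>= 1)
        exponent--;
    /* Mantissa = the four bits just below that highest bit. */
    mantissa = (sample >> (exponent + 3)) & 0x0F;
    return (uint8_t)~(sign | (exponent << 4) | mantissa);
}

/* Expand an 8-bit u-law code back to a 16-bit linear PCM sample. */
int16_t ulaw_to_linear(uint8_t code)
{
    int sign, exponent, mantissa, sample;

    code = (uint8_t)~code;
    sign = code & 0x80;
    exponent = (code >> 4) & 0x07;
    mantissa = code & 0x0F;
    sample = (((mantissa << 3) + ULAW_BIAS) << exponent) - ULAW_BIAS;
    return (int16_t)(sign ? -sample : sample);
}
```

The encoding is lossy but logarithmic: small amplitudes keep fine resolution and large ones are quantized coarsely. At 8000 samples per second, 8 bits per sample gives the 64 kbps shown on the slide, half of the 128 kbps linear stream.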
Historical Evolution: RTP Implementation Resources
Sample Implementation in RFC 1889: http://rfc.net/rfc1889.txt
Vat: http://www-nrg.ee.lbl.gov/vat/
Rtptools: ftp://ftp.cs.columbia.edu/pub/schulzrinne/rtptools/
NeVoT: http://www.cs.columbia.edu/~hgs/rtp/nevot.html
RTP Library: http://www.iasi.rm.cnr.it/iasi/netlab/gettingSoftware.html
  by E.A.Mastromartino; offers convenient ways to incorporate RTP functionality into C++ Internet applications.
5.6 Summary (1/2)
Three key features in process-to-process channels
  (1) port-level addressing, (2) reliable packet delivery, (3) flow rate control
  UDP: (1) only; TCP: all of them
TCP techniques
  three-way handshake
  ack/retx, sliding-window flow control
  various versions of congestion control to retx potentially lost packets
5.6 Summary (2/2)
Real-time transport by RTP/RTCP
  multi-streaming, multi-homing, smooth rate control, TCP-friendliness, playback reconstruction, and path quality reporting
Socket interfaces to different layers