TCP transfers over high latency/bandwidth network & Grid TCP. Sylvain Ravot, sylvain@hep.caltech.edu.


Transcript of "TCP transfers over high latency/bandwidth network & Grid TCP", Sylvain Ravot, sylvain@hep.caltech.edu.

Page 1: TCP transfers over high latency/bandwidth network & Grid TCP Sylvain Ravot sylvain@hep.caltech.edu.


Page 2

Tests configuration

[Test setup diagram: Pcgiga-gbe.cern.ch (Geneva, GbE) — Cernh9 — POS 155 Mbps — Ar1-chicago — Lxusa-ge.cern.ch (Chicago, GbE) — Calren2 / Abilene — Plato.cacr.caltech.edu (California, GbE)]

CERN (Geneva) <--> Caltech (California): RTT = 175 ms, bandwidth-delay product = 3.4 MBytes.

CERN <--> Chicago: RTT = 110 ms, bandwidth-delay product = 1.9 MBytes.

TCP flows were generated by iperf. tcpdump was used to capture packet flows. tcptrace and xplot were used to plot and summarize the tcpdump data sets.

Page 3

TCP overview: Slow Start and Congestion Avoidance example

Here is an estimation of the cwnd (output of tcptrace): [plot showing the slow start and congestion avoidance phases, the SSTHRESH level, the cwnd average of the last 10 samples, and the cwnd average over the life of the connection to that point]

• Slow start: fast increase of the cwnd
• Congestion Avoidance: slow increase of the window size
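As a sketch of these two phases (this is illustrative code, not the author's; the MSS and ssthresh values are assumptions), the classic RFC 2001-style per-ACK cwnd update can be written as:

```python
# Hedged sketch of the standard per-ACK cwnd update, in bytes.
MSS = 1460  # segment size in bytes (assumed)

def on_ack(cwnd, ssthresh):
    """Return the new cwnd after one useful ACK."""
    if cwnd < ssthresh:                  # slow start: +1 MSS per ACK,
        return cwnd + MSS                # so cwnd doubles every RTT
    return cwnd + MSS * MSS // cwnd      # congestion avoidance: ~ +1 MSS per RTT

# Simulate a few RTTs: every segment in flight is ACKed once per RTT.
cwnd, ssthresh = MSS, 64 * 1024          # assumed initial values
trace = []
for _ in range(12):
    for _ in range(max(1, cwnd // MSS)):  # one ACK per in-flight segment
        cwnd = on_ack(cwnd, ssthresh)
    trace.append(cwnd)
```

Plotting `trace` reproduces the characteristic shape above: exponential growth until ssthresh, then a nearly linear ramp.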

Page 4

During congestion avoidance and without any loss, the cwnd increases by one segment each RTT. In our case we have no loss, so the window increases by 1460 bytes every 175 ms. If the cwnd is equal to 730 kByte, it takes more than 5 minutes for the cwnd to exceed the bandwidth-delay product (3.4 MByte). In other words, we have to wait more than 5 minutes to use the whole capacity of the link (155 Mbps)!
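The claim can be checked with the slide's own numbers:

```python
# Back-of-the-envelope check: during congestion avoidance, cwnd grows
# by one MSS (1460 bytes) per RTT (175 ms). Time to grow from 730 kB
# to the 3.4 MB bandwidth-delay product:
MSS = 1460          # bytes gained per RTT
RTT = 0.175         # seconds
cwnd = 730e3        # starting cwnd, bytes
bdp = 3.4e6         # bandwidth-delay product, bytes

rtts = (bdp - cwnd) / MSS   # RTTs needed
seconds = rtts * RTT
print(round(seconds), "s ~", round(seconds / 60, 1), "min")  # ~320 s, over 5 min
```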

[Two plots of cwnd = f(time), each showing the slow start and congestion avoidance phases: with SSTHRESH = 730 kByte, throughput = 33 Mbit/s; with SSTHRESH = 1460 kByte, throughput = 63 Mbit/s.]

Influence of the initial SSTHRESH on TCP performance

Page 5

Reactivity

TCP reactivity: the time to recover a 200 Mbps throughput after a loss is larger than 50 seconds for a connection between Chicago and CERN. A single loss is disastrous.

TCP is much more sensitive to packet loss in WANs than in LANs.
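A back-of-the-envelope model of this reactivity (an assumption-laden sketch, not the measured data): after a loss, cwnd is halved and then regains one MSS per RTT.

```python
# Rough AIMD recovery-time estimate; values are assumptions matching
# the CERN-Chicago path described in the slides.
MSS = 1460            # bytes (assumed)
RTT = 0.110           # seconds, CERN <-> Chicago
rate = 200e6          # target throughput, bits/s

cwnd = rate * RTT / 8           # window sustaining 200 Mbps, in bytes
lost = cwnd / 2                 # window lost when cwnd is halved
recovery = (lost / MSS) * RTT   # one MSS regained per RTT
print(round(recovery), "s")
```

This crude model yields on the order of a hundred seconds, the same order of magnitude as the measurement; the exact figure depends on the actual MSS, ACK behavior, and where in the cycle the loss occurs.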

[Plot: TCP throughput (Mbps, 0-200) between CERN and Chicago over the 622 Mbps link, as a function of time (350-490 s); after a loss, recovery takes 53 sec.]

Page 6

Linux Patch “GRID TCP”

Parameter tuning

- New parameter to better start a TCP transfer: set the value of the initial SSTHRESH.

Modifications of the TCP algorithms (RFC 2001)

- Modification of the well-known congestion avoidance algorithm: during congestion avoidance, for every useful acknowledgement received, cwnd increases by M * (segment size) * (segment size) / cwnd. This is equivalent to increasing cwnd by M segments each RTT. M is called the congestion avoidance increment.

- Modification of the slow start algorithm: during slow start, for every useful acknowledgement received, cwnd increases by N segments. N is called the slow start increment. Note: N=1 and M=1 in common TCP implementations.

- Smaller backoff (not implemented yet): reduce the strong penalty imposed by a loss, and reproduce the behavior of a multi-stream TCP connection.

Only the sender's TCP stack needs to be modified. This is an alternative to multi-stream TCP transfers.
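The modified per-ACK rules above can be sketched as follows. This is a hedged illustration, not the actual Linux patch; the ssthresh and target-window values are assumptions. Setting N = M = 1 recovers standard TCP behavior.

```python
# Sketch of the Grid TCP per-ACK rules, parameterized by the slow start
# increment N and the congestion avoidance increment M (all sizes in bytes).
MSS = 1460  # assumed segment size

def on_ack(cwnd, ssthresh, N=1, M=1):
    """New cwnd after one useful ACK under the Grid TCP rules."""
    if cwnd < ssthresh:
        return cwnd + N * MSS                # slow start: +N segments per ACK
    return cwnd + M * MSS * MSS // cwnd      # CA: ~ +M segments per RTT

def rtts_to_reach(target, N=1, M=1, ssthresh=64 * 1024):
    """RTTs for cwnd to grow from one MSS to `target` bytes."""
    cwnd, rtts = MSS, 0
    while cwnd < target:
        for _ in range(max(1, cwnd // MSS)):   # one ACK per in-flight segment
            cwnd = on_ack(cwnd, ssthresh, N, M)
        rtts += 1
    return rtts

# Larger N and M shorten the ramp-up to a given window:
print(rtts_to_reach(1_000_000))             # standard TCP (N=1, M=1)
print(rtts_to_reach(1_000_000, N=5, M=10))  # Grid TCP-style tuning
```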

Page 7

TCP tuning by modifying the slow start increment

[Four plots of the congestion window (cwnd) as a function of time, each showing the cwnd average of the last 10 samples and the cwnd average over the life of the connection to that point:
- Slow start increment = 1, throughput = 98 Mbit/s
- Slow start increment = 2, throughput = 113 Mbit/s
- Slow start increment = 3, throughput = 116 Mbit/s
- Slow start increment = 5, throughput = 119 Mbit/s
The observed slow start phase lasts between 2.0 s and 0.65 s, shorter for larger increments.]

Page 8

TCP tuning by modifying the congestion avoidance increment (1)

Congestion window (cwnd) as a function of time – congestion avoidance increment = 1, throughput = 37.5 Mbit/s

Congestion window (cwnd) as a function of time – congestion avoidance increment = 10, throughput = 61.5 Mbit/s

SSTHRESH = 0.783 MByte. With increment = 1, cwnd is increased by 1200 bytes in 27 sec; with increment = 10, cwnd is increased by 12000 bytes (10 * 1200) in 27 sec.

Page 9

Benefit of a larger congestion avoidance increment when losses occur

When a loss occurs, the cwnd is divided by two. The performance is then determined by the speed at which the cwnd increases after the loss: the higher the congestion avoidance increment, the better the performance.

We simulated losses by using a program which drops packets according to a configured loss rate. For the next two plots, the program dropped one packet every 10000 packets.

[Annotations on the plots: 1) a packet is lost; 2) fast recovery (temporary state until the loss is repaired); 3) cwnd := cwnd/2.]

Congestion window (cwnd) as a function of time – congestion avoidance increment = 1, throughput = 8 Mbit/s

Congestion window (cwnd) as a function of time – congestion avoidance increment = 10, throughput = 20 Mbit/s
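The dependence of recovery time on the congestion avoidance increment M can be sketched with a simple model (an assumed illustration, not the measured data; the cwnd value below is the CERN-Caltech bandwidth-delay product from the slides):

```python
# Time to regain the half-window lost at a halving event, when cwnd
# grows by M segments per RTT during congestion avoidance.
MSS = 1460                       # bytes (assumed)
RTT = 0.175                      # seconds, CERN <-> Caltech

def time_to_refill(cwnd_bytes, M):
    """Seconds to regain cwnd/2 at a rate of M segments per RTT."""
    return (cwnd_bytes / 2) / (M * MSS) * RTT

cwnd = 3.4e6                     # ~ the path's bandwidth-delay product
print(time_to_refill(cwnd, 1))   # standard TCP
print(time_to_refill(cwnd, 10))  # congestion avoidance increment = 10
```

In this model the recovery time scales as 1/M, which is why the increment-10 run above sustains a much higher average throughput under the same loss rate.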

Page 10

TCP Performance Improvement

Memory to memory transfers

[Bar chart: TCP performance between CERN and Chicago – throughput (Mbps, scale 0-350) for five configurations, the first three being: without any tuning; by tuning TCP buffers; "TCP Grid" on the 155 Mbps US-CERN link.]

New bottlenecks: iperf is not able to perform long transfers; Linux station with 32-bit 33 MHz PCI bus (will replace with modern server).

Remaining configurations: "TCP Grid" on the 2 x 155 Mbps US-CERN link; "TCP Grid" on the 622 Mbps US-CERN link.

Page 11

Conclusion

To achieve high throughput over a high latency/bandwidth network, we need to:

- Set the initial slow start threshold (ssthresh) to an appropriate value for the delay and bandwidth of the link.

- Avoid loss by limiting the max cwnd size.

- Recover fast if loss occurs:
  - Larger cwnd increment => we increase the cwnd faster after a loss.
  - Smaller window reduction after a loss.
