Anders Magnusson TCP Tuning and E2E Performance TREFpunkt - October 20, 2004.

Anders Magnusson

TCP Tuning and E2E Performance

TREFpunkt - October 20, 2004

The speed-of-light problem

The sender must store every sent packet until it has received an ACK from the receiver

Due to speed-of-light limitations this might take a while, even in small countries like Sweden

Theoretical RTT Luleå-Stockholm is (1000 km / 300000 km/s) * 2 = 6.7 ms; in reality it is about 20 ms

To keep up with 1 Gbit/s, the TCP window size must then be (1000 Mbit/s / 8) * 0.02 s = 2.5 Mbyte
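The arithmetic on this slide is the bandwidth-delay product. A minimal Python sketch of it follows; the function name is illustrative and not from the talk, and the inputs are the figures quoted above.

```python
def bandwidth_delay_product(bits_per_second: float, rtt_seconds: float) -> float:
    """Return the number of bytes that must be in flight to keep the path full."""
    return bits_per_second / 8 * rtt_seconds


# Theoretical round-trip time for 1000 km at 300000 km/s, in milliseconds.
theoretical_rtt_ms = 1000 / 300_000 * 2 * 1000

# Window needed for 1 Gbit/s at the measured 20 ms RTT, in bytes.
window_bytes = bandwidth_delay_product(1e9, 0.020)

print(f"theoretical RTT: {theoretical_rtt_ms:.1f} ms")
print(f"required window: {window_bytes / 1e6:.1f} Mbyte")
```

Doubling either the link speed or the RTT doubles the window that is needed.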

Operating system buffers

Inside the operating system kernel there are usually several different buffers that affect performance

The term “buffers” is somewhat misleading. Usually a buffer is just some sort of data structure used to reference data in memory (but in theory it could just as well be a real buffer)

TCP window buffers

The TCP window sizes can be adjusted on virtually all operating systems

There are two windows, send and receive

The window size for one direction of flow is set to MIN(sender’s send window, receiver’s receive window)

The send window must be large enough to hold all segments sent during one RTT
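A small Python sketch of the MIN rule and its consequence: since at most one window can be in flight per round trip, the smaller of the two windows caps the throughput. The function names are illustrative. The example values are the 2.5 Mbyte window and 20 ms RTT from the earlier slide, paired with an untuned 65535-byte receive window.

```python
def effective_window(send_window: int, receive_window: int) -> int:
    """The window used for one direction of flow is the smaller of the two."""
    return min(send_window, receive_window)


def max_throughput_bits_per_second(window_bytes: int, rtt_seconds: float) -> float:
    """At most one full window can be in flight per round trip."""
    return window_bytes * 8 / rtt_seconds


# A tuned sender talking to an untuned receiver over a 20 ms path.
window = effective_window(send_window=2_500_000, receive_window=65_535)
rate = max_throughput_bits_per_second(window, 0.020)

print(f"effective window: {window} bytes")
print(f"throughput ceiling: {rate / 1e6:.1f} Mbit/s")
```

Tuning only one end is therefore not enough: the untuned side decides the ceiling.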

Socket buffers

Socket buffers limit the amount of data an application may write to the kernel before being blocked

They are often combined with the TCP send window; when ACKs are received, the socket buffer data is adjusted accordingly

Must be >= the TCP window size, otherwise the socket buffer becomes the limiting factor
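From the application side, socket buffer sizes are requested per socket. The Python sketch below uses the standard SO_SNDBUF and SO_RCVBUF socket options. The 2.5 Mbyte request is only an example value, and the helper name is illustrative. What the kernel actually grants depends on the system-wide limits, so the code reads the value back instead of assuming the request was honoured.

```python
import socket


def request_buffer(sock: socket.socket, option: int, size: int) -> int:
    """Ask for a socket buffer size and return what the kernel reports afterwards."""
    try:
        sock.setsockopt(socket.SOL_SOCKET, option, size)
    except OSError:
        # Some systems reject a request above their limit instead of capping it.
        pass
    return sock.getsockopt(socket.SOL_SOCKET, option)


requested = 2_500_000  # example value: the window computed for 1 Gbit/s at 20 ms

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
granted_send = request_buffer(sock, socket.SO_SNDBUF, requested)
granted_recv = request_buffer(sock, socket.SO_RCVBUF, requested)
sock.close()

print(f"requested {requested}, send buffer {granted_send}, receive buffer {granted_recv}")
```

If the reported values are smaller than the request, the system-wide maximum has to be raised first.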

MBUF clusters

There are limits on how many network buffers (called MBUFs in many OSes) may be allocated

MBUFs may have external storage associated with them, allocated out of a separate (limited) area

These buffers are often allocated at compile time, and it is not uncommon that physical memory is statically allocated for them

Other knobs to turn

RFC1323

Turns on the “Window scaling option” needed to use TCP windows larger than 64k

Initial window size

Avoid slow-start by injecting many packets into the network at connection startup

Interface queues

Be able to store the packets that are ready to send until the network interface can transmit them
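The 64k limit comes from the 16-bit window field in the TCP header; the window scaling option shifts that value left by up to 14 bits. The Python sketch below computes the smallest shift that covers a given window. The function name and the example window are illustrative.

```python
MAX_UNSCALED_WINDOW = 65_535  # largest value of the 16-bit TCP window field
MAX_SHIFT = 14                # largest shift count allowed by RFC 1323


def window_scale_shift(window_bytes: int) -> int:
    """Return the smallest shift count whose scaled window covers window_bytes."""
    shift = 0
    while (MAX_UNSCALED_WINDOW << shift) < window_bytes:
        shift += 1
        if shift > MAX_SHIFT:
            raise ValueError("window too large even with maximum scaling")
    return shift


# The 2.5 Mbyte window from the speed-of-light slide.
print(window_scale_shift(2_500_000))
```

A window that fits in 64k needs no scaling at all, which is why the option only matters on paths with a large bandwidth-delay product.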

Problems often seen

Packet loss

On a long-distance high-speed connection, packet loss in a TCP flow will reduce the speed significantly

If the sender enters congestion avoidance, the congestion window will open linearly, and with large windows this will be really slow

With an RTT of 185ms and window size of 25MB it will take around 50 minutes to reach full speed
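A rough Python sketch of where a figure of this size comes from. The slide does not state the segment size or the starting window, so both are assumptions here: a 1460-byte MSS, growth of one MSS per RTT, and a window that has to be rebuilt from close to zero.

```python
def linear_ramp_seconds(
    target_window_bytes: float,
    rtt_seconds: float,
    mss_bytes: int = 1460,          # assumption: typical Ethernet MSS
    start_window_bytes: float = 0,  # assumption: window rebuilt from near zero
) -> float:
    """Time for the congestion window to grow linearly by one MSS per RTT."""
    round_trips = (target_window_bytes - start_window_bytes) / mss_bytes
    return round_trips * rtt_seconds


minutes = linear_ramp_seconds(25_000_000, 0.185) / 60
print(f"about {minutes:.0f} minutes to reach full speed")
```

Under these assumptions the result lands in the same range as the slide's figure. If the window were only halved instead of collapsed, passing `start_window_bytes=12_500_000` would roughly halve the time, which is still very slow for a single lost packet.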

Packet bursts

During the startup of a TCP bulk flow, the exponential increase in packet injection into the network during slow-start may cause packet bursts on links with a large bandwidth-delay product

The result may be that intermediate switches/routers must drop packets, even though the TCP self-clocking would not permit more packets to be sent than could be received
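An idealized Python sketch of the exponential growth during slow-start: the congestion window doubles every round trip until it reaches the target. The 1460-byte MSS and the initial window of one segment are assumptions, and effects such as delayed ACKs and loss are ignored.

```python
def slow_start_rounds(
    target_window_bytes: int,
    mss_bytes: int = 1460,      # assumption: typical Ethernet MSS
    initial_segments: int = 1,  # assumption: classic initial window
) -> list[int]:
    """Return the congestion window, in segments, for each round trip."""
    target_segments = -(-target_window_bytes // mss_bytes)  # ceiling division
    cwnd = initial_segments
    rounds = [cwnd]
    while cwnd < target_segments:
        cwnd = min(cwnd * 2, target_segments)
        rounds.append(cwnd)
    return rounds


rounds = slow_start_rounds(2_500_000)
for rtt_number, segments in enumerate(rounds):
    print(f"RTT {rtt_number}: up to {segments} segments sent")
```

In the last few round trips, hundreds of segments are added per RTT, and they can arrive at an intermediate queue faster than it drains.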

ACK/window updates

The traditional approach for bulk flows is for the receiver to send an ACK for every second received packet

Window updates are sent as soon as data is delivered to the receiving process

As a result, the number of packets in the return direction is more than half the number of transmitted packets

Interrupts and packet handling in the sending host may use a significant amount of CPU

ARP timeouts

When an ARP entry times out, it is usually just removed from the ARP cache, and the next packet will initiate a new ARP request

If there is an ongoing packet flow, this approach may cause packets to be dropped until an ARP reply is received

Tuning of NetBSD

sysctl -w net.inet.tcp.rfc1323=1
Activate window scaling and timestamp options according to RFC1323.

sysctl -w kern.somaxkva=[sbmax]
Set maximum size for all socket buffers together in the system.

sysctl -w kern.sbmax=[sbmax]
Set maximum size of socket buffer for one TCP flow.

sysctl -w net.inet.tcp.recvspace=[wstd]
sysctl -w net.inet.tcp.sendspace=[wstd]
Set max size of TCP windows.

sysctl kern.mbuf.nmbclusters
View maximum number of mbuf clusters. Used for storage of data packets to/from the network interface. Can only be set by recompiling your kernel.

Tuning of FreeBSD

sysctl net.inet.tcp.rfc1323=1
Activate window scaling and timestamp options according to RFC1323.

sysctl kern.ipc.maxsockbuf=[sbmax]
Set maximum size of TCP window.

sysctl net.inet.tcp.recvspace=[wstd]
sysctl net.inet.tcp.sendspace=[wstd]
Set max size of TCP windows.

sysctl kern.ipc.nmbclusters
View maximum number of mbuf clusters. Used for storage of data packets to/from the network interface. Can only be set at boot time.

Tuning of Linux

echo "1" > /proc/sys/net/ipv4/tcp_window_scaling
Activate window scaling according to RFC 1323.

echo [wmax] > /proc/sys/net/core/rmem_max
echo [wmax] > /proc/sys/net/core/wmem_max
Set maximum size of TCP windows.

echo [wmax] > /proc/sys/net/core/rmem_default
echo [wmax] > /proc/sys/net/core/wmem_default
Set default size of TCP windows.

echo "[wmin] [wstd] [wmax]" > /proc/sys/net/ipv4/tcp_rmem
echo "[wmin] [wstd] [wmax]" > /proc/sys/net/ipv4/tcp_wmem
Set min, default, max windows. Used by the autotuning function.

echo "[bmin] [bdef] [bmax]" > /proc/sys/net/ipv4/tcp_mem
Set maximum total TCP buffer-space allocatable. Used by the autotuning function.
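To show how the placeholders are filled in, here is a small Python sketch that only prints the window-related command lines for chosen values and does not touch /proc. The function name and the three example numbers are illustrative assumptions, not recommendations from the talk. The tcp_mem line is left out because no values for it are given here.

```python
def linux_tcp_tuning_commands(wmin: int, wstd: int, wmax: int) -> list[str]:
    """Fill the [wmin]/[wstd]/[wmax] placeholders into the slide's command lines."""
    return [
        'echo "1" > /proc/sys/net/ipv4/tcp_window_scaling',
        f"echo {wmax} > /proc/sys/net/core/rmem_max",
        f"echo {wmax} > /proc/sys/net/core/wmem_max",
        f"echo {wmax} > /proc/sys/net/core/rmem_default",
        f"echo {wmax} > /proc/sys/net/core/wmem_default",
        f'echo "{wmin} {wstd} {wmax}" > /proc/sys/net/ipv4/tcp_rmem',
        f'echo "{wmin} {wstd} {wmax}" > /proc/sys/net/ipv4/tcp_wmem',
    ]


# Example values only: a 2.5 Mbyte maximum sized for 1 Gbit/s at 20 ms RTT.
commands = linux_tcp_tuning_commands(wmin=4096, wstd=87380, wmax=2_500_000)
print("\n".join(commands))
```

The printed lines can be reviewed and then run as root on the host to be tuned.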

Tuning of Windows (2k, XP, 2k3)

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Tcp1323Opts=1
Turn on window scaling option

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpWindowSize=[wmax]
Set maximum size of TCP window

How to set a Land Speed Record

Recipe:

Really high-quality networks

Hardware capable of sending/receiving fast enough

Operating system without foolish bottlenecks

Enthusiasts that spend weekends sending an obscene amount of data between Luleå and San Jose

SUNET Internet Land Speed Record - Network setup

[Network diagram: End host in Luleå, Sweden; 10GE; GigaSunet OC-192 core; OC192; Sprintlink OC-192 core; 10GE; End host in San Jose, CA]

Network path consists of 42(!) router hops, using paths shared with other users of the networks.

Records submitted September 12

1 966 080 000 000 bytes in 3648 real seconds = 4310 Mbit/second

1831 Gbytes in almost exactly an hour

120 000 packets/second transferred with an MTU of 4470 bytes

Record submitted for the IPv4 single and multiple stream class is 124.935 Petabit-meters/second (which is a 78% increase over our previous record)
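The figures on this slide can be cross-checked with a few lines of Python. The distance is the 28,983 km fiber path quoted elsewhere in this talk. The packet rate assumes every packet is a full 4470-byte MTU, which is a simplification, so the results agree with the slide only to within rounding.

```python
bytes_transferred = 1_966_080_000_000
elapsed_seconds = 3648
mtu_bytes = 4470
distance_meters = 28_983_000  # fiber path Luleå to San Jose, from the talk

rate_bits_per_second = bytes_transferred * 8 / elapsed_seconds
gbytes = bytes_transferred / 2**30  # binary gigabytes
packets_per_second = rate_bits_per_second / 8 / mtu_bytes  # assumes full-size packets
petabit_meters_per_second = rate_bits_per_second * distance_meters / 1e15

print(f"{rate_bits_per_second / 1e6:.0f} Mbit/s")
print(f"{gbytes:.0f} Gbytes")
print(f"{packets_per_second:.0f} packets/s")
print(f"{petabit_meters_per_second:.1f} Petabit-meters/s")
```

The 1831 Gbytes figure only matches when a Gbyte is counted as 2^30 bytes.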

Compared with others

Compared to the previous record, we can note that we achieved this using:

Less powerful end hosts

200% longer distance

Less than half the MTU size (which generates heavier CPU load on the end hosts)

The normal GigaSunet and Sprintlink production infrastructures

Fiber path for the Internet LSR

Distance from Luleå, Sweden to San Jose, CA is approximately 28,983 km (18,013 miles)

Network load

More to read…

http://proj.sunet.se/LSR Describes how the Land Speed Record(s) were achieved

http://proj.sunet.se/E2E About end-to-end performance in GigaSunet