
Slide 1

TCP/IP on High Bandwidth Long Distance Paths
or: So TCP works … but still the users ask: Where is my throughput?

Richard Hughes-Jones, The University of Manchester
www.hep.man.ac.uk/~rich/ then "Talks" and look for Haystack

5th Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory

Slide 2

Layers & IP

Slide 3

The Network Layer 3: IP

IP layer properties:
- Provides best-effort delivery; it is unreliable: packets may be lost, duplicated, or arrive out of order
- Connectionless
- Provides logical addresses
- Provides routing
- Demultiplexes data on the protocol number

Slide 4

The Internet Datagram

IP header layout (20 bytes, bit positions 0, 8, 16, 19, 24, 31):
- Vers | Hlen | Type of service | Total length
- Identification | Flags | Fragment offset
- TTL | Protocol | Header checksum
- Source IP address
- Destination IP address
- IP options (if any) | Padding

On the wire: Frame header | IP header | Transport | FCS

Slide 5

IP Datagram Format (cont.)

- Type of Service (TOS): now being used for QoS
- Total length: length of the datagram in bytes, includes header and data
- Time to live (TTL): specifies how long the datagram is allowed to remain in the internet; routers decrement it by 1, and when TTL = 0 the router discards the datagram, which prevents infinite loops
- Protocol: specifies the format of the data area; protocol numbers are administered by a central authority to guarantee agreement, e.g. ICMP=1, TCP=6, UDP=17 …
- Source & destination IP address: (32 bits each) contain the IP address of the sender and the intended recipient
- Options: (variable length) mainly used to record a route, or timestamps, or to specify routing

Slide 6

The Transport Layer 4: UDP

UDP provides:
- Connectionless service over IP: no setup or teardown, one packet at a time
- Minimal overhead - high performance
- Best-effort delivery; it is unreliable: packets may be lost, duplicated, or arrive out of order
- The application is responsible for data reliability, flow control and error handling

Slide 7

UDP Datagram Format

- Source/destination port: port numbers identify the sending & receiving processes; a port number plus an IP address allows any application on the Internet to be uniquely identified
- Ports can be static or dynamic; static ports (< 1024) are assigned centrally and known as well-known ports
- Message length: in bytes, includes the UDP header and data (min 8, max 65,535)

UDP header layout (8 bytes, bit positions 0, 8, 16, 24, 31):
- Source port | Destination port
- UDP message length | Checksum (optional)

On the wire: Frame header | IP header | UDP header | Application data | FCS

Slide 8

The Transport Layer 4: TCP

TCP (RFC 793, RFC 1122) provides:
- Connection-oriented service over IP: during setup the two ends agree on details, there is an explicit teardown, and multiple connections are allowed
- Reliable end-to-end byte-stream delivery over an unreliable network; it takes care of lost packets, duplicated packets and out-of-order packets
- Data buffering, flow control, error detection & handling, and it limits network congestion

Slide 9

The TCP Segment Format

TCP header layout (20 bytes, bit positions 0, 4, 8, 10, 16, 24, 31):
- Source port | Destination port
- Sequence number
- Acknowledgement number
- Hlen | Resv | Code | Window
- Checksum | Urgent ptr
- Options (if any) | Padding

On the wire: Frame header | IP header | TCP header | Application data | FCS

Slide 10

TCP Segment Format (cont.)

- Source/Dest port: TCP port numbers to identify the applications at both ends of the connection
- Sequence number: first byte in this segment of the sender's byte stream
- Acknowledgement: identifies the number of the byte the sender of this (ACK) segment expects to receive next
- Code: used to determine the segment purpose, e.g. SYN, ACK, FIN, URG
- Window: advertises how much data this station is willing to accept; can depend on the buffer space remaining
- Options: used for window scaling, SACK, timestamps, maximum segment size etc.
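
As a concrete illustration of the layout on slides 9-10 (a sketch; the function and field names are illustrative only, not from the slides), the fixed 20-byte TCP header can be unpacked with Python's standard struct module:

import struct

def parse_tcp_header(raw: bytes):
    """Unpack the fixed 20-byte TCP header shown on slides 9-10."""
    src, dst, seq, ack, off_code, window, csum, urg = struct.unpack("!HHIIHHHH", raw[:20])
    hlen_words = off_code >> 12             # header length in 32-bit words (top 4 bits)
    code = off_code & 0x003F                # URG, ACK, PSH, RST, SYN, FIN bits
    return {
        "src_port": src, "dst_port": dst,
        "seq": seq, "ack": ack,
        "hlen_bytes": hlen_words * 4,
        "flags": {name: bool(code & bit) for name, bit in
                  [("FIN", 0x01), ("SYN", 0x02), ("RST", 0x04),
                   ("PSH", 0x08), ("ACK", 0x10), ("URG", 0x20)]},
        "window": window, "checksum": csum, "urgent_ptr": urg,
    }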

Slide 11

TCP - Providing Reliability

- Positive acknowledgement (ACK) of each received segment
- The sender keeps a record of each segment sent
- The sender awaits an ACK: "I am ready to receive byte 2048 and beyond"
- The sender starts a timer when it sends a segment, so it can re-transmit

Example exchange (sender to receiver, one RTT per segment):
- Segment n: sequence 1024, length 1024 -> ACK 2048
- Segment n+1: sequence 2048, length 1024 -> ACK 3072

Inefficient - the sender has to wait.

Slide 12

Flow Control: Sender - Congestion Window

- Uses the congestion window, cwnd, a sliding window to control the data flow
- A byte count giving the highest byte that can be sent without an ACK
- Transmit buffer size and advertised receive buffer size are important
- The ACK gives the next sequence number to receive AND the available space in the receive buffer
- A timer is kept for each packet

Sliding window diagram (sender's byte stream, left to right): data sent and ACKed | sent data buffered waiting for an ACK | unsent data that may be transmitted immediately | data to be sent, waiting for the window to open (the application writes here). A received ACK advances the trailing edge; the receiver's advertised window advances the leading edge; the sending host advances a marker as data is transmitted - the TCP cwnd slides.

Slide 13

Flow Control: Receiver - Lost Data

Receiver window diagram (receiver's byte stream, left to right): data given to the application (the application reads here) | ACKed but not yet given to the user | received but not ACKed | lost data | window slides. Markers: last ACK given, next byte expected (expected sequence number); the receiver's advertised window advances the leading edge.

If new data is received with a sequence number ≠ the next byte expected, a duplicate ACK is sent with the expected sequence number.

Slide 14

How It Works: TCP Slow Start

- Probe the network to get a rough estimate of the optimal congestion window size
- The larger the window size, the higher the throughput: Throughput = Window size / Round-trip time
- Exponentially increase the congestion window size until a packet is lost
- cwnd is initially 1 MTU, then increased by 1 MTU for each ACK received
- Send the 1st packet, get 1 ACK, increase cwnd to 2; send 2 packets, get 2 ACKs, increase cwnd to 4 - the rate doubles each RTT
- Time to reach cwnd size W: T_W = RTT * log2(W)  (not exactly slow!)

Figure: cwnd vs time - slow start (exponential increase), then congestion avoidance (linear increase), packet loss, retransmit timeout, then slow start again.
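
A small sketch of the two relations above - the window needed to fill a path and the time slow start takes to reach it (illustrative 1 Gbit/s, 200 ms values, not figures from the slide):

from math import log2

def slow_start_time(rtt_s: float, target_window_bytes: float, mss_bytes: int = 1460) -> float:
    """RTTs needed to grow cwnd from 1 MSS to the target window, doubling each RTT."""
    w_segments = target_window_bytes / mss_bytes
    return rtt_s * log2(w_segments)

rtt = 0.200                                # 200 ms round-trip time
bdp_bytes = 1e9 / 8 * rtt                  # window needed to fill a 1 Gbit/s path
print(f"window needed  : {bdp_bytes/1e6:.1f} MB")
print(f"slow-start time: {slow_start_time(rtt, bdp_bytes):.2f} s")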

Slide 15

TCP Slow Start Animated (Toby Rodwell, Dante)

- Growth of cwnd is related to the RTT (most important in the congestion-avoidance phase)
- Animation: source and sink exchanging segments as cwnd grows 1 -> 2 -> 4 …

Slide 16

How It Works: TCP Congestion Avoidance

- Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
- cwnd is increased by 1 segment per RTT, i.e. by 1/cwnd for each ACK - a linear increase in rate
- TCP takes packet loss as an indication of congestion!
- Multiplicative decrease: cut the congestion window size aggressively if a packet is lost; standard TCP reduces cwnd by 0.5
- The slow start to congestion avoidance transition is determined by ssthresh

Figure: cwnd vs time - slow start (exponential increase), congestion avoidance (linear increase), packet loss, retransmit timeout, slow start again. (A sketch of these dynamics follows below.)
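
A minimal sketch of the AIMD dynamics described on slides 14 and 16 (a toy random-loss model, not a real TCP implementation):

import random

def aimd_trace(rtts: int, ssthresh: float = 64.0, loss_prob: float = 0.01, seed: int = 1):
    """Per-RTT cwnd trace (in segments): slow start, then congestion avoidance,
    halving cwnd on loss, as described on slides 14 and 16."""
    random.seed(seed)
    cwnd, trace = 1.0, []
    for _ in range(rtts):
        trace.append(cwnd)
        if random.random() < loss_prob * cwnd:    # crude model: loss chance grows with rate
            ssthresh = cwnd / 2                   # multiplicative decrease
            cwnd = max(ssthresh, 1.0)
        elif cwnd < ssthresh:
            cwnd *= 2                             # slow start: double each RTT
        else:
            cwnd += 1                             # congestion avoidance: +1 MSS per RTT
    return trace

print(aimd_trace(20))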

Slide 17

TCP Fast Retransmit & Recovery

- Duplicate ACKs are due to lost segments or segments out of order
- Fast Retransmit: if the receiver transmits 3 duplicate ACKs (i.e. it received 3 additional segments without getting the one expected), the sender re-transmits the missing segment
- Set ssthresh to 0.5*cwnd - so enter the congestion-avoidance phase
- Set cwnd = 0.5*cwnd + 3 (the 3 dup ACKs); increase cwnd by 1 segment for each further duplicate ACK; keep sending new data if allowed by cwnd
- Set cwnd to half its original value on a new ACK - no need to go into "slow start" again
- At steady state, cwnd oscillates around the optimal window size
- With a retransmission timeout, slow start is triggered again

Figure: cwnd vs time - slow start, congestion avoidance, packet loss, retransmit timeout, slow start again.

Slide 18

TCP: Simple Tuning - Filling the Pipe

- Remember, TCP has to hold a copy of the data in flight
- The optimal (TCP buffer) window size depends on:
  - Bandwidth end to end, i.e. min(BW of the links), AKA the bottleneck bandwidth
  - Round Trip Time (RTT)
- The number of bytes in flight needed to fill the entire path is the Bandwidth*Delay Product: BDP = RTT * BW
- Setting the window correctly can increase throughput by orders of magnitude
- Windows are also used for flow control

Figure: sender/receiver timing diagram - segment time on the wire = bits in segment / BW; the ACK returns after one RTT. (A buffer-sizing sketch follows below.)
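
A minimal sketch of applying the BDP rule above when opening a socket (illustrative bandwidth/RTT values; SO_SNDBUF/SO_RCVBUF requests may be clamped by the kernel's configured maxima):

import socket

def bdp_bytes(bandwidth_bit_s: float, rtt_s: float) -> int:
    """Bandwidth*Delay Product: bytes in flight needed to fill the path."""
    return int(bandwidth_bit_s / 8 * rtt_s)

# Example: 1 Gbit/s bottleneck, 120 ms RTT -> ~15 MB of buffer
buf = bdp_bytes(1e9, 0.120)
print(f"BDP = {buf/1e6:.1f} MB")

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Request send/receive buffers of at least one BDP (the kernel may cap these).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf)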

Slide 19

Standard TCP (Reno) - What's the Problem?

TCP has 2 phases:
- Slow start: probe the network to estimate the available bandwidth; exponential growth
- Congestion avoidance: the main data-transfer phase; the transfer rate grows "slowly"

AIMD and high-bandwidth, long-distance networks: the poor performance of TCP in high-bandwidth wide-area networks is due in part to the TCP congestion-control algorithm:
- For each ACK in an RTT without loss: cwnd -> cwnd + a/cwnd  (Additive Increase, a = 1)
- For each window experiencing loss: cwnd -> cwnd - b*cwnd  (Multiplicative Decrease, b = 1/2)

Packet loss is a killer!!

Slide 20

TCP (Reno) - Details of Problem #1

The time for TCP to recover its throughput after one lost 1500-byte packet is given approximately by:

    T = C * RTT^2 / (2 * MSS)

where C is the line rate. For an RTT of ~200 ms at 1 Gbit/s this is of the order of half an hour.

Figure: time to recover (s, log scale) vs RTT (0-200 ms) for line rates of 10 Mbit/s, 100 Mbit/s, 1 Gbit/s, 2.5 Gbit/s and 10 Gbit/s.

Examples at 1 Gbit/s: UK 6 ms -> 1.6 s; Europe 25 ms -> 26 s; USA ~150 ms -> ~28 min.
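
A small sketch that evaluates this recovery-time expression for the RTTs quoted above (MSS = 1500 bytes, 1 Gbit/s line rate, as on the slide):

def reno_recovery_time(capacity_bit_s: float, rtt_s: float, mss_bytes: int = 1500) -> float:
    """Time for Reno to climb back from cwnd/2 to cwnd at +1 MSS per RTT:
    T = C * RTT^2 / (2 * MSS)."""
    return capacity_bit_s * rtt_s ** 2 / (2 * mss_bytes * 8)

for label, rtt in [("UK 6 ms", 0.006), ("Europe 25 ms", 0.025), ("USA ~200 ms", 0.200)]:
    t = reno_recovery_time(1e9, rtt)
    print(f"{label:>12}: {t:8.1f} s at 1 Gbit/s")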

Slide 21

Investigation of New TCP Stacks

The AIMD algorithm - Standard TCP (Reno):
- For each ACK in an RTT without loss: cwnd -> cwnd + a/cwnd  (Additive Increase, a = 1)
- For each window experiencing loss: cwnd -> cwnd - b*cwnd  (Multiplicative Decrease, b = 1/2)

High Speed TCP:
- a and b vary depending on the current cwnd, using a table
- a increases more rapidly with larger cwnd - returns to the 'optimal' cwnd size for the network path sooner
- b decreases less aggressively and, as a consequence, so does cwnd; the effect is that there is not such a decrease in throughput

Scalable TCP:
- a and b are fixed adjustments for the increase and decrease of cwnd
- a = 1/100 - the increase is greater than TCP Reno
- b = 1/8 - the decrease on loss is less than TCP Reno
- Scalable over any link speed

Fast TCP:
- Uses round-trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput

Others: HSTCP-LP, H-TCP, BiC-TCP
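
For comparison, the Reno and Scalable update rules just listed can be written directly as code (a sketch only; the real kernel implementations differ in detail):

def reno_update(cwnd: float, loss: bool) -> float:
    """Standard TCP: +1/cwnd per ACK, halve on loss (a=1, b=1/2)."""
    return cwnd * 0.5 if loss else cwnd + 1.0 / cwnd

def scalable_update(cwnd: float, loss: bool) -> float:
    """Scalable TCP: +0.01 per ACK, reduce by 1/8 on loss (a=1/100, b=1/8)."""
    return cwnd * (1 - 1/8) if loss else cwnd + 0.01

cwnd = 1000.0   # segments, just before a loss event
print(reno_update(cwnd, loss=True), scalable_update(cwnd, loss=True))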

Slide 22

Let's check out this theory about new TCP stacks:
Does it matter? Does it work?

Slide 23

Problem #1: Packet Loss - Is it important?

Slide 24

Packet Loss with New TCP Stacks

- TCP response function: throughput vs loss rate - the further to the right a curve sits, the faster the recovery
- Packets are dropped in the kernel to emulate loss
- Test networks: MB-NG rtt 6 ms, DataTAG rtt 120 ms
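
The theory curve for standard TCP on these plots is usually the Reno response-function approximation; a sketch of evaluating it (the 1.22 constant and the example loss rates are textbook values, not figures taken from the slide):

from math import sqrt

def reno_throughput_bit_s(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Mathis et al. approximation: BW ~ (MSS / RTT) * 1.22 / sqrt(p)."""
    return (mss_bytes * 8 / rtt_s) * 1.22 / sqrt(loss_rate)

for p in (1e-3, 1e-5, 1e-7):
    bw = reno_throughput_bit_s(1500, 0.120, p)    # DataTAG-like 120 ms path
    print(f"loss 1 in {int(1/p)}: {bw/1e6:.1f} Mbit/s")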

Slide 25

Packet Loss and New TCP Stacks - TCP Response Function

- UKLight London-Chicago-London, rtt 177 ms, 2.6.6 kernel
- Agreement with theory is good
- Some new stacks are good at high loss rates

Figures (iperf, sculcc1-chi-2, 13 Jan 05): TCP achievable throughput (Mbit/s, shown on both log and linear scales) vs packet drop rate (1 in n, n from 100 to 10^8) for standard TCP (1500-byte MTU), HSTCP, Scalable, H-TCP, BIC-TCP, Westwood and Vegas, together with the theory curves for standard and Scalable TCP.

Slide 26

High Throughput Demonstrations

- Path: Manchester (man03) - Cisco 7609 - Cisco GSR - 2.5 Gbit SDH MB-NG core - Cisco GSR - Cisco 7609 - London (lon01), rtt 6.2 ms; the same layout applies to the (Geneva)-(Chicago) path, rtt 128 ms
- Hosts: dual Xeon 2.2 GHz machines on 1 GEth at each end
- Method: send data with TCP, drop packets in the kernel, monitor TCP with Web100

Slide 27

High Performance TCP - MB-NG

- Drop 1 in 25,000; rtt 6.2 ms; recovery in 1.6 s
- Web100 traces shown for Standard, HighSpeed and Scalable TCP

Slide 28

High Performance TCP - DataTAG

- Different TCP stacks tested on the DataTAG network, rtt 128 ms, drop 1 in 10^6
- HighSpeed: rapid recovery
- Scalable: very fast recovery
- Standard: recovery would take ~20 mins

Slide 29

FAST Demo via OMNInet and DataTAG
(J. Mambretti, F. Yeh - Northwestern; A. Adriaanse, C. Jin, D. Wei - Caltech; S. Ravot - Caltech/CERN; FAST demo by Cheng Jin and David Wei, Caltech)

Network diagram (7,000 km end to end):
- NU-E (Leverone) workstations - 2 x GE - Nortel Passport 8600 - photonic switch - OMNInet - photonic switch - Nortel Passport 8600 - 2 x GE - StarLight, Chicago
- StarLight - Caltech Cisco 7609 - 10GE - Alcatel 1670 - OC-48 DataTAG link - Alcatel 1670 - 10GE - CERN Cisco 7609 - 2 x GE - workstations at CERN, Geneva
- FAST display in San Diego; both Layer 2 and Layer 2/3 paths were used

Slide 30

FAST TCP vs newReno

- Traffic flow channel #1: newReno - utilization 70%
- Traffic flow channel #2: FAST - utilization 90%

Slide 31

Problem #2: Is TCP fair?
Look at Round Trip Times & the Maximum Transfer Unit (MTU).

Slide 32

MTU and Fairness

- Two TCP streams share a 1 Gb/s bottleneck, RTT = 117 ms
- MTU = 3000 bytes: average throughput over a period of 7000 s = 243 Mb/s
- MTU = 9000 bytes: average throughput over a period of 7000 s = 464 Mb/s
- Link utilization: 70.7%

Setup: Host #1 and Host #2 at CERN (GVA), 1 GE each into a GbE switch, POS 2.5 Gbps across to Starlight (Chi), 1 GE to Host #1 and Host #2 there; the shared 1 GE is the bottleneck.

Figure: throughput (Mbit/s) vs time (s) of the two streams with different MTU sizes sharing the 1 Gbps bottleneck, with the averages over the life of each connection. (Sylvain Ravot, DataTAG 2003)

Slide 33

RTT and Fairness

- Two TCP streams share a 1 Gb/s bottleneck; MTU = 9000 bytes
- CERN <-> Sunnyvale, RTT = 181 ms: average throughput over a period of 7000 s = 202 Mb/s
- CERN <-> Starlight, RTT = 117 ms: average throughput over a period of 7000 s = 514 Mb/s
- Link utilization = 71.6%

Setup: hosts at CERN (GVA) connect via a GbE switch and POS 2.5 Gb/s to Starlight (Chi), and on via POS 10 Gb/s / 10GE to Sunnyvale; the shared 1 GE is the bottleneck.

Figure: throughput (Mbit/s) vs time (s) of the two streams with different RTTs sharing the 1 Gbps bottleneck, with the averages over the life of each connection. (Sylvain Ravot, DataTAG 2003)

Slide 34

Problem #n: Do TCP flows share the bandwidth?

Slide 35

Test of TCP Sharing: Methodology (1 Gbit/s)

- Chose 3 paths from SLAC (California): Caltech (10 ms), Univ. Florida (80 ms), CERN (180 ms)
- Used iperf/TCP and UDT/UDP to generate traffic
- Each run was 16 minutes, in 7 regions (2-minute and 4-minute segments)
- Ping at 1/s provides ICMP traffic alongside the iperf or UDT flows across the TCP/UDP bottleneck between SLAC and Caltech/UFL/CERN
(Les Cottrell, PFLDnet 2005)

Slide 36

TCP Reno - Single Stream

- Low performance on fast long-distance paths: AIMD (add a = 1 packet to cwnd per RTT, decrease cwnd by factor b = 0.5 on congestion)
- Net effect: recovers slowly, does not effectively use the available bandwidth, so poor throughput and unequal sharing
- SLAC to CERN measurements: congestion has a dramatic effect and recovery is slow, so the recovery rate needs to be increased
- RTT increases when the flow achieves its best throughput
- Remaining flows do not take up the slack when a flow is removed
(Les Cottrell, PFLDnet 2005)

Slide 37

FAST TCP

- As well as packet loss, FAST uses RTT to detect congestion
- RTT is very stable: σ(RTT) ~ 9 ms, vs 37 ± 0.14 ms for the others
- SLAC-CERN: big drops in throughput which take several seconds to recover from
- The 2nd flow never gets an equal share of the bandwidth

Slide 38

Hamilton TCP

- One of the best performers: throughput is high, big effects on RTT when it achieves best throughput, flows share equally
- Appears to need >1 flow to achieve best throughput
- SLAC-CERN: two flows share equally; >2 flows appears less stable

Slide 39

Problem #n+1: To SACK or not to SACK?

Slide 40

The SACK Algorithm

SACK rationale:
- Non-contiguous blocks of data can be ACKed
- The sender re-transmits just the lost packets
- Helps when multiple packets are lost in one TCP window

The SACK processing is inefficient for large bandwidth-delay products:
- The sender write queue (a linked list) is walked for each SACK block, to mark lost packets, and to re-transmit
- Processing takes so long that the input queue becomes full and timeouts occur

Figures: HS-TCP with standard SACK processing vs updated SACK processing, rtt 150 ms; hosts Dell 1650 2.8 GHz, PCI-X 133 MHz, Intel Pro/1000. (Doug Leith, Yee-Ting Li)
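
A back-of-the-envelope sketch (assuming delayed ACKs and 3 SACK blocks per ACK, which are illustrative values) of why walking the whole write queue per SACK block cannot keep up at these bandwidth-delay products:

def sack_walk_cost(capacity_bit_s: float, rtt_s: float, mss_bytes: int = 1500,
                   sack_blocks: int = 3) -> float:
    """Rough cost of walking the whole write queue for each SACK block on every ACK."""
    in_flight = capacity_bit_s * rtt_s / (mss_bytes * 8)   # segments in the write queue
    acks_per_s = capacity_bit_s / (mss_bytes * 8) / 2      # assume one ACK per 2 segments
    return in_flight * sack_blocks * acks_per_s            # list-node visits per second

print(f"~{sack_walk_cost(1e9, 0.150):.2e} list-node visits/s at 1 Gbit/s, 150 ms rtt")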

Slide 41

SACK …

- Look into what is happening at the algorithmic level with Web100
- Strange hiccups in cwnd; the only correlation is with SACK arrivals
- Test: Scalable TCP on MB-NG with 200 Mbit/s CBR background traffic
(Yee-Ting Li)

Slide 42

Real Applications on Real Networks

- Disk-to-disk applications on real networks: memory-to-memory tests; transatlantic disk-to-disk at Gigabit speeds; HEP & VLBI at SC|05
- Remote computing farms: the effect of TCP, the effect of distance
- Radio astronomy e-VLBI: left for the talk later in the meeting

Slide 43

iperf Throughput + Web100

- SuperMicro on the MB-NG network, HighSpeed TCP: line speed 940 Mbit/s; DupACKs < 10 (expect ~400)
- BaBar host on the production network, standard TCP: 425 Mbit/s; DupACKs 350-400, i.e. re-transmits

Slide 44

Applications: Throughput (Mbit/s)

- HighSpeed TCP, 2 GByte file, RAID5, SuperMicro + SuperJANET
- Applications compared: bbcp, bbftp, Apache, GridFTP
- Previous work used RAID0 (not disk limited)

Slide 45

Transatlantic Disk-to-Disk Transfers with UKLight - SuperComputing 2004

Slide 46

bbftp: What Else is Going On? (Scalable TCP)

- SuperMicro + SuperJANET: instantaneous rate 0-550 Mbit/s
- The congestion window shows duplicate ACKs; is the throughput variation not TCP related?
  - Disk speed / bus transfer
  - Application architecture
- BaBar + SuperJANET: instantaneous rate 200-600 Mbit/s
- Disk-to-memory ~590 Mbit/s - remember the end host

Slide 47

SC2004 UKLight Overview

Network diagram (K2 and Ci denote routers in the original figure):
- Manchester MB-NG 7600 OSR - UKLight 10G (four 1 GE channels) - ULCC UKLight - UCL HEP / UCL network
- ULCC - UKLight 10G - Chicago Starlight - NLR lambda (NLR-PITT-STAR-10GE-16) - SC2004 show floor
- Amsterdam - SURFnet / EuroLink 10G (two 1 GE channels) - Chicago Starlight
- SC2004 booths: Caltech booth (UltraLight IP, Caltech 7600), SLAC booth (Cisco 6509)

Slide 48

Transatlantic Ethernet: TCP Throughput Tests

- Supermicro X5DPE-G2 PCs, dual 2.9 GHz Xeon CPU, FSB 533 MHz
- 1500-byte MTU, 2.6.6 Linux kernel
- Memory-to-memory TCP throughput, standard TCP
- Wire-rate throughput of 940 Mbit/s
- Work in progress to study: implementation detail, advanced stacks, effect of packet loss, sharing

Figures (Web100): instantaneous and average achieved TCP throughput (Mbit/s) and current cwnd vs time, for the full run (~140 s) and for the first 10 seconds.

Slide 49

SC2004 Disk-to-Disk bbftp

- The bbftp file transfer program uses TCP/IP
- UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
- MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off
- Moving a 2 GByte file; Web100 plots of the transfers:
  - Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
  - Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
- Disk-TCP-disk at 1 Gbit/s

Figures (Web100): instantaneous and average achieved TCP throughput (Mbit/s) and current cwnd vs time for the standard and Scalable TCP transfers.

Slide 50

Network & Disk Interactions (work in progress)

Hosts:
- Supermicro X5DPE-G2 motherboards, dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory
- 3Ware 8506-8 controller on a 133 MHz PCI-X bus configured as RAID0
- Six 74.3 GByte Western Digital Raptor WD740 SATA disks, 64 kbyte stripe size

Measure memory to RAID0 transfer rates with & without UDP traffic on the network.

Figures: write throughput (Mbit/s) vs trial number for a 1 GByte write (64 kbyte blocks) with no UDP, with 1500-byte-MTU UDP and with 9000-byte-MTU UDP; plus % CPU in system mode on CPUs 3+4 vs CPUs 1+2 for 8 k and 64 k blocks, with linear fits y ≈ 178 - 1.05x.

Results:
- Disk write alone: 1735 Mbit/s
- Disk write + 1500-MTU UDP: 1218 Mbit/s - a drop of 30%
- Disk write + 9000-MTU UDP: 1400 Mbit/s - a drop of 19%

Slide 51

Transatlantic Transfers with UKLight - SuperComputing 2005

Slide 52

ESLEA and UKLight

- 6 x 1 Gbit transatlantic Ethernet layer-2 paths: UKLight + NLR
- Disk-to-disk transfers with bbcp, Seattle to the UK
- Set the TCP buffer and application to give ~850 Mbit/s; one stream of data ran at 840-620 Mbit/s
- Streamed UDP VLBI data UK to Seattle at 620 Mbit/s

Figures (SC|05): rate (Mbit/s) vs time of day (16:00-23:00) for hosts sc0501-sc0504, plus the aggregate UKLight traffic with the reverse TCP direction marked.

Slide 53

SC|05 - SLAC 10 Gigabit Ethernet

- 2 lightpaths: one routed over ESnet, one Layer 2 over UltraScience Net
- 6 Sun V20Z systems per λ: 3 transmit, 3 receive
- dCache remote disk data access: 100 processes per node; each node sends or receives; one data stream is 20-30 Mbit/s
- Used Neterion NICs & Chelsio TOE; data also sent to StorCloud using fibre channel links
- Traffic on the 10 GE link for 2 nodes: 3-4 Gbit/s per node, 8.5-9 Gbit/s on the trunk

Slide 54

Remote Computing Farms in the ATLAS TDAQ Experiment

Slide 55

ATLAS Remote Farms - Network Connectivity

Slide 56

ATLAS Application Protocol

- Event request: the Event Filter Daemon (EFD) requests an event from the SFI; the SFI replies with the event (~2 Mbytes)
- Processing of the event
- Return of the computation: the EF asks the SFO for buffer space, the SFO sends OK, and the EF transfers the results of the computation
- tcpmon: an instrumented TCP request-response program that emulates the Event Filter EFD-to-SFI communication

Sequence diagram (EFD vs SFI and SFO, time running downwards): request event -> send event data -> process event -> request buffer -> send OK -> send processed event, repeated; the request-response time is histogrammed.
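
A minimal sketch of this request-response pattern (hypothetical host, port and length-prefixed framing - not the real tcpmon or ATLAS code); it is exactly this pattern that exposes the slow-start and idle-cwnd effects shown on the following slides:

import socket, struct, time

def request_events(host: str, port: int, n_events: int = 10, req_size: int = 64):
    """Send small requests and read large (~2 MB) responses, timing each exchange."""
    with socket.create_connection((host, port)) as s:
        for _ in range(n_events):
            t0 = time.time()
            s.sendall(b"R" * req_size)                    # 64-byte event request
            (resp_len,) = struct.unpack("!I", s.recv(4))  # sketch: assumes the 4-byte length arrives in one read
            remaining = resp_len
            while remaining:                              # read the ~2 MB event data
                chunk = s.recv(min(65536, remaining))
                if not chunk:
                    raise ConnectionError("peer closed")
                remaining -= len(chunk)
            print(f"event received in {(time.time() - t0) * 1e3:.1f} ms")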

Slide 57

tcpmon: TCP Activity, Manchester-CERN Request-Response

- Round-trip time 20 ms; 64-byte request (green), 1-Mbyte response (blue)
- TCP is in slow start: the 1st event takes 19 rtt, or ~380 ms
- The TCP congestion window gets re-set on each request: the TCP stack follows RFC 2581 & RFC 2861, reducing cwnd after inactivity
- Even after 10 s, each response takes 13 rtt, or ~260 ms
- Transfer achievable throughput: 120 Mbit/s

Figures (Web100): data bytes out/in vs time, data bytes out with CurCwnd vs time, and achievable throughput (Mbit/s) with cwnd vs time.
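
A rough model (assuming delayed ACKs so cwnd grows about 1.5x per RTT, an initial cwnd of 2 segments, and ignoring the request and handshake round trips) of why a 1-Mbyte response needs on the order of 10-20 round trips while the window is still opening:

from math import ceil

def rtts_to_send(response_bytes: int, mss: int = 1460, init_cwnd: int = 2,
                 growth: float = 1.5) -> int:
    """RTTs for slow start to deliver a response of the given size."""
    segs = ceil(response_bytes / mss)
    cwnd, sent, rtts = float(init_cwnd), 0.0, 0
    while sent < segs:
        sent += cwnd          # one window of segments per RTT
        cwnd *= growth        # delayed-ACK slow-start growth
        rtts += 1
    return rtts

print(rtts_to_send(1_000_000))   # roughly a dozen-plus RTTs before the window is open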

Slide 58

tcpmon: TCP Activity, Manchester-CERN Request-Response - TCP Stack Tuned

- Round-trip time 20 ms; 64-byte request (green), 1-Mbyte response (blue)
- TCP starts in slow start: the 1st event takes 19 rtt, or ~380 ms
- The TCP congestion window now grows nicely; a response takes 2 rtt after ~1.5 s
- Rate ~10 events/s (with a 50 ms wait between events)
- Transfer achievable throughput grows to 800 Mbit/s
- Data is transferred WHEN the application requires the data

Figures (Web100): data bytes out/in vs time, achievable throughput (Mbit/s) with cwnd vs time, and packets out/in with CurCwnd vs time.

Slide 59

tcpmon: TCP Activity, Alberta-CERN Request-Response - TCP Stack Tuned

- Round-trip time 150 ms; 64-byte request (green), 1-Mbyte response (blue)
- TCP starts in slow start: the 1st event takes 11 rtt, or ~1.67 s
- The TCP congestion window is in slow start to ~1.8 s, then congestion avoidance
- A response arrives in 2 rtt after ~2.5 s; rate 2.2 events/s (with a 50 ms wait)
- Transfer achievable throughput grows slowly from 250 to 800 Mbit/s

Figures (Web100): data bytes out/in vs time, achievable throughput (Mbit/s) with cwnd vs time, and packets out/in with CurCwnd vs time.

Slide 60

Summary & Conclusions

- Standard TCP is not optimum for high-throughput, long-distance links
- Packet loss is a killer for TCP: check the campus links & equipment and the access links to the backbones; users need to collaborate with the campus network teams; use the Dante PERT
- New stacks are stable and give better response & performance, but you still need to set the TCP buffer sizes! Check other kernel settings, e.g. the window-scale maximum, and watch for "TCP stack implementation enhancements"
- TCP tries to be fair: a large MTU has an advantage; short distances (small RTT) have an advantage
- TCP does not share bandwidth well with other streams
- The end hosts themselves matter: plenty of CPU power is required for the TCP/IP stack as well as the application; packets can be lost in the IP stack due to lack of processing power; the interaction between hardware, protocol processing and the disk sub-system is complex
- Application architecture & implementation are also important; the TCP protocol dynamics strongly influence the behaviour of the application
- Users are now able to perform sustained 1 Gbit/s transfers

Slide 61

More Information - Some URLs (1)

- UKLight web site: http://www.uklight.ac.uk
- MB-NG project web site: http://www.mb-ng.net/
- DataTAG project web site: http://www.datatag.org/
- UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
- Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
- "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special Issue 2004, http://www.hep.man.ac.uk/~rich/
- TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
- TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
- PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
- Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

Slide 62

More Information - Some URLs (2)

- Lectures, tutorials etc. on TCP/IP:
  - www.nv.cc.va.us/home/joney/tcp_ip.htm
  - www.cs.pdx.edu/~jrb/tcpip.lectures.html
  - www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
  - www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
  - www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
  - www.jbmelectronics.com/tcp.htm
- Encyclopaedia: http://www.freesoft.org/CIE/index.htm
- TCP/IP resources: www.private.org.il/tcpip_rl.html
- Understanding IP addresses: http://www.3com.com/solutions/en_US/ncs/501302.html
- Configuring TCP (RFC 1122): ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
- Assigned protocols, ports etc. (RFC 1010): http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols

Slide 63

Any Questions?

Slide 64

Backup Slides

Slide 65

Latency Measurements

- UDP/IP packets are sent between back-to-back systems; they are processed in a similar manner to TCP/IP but are not subject to the flow-control & congestion-avoidance algorithms; the UDPmon test program is used
- Latency: round-trip times are measured using request-response UDP frames, as a function of frame size
- The slope is given by the sum over the data paths of the inverse transfer rates:

      s = sum over data paths of 1 / (db/dt)

  i.e. mem-mem copy(s) + PCI + Gigabit Ethernet + PCI + mem-mem copy(s)
- The intercept indicates processing times + hardware latencies
- Histograms of 'singleton' measurements tell us about the behaviour of the IP stack, the way the hardware operates, and interrupt coalescence

Slide 66

Throughput Measurements

UDP throughput: send a controlled stream of UDP frames spaced at regular intervals.

Protocol (sender vs receiver, time running downwards):
- Sender: zero stats -> OK done; send n-byte data frames at regular (wait-time) intervals, recording the time to send; signal the end of the test -> OK done; get the remote statistics
- Receiver: records the time to receive and the inter-packet time (histogram); sends back statistics: number received, number lost + the loss pattern, number out-of-order, CPU load & number of interrupts, 1-way delay
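
A minimal sketch of the paced-sender side of such a test (hypothetical host/port and parameters; the real UDPmon tool collects the full statistics described above):

import socket, time

def send_paced_udp(host: str, port: int, n_packets: int = 10_000,
                   payload_bytes: int = 1472, wait_us: float = 12.0):
    """Send n_packets UDP frames of payload_bytes, spaced wait_us apart,
    and report the achieved send rate (cf. the UDPmon method on this slide)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"\x00" * payload_bytes
    gap = wait_us * 1e-6
    start = time.perf_counter()
    next_send = start
    for _ in range(n_packets):
        while time.perf_counter() < next_send:      # busy-wait for the inter-packet gap
            pass
        sock.sendto(payload, (host, port))
        next_send += gap
    elapsed = time.perf_counter() - start
    rate = n_packets * payload_bytes * 8 / elapsed
    print(f"sent {n_packets} frames in {elapsed:.3f} s -> {rate/1e6:.1f} Mbit/s")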

Slide 67

PCI Bus & Gigabit Ethernet Activity - Possible Bottlenecks

- PCI activity measured with a logic analyser:
  - PCI probe cards in the sending PC
  - A Gigabit Ethernet fibre probe card
  - PCI probe cards in the receiving PC

Diagram: sending PC (CPU, memory, chipset, NIC on its PCI bus) -> Gigabit Ethernet probe -> receiving PC (NIC, chipset, memory, CPU on its PCI bus), all feeding the logic analyser display.

Slide 68

Network Switch Limits Behaviour

- End-to-end UDP packets from udpmon
- Only 700 Mbit/s throughput
- Lots of packet loss
- The packet-loss distribution shows the throughput is limited

Figures (w05gva-gig6, 29 May 04, UDP): received wire rate (Mbit/s) and % packet loss vs spacing between frames (µs) for packet sizes from 50 to 1472 bytes, and 1-way delay (µs) vs packet number at a 12 µs wait.

Slide 69

"Server Quality" Motherboards

- SuperMicro P4DP8-2G (P4DP6): dual Xeon, 400/533 MHz front-side bus
- 6 PCI / PCI-X slots on 4 independent PCI buses: 64-bit 66 MHz PCI, 100 MHz PCI-X, 133 MHz PCI-X
- Dual Gigabit Ethernet
- Adaptec AIC-7899W dual-channel SCSI
- UDMA/100 bus-master EIDE channels: data transfer rates of 100 MB/sec burst

Slide 70

"Server Quality" Motherboards

- Boston/Supermicro H8DAR: two dual-core Opterons
- 200 MHz DDR memory; theory BW: 6.4 Gbit
- HyperTransport
- 2 independent PCI buses: 133 MHz PCI-X
- 2 x Gigabit Ethernet; SATA; (PCI-e)

Slide 71

10 Gigabit Ethernet: UDP Throughput

- A 1500-byte MTU gives ~2 Gbit/s; used a 16144-byte MTU, max user length 16080
- DataTAG Supermicro PCs: dual 2.2 GHz Xeon CPU, FSB 400 MHz, PCI-X mmrbc 512 bytes - wire-rate throughput of 2.9 Gbit/s
- CERN OpenLab HP Itanium PCs: dual 1.0 GHz 64-bit Itanium CPU, FSB 400 MHz, PCI-X mmrbc 4096 bytes - wire rate of 5.7 Gbit/s
- SLAC Dell PCs: dual 3.0 GHz Xeon CPU, FSB 533 MHz, PCI-X mmrbc 4096 bytes - wire rate of 5.4 Gbit/s

Figure (an-al 10GE Xsum 512kbuf MTU16114, 27 Oct 03): received wire rate (Mbit/s) vs spacing between frames (µs) for packet sizes from 1472 to 16080 bytes.

Slide 72

10 Gigabit Ethernet: Tuning PCI-X

- 16080-byte packets every 200 µs; Intel PRO/10GbE LR adapter
- PCI-X bus occupancy vs mmrbc (max memory read byte count): measured times, and times based on the PCI-X timings from the logic analyser
- Expected throughput ~7 Gbit/s; measured 5.7 Gbit/s
- Logic-analyser trace segments: CSR access, PCI-X sequence, data transfer, interrupt & CSR update, for mmrbc of 512, 1024, 2048 and 4096 bytes (5.7 Gbit/s at 4096)

Figures (kernel 2.6.1#17, HP Itanium, Intel 10GE, Feb 04; and DataTAG Xeon 2.2 GHz): PCI-X transfer time (µs) and rate (Gbit/s) vs max memory read byte count - measured rate, rate from the expected time, and the maximum PCI-X throughput.

Slide 73

Congestion control: ACK clocking

Slide 74

End Hosts & NICs: CERN-nat-Manchester

- Use UDP packets to characterise the host, NIC & network: request-response latency, throughput, packet loss, re-ordering
- Test host: SuperMicro P4DP8 motherboard, dual Xeon 2.2 GHz CPU, 400 MHz system bus, 64-bit 66 MHz PCI / 133 MHz PCI-X bus

Figures (pcatb121-nat-gig6, 13 Aug 04): received wire rate (Mbit/s), % packet loss and number of re-ordered packets vs spacing between frames (µs) for packet sizes from 50 to 1472 bytes, plus latency histograms (N(t) vs latency, ~20900-21500 µs) for 256-, 512- and 1400-byte packets.

Conclusions:
- The network can sustain 1 Gbps of UDP traffic
- The average server can lose smaller packets
- Packet loss is caused by lack of power in the PC receiving the traffic
- Out-of-order packets are due to WAN routers; lightpaths look like extended LANs and have no re-ordering

Slide 75

tcpdump / tcptrace

- tcpdump: dump all TCP header information for a specified source/destination - ftp://ftp.ee.lbl.gov/
- tcptrace: format tcpdump output for analysis using xplot - http://www.tcptrace.org/
- NLANR TCP Testrig: a nice wrapper for the tcpdump and tcptrace tools - http://www.ncne.nlanr.net/TCP/testrig/

Sample use:
  tcpdump -s 100 -w /tmp/tcpdump.out host hostname
  tcptrace -Sl /tmp/tcpdump.out
  xplot /tmp/a2b_tsg.xpl

Slide 76

tcptrace and xplot

- The X axis is time; the Y axis is sequence number
- The slope of this curve gives the throughput over time
- The xplot tool makes it easy to zoom in

Slide 77

Zoomed-In View

- Green line: ACK values received from the receiver
- Yellow line: tracks the receive window advertised by the receiver
- Green ticks: track the duplicate ACKs received
- Yellow ticks: track window advertisements that were the same as the last advertisement
- White arrows: represent segments sent
- Red arrows (R): represent retransmitted segments

Slide 78

TCP Slow Start