5th Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory
R. Hughes-Jones, The University of Manchester

Slide 1
TCP/IP on High Bandwidth Long Distance Paths
or: So TCP works … but still the users ask: Where is my throughput?
Richard Hughes-Jones, The University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks” and look for Haystack
Slide 2
Layers & IP
Slide 3
The Network Layer 3: IP
IP layer properties:
- Provides best-effort delivery; it is unreliable: packets may be lost, duplicated, or arrive out of order
- Connectionless
- Provides logical addresses
- Provides routing
- Demultiplexes data on the protocol number
Slide 4
The Internet datagram
The IPv4 header (20 bytes minimum) sits between the frame header and the transport payload, with the frame FCS at the end:

  bit:  0      4      8               16     19                 31
       | Vers | Hlen | Type of serv. |      Total length        |
       |      Identification         | Flags | Fragment offset  |
       | TTL  |   Protocol   |          Header checksum         |
       |                  Source IP address                     |
       |                Destination IP address                  |
       |        IP Options (if any)             |   Padding     |
Slide 5
IP Datagram Format (cont.)
- Type of Service (TOS): now being used for QoS
- Total length: length of the datagram in bytes, includes header and data
- Time to live (TTL): specifies how long the datagram is allowed to remain in the internet; routers decrement it by 1, and when TTL = 0 the router discards the datagram – prevents infinite loops
- Protocol: specifies the format of the data area; protocol numbers are administered by a central authority to guarantee agreement, e.g. ICMP=1, TCP=6, UDP=17 …
- Source & destination IP address (32 bits each): contain the IP addresses of the sender and intended recipient
- Options (variable length): mainly used to record a route, or timestamps, or to specify routing
A short parsing sketch follows below.
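As an illustration of the layout above (a minimal sketch, not part of the original slides), the fixed 20-byte header can be unpacked in Python with the struct module:

  import struct

  def parse_ipv4_header(data):
      # Fixed 20-byte IPv4 header (RFC 791); "!" = network byte order
      (vers_hlen, tos, total_len, ident, flags_frag,
       ttl, proto, cksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", data[:20])
      return {
          "version": vers_hlen >> 4,
          "hlen_words": vers_hlen & 0x0F,        # header length in 32-bit words
          "tos": tos,
          "total_length": total_len,             # header + data, bytes
          "identification": ident,
          "flags": flags_frag >> 13,
          "fragment_offset": flags_frag & 0x1FFF,
          "ttl": ttl,
          "protocol": proto,                     # e.g. ICMP=1, TCP=6, UDP=17
          "checksum": cksum,
          "src": ".".join(map(str, src)),
          "dst": ".".join(map(str, dst)),
      }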
Slide 6
The Transport Layer 4: UDP
UDP provides:
- Connectionless service over IP: no setup/teardown; one packet at a time
- Minimal overhead – high performance
- Best-effort delivery; it is unreliable: packets may be lost, duplicated, or arrive out of order
The application is responsible for data reliability, flow control and error handling.
Slide 7
UDP Datagram format
- Source/destination port: port numbers identify the sending & receiving processes; port number & IP address allow any application on the Internet to be uniquely identified
- Ports can be static or dynamic: static ports (< 1024) are assigned centrally and known as well-known ports
- Message length: in bytes, includes the UDP header and data (min 8, max 65,535)
The UDP header (8 bytes) sits between the IP header and the application data:

       | Source port      | Destination port |
       | UDP message len  | Checksum (opt.)  |

A minimal socket example follows below.
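To make the port fields concrete, here is a minimal sketch (not from the talk; the loopback address, port 5001 and payload size are arbitrary) of a UDP sender and receiver:

  import socket

  # Receiver: bind to a chosen port and wait for one datagram
  rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  rx.bind(("127.0.0.1", 5001))

  # Sender: the OS assigns a dynamic source port automatically
  tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  tx.sendto(b"x" * 1472, ("127.0.0.1", 5001))  # 1472 B payload fills a 1500-byte MTU

  data, (src_ip, src_port) = rx.recvfrom(2048)
  print(len(data), "bytes from", src_ip, "port", src_port)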
Slide 8
The Transport Layer 4: TCP
TCP (RFC 793, RFC 1122) provides:
- Connection-oriented service over IP: during setup the two ends agree on details; explicit teardown; multiple connections allowed
- Reliable end-to-end byte-stream delivery over an unreliable network
It takes care of lost packets, duplicated packets and out-of-order packets.
TCP provides data buffering, flow control, error detection & handling, and limits network congestion.
Slide 9
The TCP Segment Format
The TCP header (20 bytes minimum) sits between the IP header and the application data:

       | Source port           | Destination port  |
       |               Sequence number             |
       |           Acknowledgement number          |
       | Hlen | Resv | Code |       Window         |
       |      Checksum         |    Urgent ptr     |
       |    Options (if any)           |  Padding  |
Slide 10
TCP Segment Format – cont.
- Source/Dest port: TCP port numbers to identify the applications at both ends of the connection
- Sequence number: the first byte in this segment of the sender’s byte stream
- Acknowledgement: identifies the number of the byte the sender of this (ACK) segment expects to receive next
- Code: used to determine the segment purpose, e.g. SYN, ACK, FIN, URG
- Window: advertises how much data this station is willing to accept; can depend on the buffer space remaining
- Options: used for window scaling, SACK, timestamps, maximum segment size etc.
Slide 11
TCP – providing reliability
- Positive acknowledgement (ACK) of each received segment
- The sender keeps a record of each segment sent
- The sender awaits an ACK – “I am ready to receive byte 2048 and beyond”
- The sender starts a timer when it sends a segment – so it can re-transmit
[Diagram: the sender transmits segment n (sequence 1024, length 1024) and one RTT later receives its ACK (“Ack 2048”); then segment n+1 (sequence 2048, length 1024) and, an RTT later, “Ack 3072”.]
Inefficient – the sender has to wait.
Slide 12
Flow Control: Sender – Congestion Window
- Uses the congestion window, cwnd, a sliding window to control the data flow
- A byte count giving the highest byte that can be sent without an ACK
- The transmit buffer size and the advertised receive buffer size are important
- The ACK gives the next sequence number to receive AND the available space in the receive buffer
- A timer is kept for each packet
[Diagram: the sender’s byte stream – data sent and ACKed; sent data buffered waiting for an ACK; unsent data that may be transmitted immediately; data waiting for the window to open (the application writes here). The sending host advances a marker as data is transmitted; a received ACK advances the trailing edge; the receiver’s advertised window advances the leading edge as the TCP cwnd slides.]
Slide 13
Flow Control: Receiver – Lost Data
[Diagram: the receiver’s byte stream – data given to the application (the application reads here); data ACKed but not yet given to the user; data received but not ACKed; a gap of lost data; the last ACK given marks the next byte expected (the expected sequence no.). The window slides and the receiver’s advertised window advances the leading edge.]
If new data is received with a sequence number ≠ the next byte expected, a duplicate ACK is sent with the expected sequence number.
Slide 14
How it works: TCP Slowstart
- Probe the network – get a rough estimate of the optimal congestion window size
- The larger the window size, the higher the throughput: Throughput = Window size / Round-trip Time
- Exponentially increase the congestion window size until a packet is lost
- cwnd is initially 1 MTU, then increased by 1 MTU for each ACK received: send the 1st packet, get 1 ACK, increase cwnd to 2; send 2 packets, get 2 ACKs, increase cwnd to 4
- Time to reach cwnd size W: T_W = RTT × log2(W) (not exactly slow! – a worked example follows below)
- The rate doubles each RTT
[Plot: cwnd vs time – slow start (exponential increase), congestion avoidance (linear increase), packet loss, timeout, then retransmit and slow start again.]
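A quick check of the T_W = RTT × log2(W) estimate (a sketch with illustrative numbers, matching the 1 Gbit/s, ~200 ms path discussed later):

  import math

  rtt = 0.200                # round-trip time, s
  bw = 1e9                   # bottleneck bandwidth, bit/s
  pkt_bits = 1500 * 8        # packet size, bits

  W = bw * rtt / pkt_bits    # window (in packets) needed to fill the pipe
  t_w = rtt * math.log2(W)   # slow-start time to reach W

  print(f"W = {W:.0f} packets, time to reach W = {t_w:.2f} s")
  # -> W = 16667 packets, time to reach W = 2.80 s (about 14 RTTs)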
Slide 15
TCP Slowstart Animated (Toby Rodwell, Dante)
The growth of cwnd is related to the RTT (most important in the congestion avoidance phase).
[Animation: packets between source and sink as cwnd grows 1 → 2 → 4.]
Slide 16
How it works: TCP Congestion Avoidance
- Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth; cwnd is increased by 1 segment per RTT, i.e. by 1/cwnd for each ACK – a linear increase in rate
- TCP takes packet loss as an indication of congestion!
- Multiplicative decrease: cut the congestion window size aggressively if a packet is lost; standard TCP reduces cwnd by 0.5
- The slow start to congestion avoidance transition is determined by ssthresh
[Plot: cwnd vs time – slow start (exponential increase), congestion avoidance (linear increase), packet loss, timeout, then retransmit and slow start again.]
Slide 17
TCP Fast Retransmit & Recovery
- Duplicate ACKs are due to lost segments or segments out of order
- Fast Retransmit: if the receiver transmits 3 duplicate ACKs (i.e. it received 3 additional segments without getting the one expected), the sender re-transmits the missing segment, then:
  - sets ssthresh to 0.5*cwnd – so enters the congestion avoidance phase
  - sets cwnd = 0.5*cwnd + 3 – the 3 dup ACKs
  - increases cwnd by 1 segment for each further duplicate ACK
  - keeps sending new data if allowed by cwnd
  - sets cwnd to half the original value on a new ACK – no need to go into “slow start” again
- At steady state, cwnd oscillates around the optimal window size
- With a retransmission timeout, slow start is triggered again
(A sketch of this window arithmetic follows below.)
[Plot: cwnd vs time, as on the previous slides.]
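The window arithmetic just described, as a compact sketch (an illustration of NewReno-style fast recovery, not code from the talk):

  def on_triple_dup_ack(cwnd):
      # Fast retransmit: halve ssthresh, inflate cwnd by the 3 dup ACKs
      ssthresh = 0.5 * cwnd
      return 0.5 * cwnd + 3, ssthresh

  def on_extra_dup_ack(cwnd):
      # Each further duplicate ACK means another segment has left the network
      return cwnd + 1

  def on_new_ack(ssthresh):
      # Recovery complete: deflate cwnd to half the original value
      return ssthresh

  cwnd, ssthresh = on_triple_dup_ack(32.0)  # -> cwnd 19, ssthresh 16
  cwnd = on_extra_dup_ack(cwnd)             # -> 20
  cwnd = on_new_ack(ssthresh)               # -> 16, continue in congestion avoidance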
Slide 18
TCP: Simple Tuning – Filling the Pipe
- Remember, TCP has to hold a copy of the data in flight
- The optimal (TCP buffer) window size depends on the end-to-end bandwidth, i.e. min(BW of the links), AKA the bottleneck bandwidth, and on the Round Trip Time (RTT)
- The number of bytes in flight to fill the entire path is the Bandwidth×Delay Product: BDP = RTT × BW
- Sizing the buffers to the BDP can increase throughput by orders of magnitude (a worked example follows below)
- Windows are also used for flow control
[Diagram: sender and receiver one RTT apart; segment time on wire = bits in segment / BW.]
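A worked example of the BDP rule (a sketch; the path parameters are those of the London–Chicago–London path used later in this talk, and reproduce the 22 Mbyte socket size quoted on the SC2004 slide):

  rtt = 0.177                 # s, London-Chicago-London
  bw = 1e9                    # bit/s
  bdp_bytes = rtt * bw / 8    # bytes in flight needed to fill the path

  print(f"BDP = {bdp_bytes / 1e6:.1f} Mbytes")  # -> BDP = 22.1 Mbytes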
Slide 19
Standard TCP (Reno) – What’s the problem?
TCP has 2 phases:
- Slowstart: probe the network to estimate the available BW; exponential growth
- Congestion Avoidance: the main data transfer phase – the transfer rate grows “slowly”
AIMD and high bandwidth, long distance networks: the poor performance of TCP in high bandwidth wide area networks is due in part to the TCP congestion control algorithm:
- for each ACK in an RTT without loss: cwnd → cwnd + a/cwnd (Additive Increase, a = 1)
- for each window experiencing loss: cwnd → cwnd − b·cwnd (Multiplicative Decrease, b = 1/2)
Packet loss is a killer!!
Slide 20
TCP (Reno) – Details of problem #1
The time for TCP to recover its throughput from one lost 1500-byte packet is given by

  τ = C · RTT² / (2 · MSS)

which for an rtt of ~200 ms @ 1 Gbit/s is ~28 min (a one-line check follows below).
[Plot: time to recover (s, log scale) vs rtt (ms) for 10 Mbit, 100 Mbit, 1 Gbit, 2.5 Gbit and 10 Gbit/s links.]
Examples at 1 Gbit/s: UK 6 ms → 1.6 s; Europe 25 ms → 26 s; USA 150 ms → 28 min.
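The relation as a one-line check (a sketch; MSS taken as 1500 bytes – after a loss, cwnd grows by one MSS per RTT, so recovering the lost half of the window takes C·RTT/(2·MSS) round trips):

  rtt = 0.200              # s
  C = 1e9                  # link capacity, bit/s
  mss = 1500 * 8           # segment size, bits

  tau = C * rtt**2 / (2 * mss)   # recovery time, s
  print(f"recovery time = {tau:.0f} s = {tau / 60:.1f} min")
  # -> recovery time = 1667 s = 27.8 min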
Slide 21
Investigation of new TCP Stacks
- The AIMD algorithm – Standard TCP (Reno): for each ACK in an RTT without loss, cwnd → cwnd + a/cwnd (Additive Increase, a = 1); for each window experiencing loss, cwnd → cwnd − b·cwnd (Multiplicative Decrease, b = 1/2)
- High Speed TCP: a and b vary depending on the current cwnd, using a table; a increases more rapidly with larger cwnd – the flow returns to the ‘optimal’ cwnd size for the network path sooner; b decreases less aggressively and, as a consequence, so does the cwnd – the effect is that there is not such a decrease in throughput
- Scalable TCP: a and b are fixed adjustments for the increase and decrease of cwnd; a = 1/100 – the increase is greater than for TCP Reno; b = 1/8 – the decrease on loss is less than for TCP Reno; scalable over any link speed (a toy comparison follows below)
- Fast TCP: uses the round-trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput
- Others: HSTCP-LP, H-TCP, BiC-TCP
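A toy per-ACK/per-loss update rule capturing the Reno vs Scalable difference (a sketch only – real stacks, and in particular HSTCP’s table of a and b values, are far more involved):

  def reno(cwnd, loss):
      # AIMD: +a/cwnd per ACK (a=1); multiply by (1-b) on loss (b=1/2)
      return cwnd * 0.5 if loss else cwnd + 1.0 / cwnd

  def scalable(cwnd, loss):
      # +0.01 per ACK regardless of cwnd; lose only 1/8 of the window on loss
      return cwnd * (1.0 - 1.0 / 8) if loss else cwnd + 0.01

  # After one loss at cwnd = 16667 packets (1 Gbit/s, 200 ms RTT):
  # Reno needs ~8300 RTTs (~28 min) to climb back; Scalable needs ~13 RTTs
  # (~2.7 s), independent of the window size - hence "scalable over any link speed".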
Slide 22
Let’s check out this theory about new TCP stacks.
Does it matter? Does it work?
Slide 23
Problem #1
Packet loss – is it important?
Slide 24
Packet Loss with new TCP Stacks
- The TCP response function: throughput vs loss rate – the further to the right, the faster the recovery
- Packets are dropped in the kernel
- MB-NG rtt 6 ms; DataTAG rtt 120 ms
Slide 25
Packet Loss and new TCP Stacks
- TCP response function on UKLight London–Chicago–London, rtt 177 ms, 2.6.6 kernel
- Agreement with theory is good
- Some new stacks are good at high loss rates
[Plots: sculcc1-chi-2 iperf 13Jan05 – TCP achievable throughput (Mbit/s, log and linear scales) vs packet drop rate (1 in n) for stacks A0 1500 (standard), A1 HSTCP, A2 Scalable, A3 HTCP, A5 BICTCP, A7 Vegas, A8 Westwood, with theory curves for standard and Scalable TCP.]

Slide 26
High Throughput Demonstrations
[Diagram: man03 (Manchester) and lon01 (London / “Chicago”), dual 2.2 GHz Xeon PCs, each on 1 GEth into Cisco 7609s at the edges of the MB-NG core (Cisco GSRs, 2.5 Gbit SDH); Manchester rtt 6.2 ms, (Geneva) rtt 128 ms.]
- Send data with TCP
- Drop packets
- Monitor TCP with Web100
Slide 27
High Performance TCP – MB-NG
- Drop 1 in 25,000; rtt 6.2 ms; recover in 1.6 s
- Stacks tested: Standard, HighSpeed, Scalable
Slide 28
High Performance TCP – DataTAG
- Different TCP stacks tested on the DataTAG network, rtt 128 ms, drop 1 in 10^6
- HighSpeed: rapid recovery
- Scalable: very fast recovery
- Standard: recovery would take ~20 mins
Slide 29
FAST demo via OMNInet and DataTAG
FAST demo: Cheng Jin, David Wei (Caltech); J. Mambretti, F. Yeh (Northwestern); A. Adriaanse, C. Jin, D. Wei (Caltech); S. Ravot (Caltech/CERN)
[Diagram: workstations at NU-E (Leverone) connect by 2 × GE to Nortel Passport 8600 switches and a photonic switch on OMNInet to StarLight, Chicago (layer 2 path); from StarLight, Alcatel 1670s carry 10GE / OC-48 over the 7,000 km DataTAG link to a CERN Cisco 7609 and workstations at CERN, Geneva, with a Caltech Cisco 7609 providing the layer 2/3 path; the FAST display is in San Diego.]
Slide 30
FAST TCP vs newReno
- Traffic flow channel #1: newReno – utilization 70%
- Traffic flow channel #2: FAST – utilization 90%
Slide 31
Problem #2
Is TCP fair?
Look at Round Trip Times & Maximum Transfer Unit
Slide 32
MTU and Fairness
- Two TCP streams share a 1 Gb/s bottleneck, RTT = 117 ms
- MTU = 3000 bytes: average throughput over a period of 7000 s = 243 Mb/s
- MTU = 9000 bytes: average throughput over a period of 7000 s = 464 Mb/s
- Link utilization: 70.7%
[Diagram: host #1 and host #2 at CERN (GVA), each on 1 GE into a GbE switch, across a POS 2.5 Gbps bottleneck to host #1 and host #2 at Starlight (Chi).]
[Plot: throughput (Mbps) of the two streams with different MTU sizes sharing the 1 Gbps bottleneck vs time (s), with per-stream traces and the averages over the life of the connection for MTU = 3000 bytes and MTU = 9000 bytes.]
Sylvain Ravot, DataTAG 2003
Slide 33
RTT and Fairness
[Diagram: hosts at CERN (GVA) on 1 GE into a GbE switch; a POS 2.5 Gb/s bottleneck to Starlight (Chi), then POS 10 Gb/s and 10GE on to Sunnyvale.]
- Two TCP streams share a 1 Gb/s bottleneck
- CERN ↔ Sunnyvale, RTT = 181 ms: average throughput over a period of 7000 s = 202 Mb/s
- CERN ↔ Starlight, RTT = 117 ms: average throughput over a period of 7000 s = 514 Mb/s
- MTU = 9000 bytes; link utilization = 71.6%
[Plot: throughput (Mbps) of the two streams with different RTT sharing the 1 Gbps bottleneck vs time (s), with per-stream traces and the averages over the life of the connection for RTT = 181 ms and RTT = 117 ms.]
Sylvain Ravot, DataTAG 2003
Slide 34
Problem #n
Do TCP flows share the bandwidth?
Slide 35
Test of TCP Sharing: Methodology (1 Gbit/s)
- Chose 3 paths from SLAC (California): Caltech (10 ms), Univ Florida (80 ms), CERN (180 ms)
- Used iperf/TCP and UDT/UDP to generate traffic
- Each run was 16 minutes, in 7 regions (2 min / 4 min pattern)
[Diagram: iperf or UDT traffic and 1/s ping probes from SLAC to Caltech/UFL/CERN across the TCP/UDP bottleneck, with the ICMP/ping traffic measured alongside.]
Les Cottrell, PFLDnet 2005
Slide 36
TCP Reno single stream
- Low performance on fast long distance paths: AIMD (add a = 1 packet to cwnd per RTT; decrease cwnd by factor b = 0.5 on congestion)
- Net effect: recovers slowly, does not effectively use the available bandwidth, so poor throughput; unequal sharing
- Congestion has a dramatic effect, and recovery is slow – the remedy is to increase the recovery rate
- The RTT increases when the flow achieves its best throughput
- The remaining flows do not take up the slack when a flow is removed
[Plot: SLAC to CERN single-stream throughput.]
Les Cottrell, PFLDnet 2005
Slide 37
Fast TCP
- As well as packet loss, FAST uses the RTT to detect congestion
- The RTT is very stable: σ(RTT) ~ 9 ms vs 37±0.14 ms for the others
- Big drops in throughput which take several seconds to recover from
- The 2nd flow never gets an equal share of the bandwidth
[Plot: SLAC–CERN.]
Slide 38
Hamilton TCP
- One of the best performers: throughput is high, there are big effects on the RTT when it achieves best throughput, and flows share equally
- Appears to need >1 flow to achieve best throughput; two flows share equally
- >2 flows appears less stable
[Plot: SLAC–CERN.]
Slide 39
Problem #n+1
To SACK or not to SACK?
Slide 40
The SACK Algorithm
- SACK rationale: non-contiguous blocks of data can be ACKed, so the sender transmits just the lost packets; this helps when multiple packets are lost in one TCP window
- The SACK processing is inefficient for large bandwidth-delay products: the sender write queue (a linked list) is walked for each SACK block, to mark lost packets, and to re-transmit
- The processing takes so long that the input queue becomes full and timeouts occur
[Plots: HS-TCP at rtt 150 ms with standard SACKs vs updated SACKs; hardware: Dell 1650, 2.8 GHz, PCI-X 133 MHz, Intel Pro/1000.]
Doug Leith, Yee-Ting Li
Slide 41
SACK …
- Look into what’s happening at the algorithmic level with Web100
- Strange hiccups in cwnd: the only correlation is with SACK arrivals
[Plot: Scalable TCP on MB-NG with 200 Mbit/s CBR background.]
Yee-Ting Li
Slide 42
Real Applications on Real Networks
- Disk-2-disk applications on real networks: memory-2-memory tests; transatlantic disk-2-disk at Gigabit speeds; HEP & VLBI at SC|05
- Remote computing farms: the effect of TCP; the effect of distance
- Radio astronomy e-VLBI: left for the talk later in the meeting
Slide 43
iperf Throughput + Web100
- SuperMicro on the MB-NG network, HighSpeed TCP: line speed 940 Mbit/s; DupACKs? <10 (expect ~400)
- BaBar on the production network, Standard TCP: 425 Mbit/s; DupACKs 350–400 – re-transmits
Slide 44
Applications: Throughput Mbit/s
- HighSpeed TCP, 2 GByte file, RAID5, SuperMicro + SuperJANET
- Applications compared: bbcp, bbftp, Apache, GridFTP
- Previous work used RAID0 (so was not disk limited)
Slide 45
Transatlantic Disk to Disk Transfers
With UKLight
SuperComputing 2004
Slide 46
bbftp: What else is going on? (Scalable TCP)
- SuperMicro + SuperJANET: instantaneous rate 0–550 Mbit/s
- Congestion window – duplicate ACKs
- Throughput variation not TCP related? Disk speed / bus transfer? Application architecture?
- BaBar + SuperJANET: instantaneous rate 200–600 Mbit/s
- Disk–mem ~590 Mbit/s – remember the end host
Slide 47
SC2004 UKLight Overview
[Diagram: MB-NG 7600 OSR in Manchester connects to ULCC UKLight, with UCL HEP and the UCL network attached via K2 and Ci switches; UKLight 10G runs to Chicago Starlight and, via Surfnet/EuroLink 10G (two 1GE channels), to Amsterdam; UKLight 10G (four 1GE channels) reaches the SC2004 floor, where the NLR Lambda NLR-PITT-STAR-10GE-16 links the Caltech booth (UltraLight IP, Caltech 7600) and the SLAC booth (Cisco 6509).]
Slide 48
Transatlantic Ethernet: TCP Throughput Tests
- Supermicro X5DPE-G2 PCs: dual 2.9 GHz Xeon CPU, FSB 533 MHz; 1500 byte MTU; 2.6.6 Linux kernel
- Memory-to-memory TCP throughput, Standard TCP: wire-rate throughput of 940 Mbit/s
- Work in progress to study: implementation detail, advanced stacks, effect of packet loss, sharing
[Web100 plots: TCP achievable throughput (Mbit/s) and Cwnd vs time (ms) – InstantaneousBW, AveBW, CurCwnd – for the full run and for the first 10 seconds.]
Slide 49
SC2004 Disk-Disk bbftp
- The bbftp file transfer program uses TCP/IP
- UKLight path: London–Chicago–London; PCs: Supermicro + 3Ware RAID0; MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off
- Move a 2 GByte file; Web100 plots:
- Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
- Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, with ~4.5 s of overhead)
[Web100 plots: Disk-TCP-Disk at 1 Gbit/s – TCP achievable throughput (Mbit/s) and Cwnd vs time (ms); InstantaneousBW, AveBW, CurCwnd.]
Slide 50
Network & Disk Interactions (work in progress)
Hosts:
- Supermicro X5DPE-G2 motherboards; dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory
- 3Ware 8506-8 controller on a 133 MHz PCI-X bus configured as RAID0; six 74.3 GByte Western Digital Raptor WD740 SATA disks; 64 kbyte stripe size
- Measure memory to RAID0 transfer rates with & without UDP traffic
[Plots: RAID0 write throughput (Mbit/s) vs trial number for 1 GByte writes with 64k blocks on the 3w8506-8, alone and with concurrent 1500-MTU and 9000-MTU UDP streams; plus % CPU kernel mode scatter plots for 8k and 64k blocks with fits y = −1.017x + 178.32 and y = −1.0479x + 174.44 (≈ y = 178 − 1.05x).]
- Disk write alone: 1735 Mbit/s
- Disk write + 1500 MTU UDP: 1218 Mbit/s – a drop of 30%
- Disk write + 9000 MTU UDP: 1400 Mbit/s – a drop of 19%
Slide 51
Transatlantic Transfers
With UKLight
SuperComputing 2005
Slide 52
ESLEA and UKLight
- 6 × 1 Gbit transatlantic Ethernet layer 2 paths, UKLight + NLR
- Disk-to-disk transfers with bbcp, Seattle to UK
- Set the TCP buffer and application to give ~850 Mbit/s; one stream of data ran at 840–620 Mbit/s
- Streamed UDP VLBI data, UK to Seattle, at 620 Mbit/s
[Plots: rate (Mbit/s) vs time of day (16:00–23:00) at SC|05 for hosts sc0501–sc0504, and the aggregate UKLight traffic reaching ~4.5 Gbit/s, with the reverse TCP direction marked.]
Slide 53
SC|05 – SLAC 10 Gigabit Ethernet
- 2 lightpaths: one routed over ESnet, one layer 2 over UltraScience Net
- 6 Sun V20Z systems per λ: 3 transmit, 3 receive
- dCache remote disk data access: 100 processes per node; each node sends or receives; one data stream is 20–30 Mbit/s
- Used Neterion NICs & Chelsio TOE
- Data also sent to StorCloud using fibre channel links
- Traffic on the 10 GE link for 2 nodes: 3–4 Gbit per node, 8.5–9 Gbit on the trunk
Slide 54
Remote Computing Farms in the ATLAS TDAQ Experiment
Slide 55
ATLAS Remote Farms – Network Connectivity
Slide 56
ATLAS Application Protocol
- Event request: the EFD requests an event from the SFI; the SFI replies with the event (~2 Mbytes)
- Processing of the event
- Return of the computation: the EF asks the SFO for buffer space; the SFO sends OK; the EF transfers the results of the computation
- tcpmon, an instrumented TCP request-response program, emulates the Event Filter EFD to SFI communication (a sketch of the idea follows below)
[Diagram: the Event Filter Daemon (EFD) exchanges “Request event” / “Send event data” messages with the SFI and “Request buffer” / “Send OK” / “Send processed event” messages with the SFO over time; the request-response time is histogrammed.]
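In the spirit of tcpmon (an illustrative sketch, not the actual tool; the host name and port are placeholders, and the 64-byte request / ~2-Mbyte response sizes follow the slides):

  import socket, time

  def request_response(host, port, resp_bytes=2_000_000):
      # Send a 64-byte request, read a fixed-size response, return elapsed seconds
      s = socket.create_connection((host, port))
      t0 = time.time()
      s.sendall(b"R" * 64)
      got = 0
      while got < resp_bytes:          # read the ~2 Mbyte event
          chunk = s.recv(65536)
          if not chunk:
              break
          got += len(chunk)
      s.close()
      return time.time() - t0

  # e.g. histogram request_response("sfi.example.org", 9000) over many events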
Slide 57
tcpmon: TCP Activity Manc-CERN Req-Resp
- Round trip time 20 ms; 64-byte request (green), 1-Mbyte response (blue)
- TCP in slow start: the 1st event takes 19 rtt or ~380 ms
[Web100 plot: DataBytesOut / DataBytesIn (deltas) vs time.]
[Web100 plot: DataBytesOut / DataBytesIn (deltas) and CurCwnd vs time (ms).]
- The TCP congestion window gets re-set on each request: the TCP stack implements RFC 2581 & RFC 2861, reduction of Cwnd after inactivity
- Even after 10 s, each response takes 13 rtt or ~260 ms
[Web100 plot: TCP achievable throughput (Mbit/s) and Cwnd vs time (ms).]
- Transfer achievable throughput: 120 Mbit/s (an aside on the relevant kernel setting follows below)
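As an aside (not from the talk): on Linux kernels from 2.6.18 onwards, the RFC 2861 decay of cwnd after idle that causes this reset can be switched off; a sketch, to be run as root:

  # Assumption: Linux >= 2.6.18; writing 0 disables slow start after an idle period
  with open("/proc/sys/net/ipv4/tcp_slow_start_after_idle", "w") as f:
      f.write("0\n")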
Slide 58
tcpmon: TCP Activity Manc-CERN Req-Resp – TCP stack tuned
- Round trip time 20 ms; 64-byte request (green), 1-Mbyte response (blue)
- TCP starts in slow start: the 1st event takes 19 rtt or ~380 ms
[Web100 plots: DataBytesOut / DataBytesIn (deltas); TCP achievable throughput (Mbit/s) with Cwnd; and packets out/in with CurCwnd – all vs time (ms).]
- The TCP congestion window grows nicely; a response takes 2 rtt after ~1.5 s
- Rate ~10/s (with the 50 ms wait)
- Transfer achievable throughput grows to 800 Mbit/s
- Data is transferred WHEN the application requires the data
Slide 59
tcpmon: TCP Activity Alberta-CERN Req-Resp – TCP stack tuned
- Round trip time 150 ms; 64-byte request (green), 1-Mbyte response (blue)
- TCP starts in slow start: the 1st event takes 11 rtt or ~1.67 s
- The TCP congestion window is in slow start to ~1.8 s, then congestion avoidance
- Response in 2 rtt after ~2.5 s; rate 2.2/s (with the 50 ms wait)
- Transfer achievable throughput grows slowly from 250 to 800 Mbit/s
[Web100 plots: DataBytesOut / DataBytesIn (deltas); TCP achievable throughput (Mbit/s) with Cwnd; and packets out/in with CurCwnd – all vs time (ms).]
Slide 60
Summary & Conclusions
- Standard TCP is not optimum for high throughput long distance links
- Packet loss is a killer for TCP: check campus links & equipment and the access links to backbones; users need to collaborate with the campus network teams; Dante PERT can help
- New stacks are stable and give better response & performance – but you still need to set the TCP buffer sizes! (a sketch follows below) Check other kernel settings, e.g. the window-scale maximum, and watch for “TCP stack implementation enhancements”
- TCP tries to be fair: a large MTU has an advantage; short distances (small RTT) have an advantage
- TCP does not share bandwidth well with other streams
- The end hosts themselves: plenty of CPU power is required for the TCP/IP stack as well as the application; packets can be lost in the IP stack due to lack of processing power; the interaction between HW, protocol processing and the disk sub-system is complex
- Application architecture & implementation are also important: the TCP protocol dynamics strongly influence the behaviour of the application
- Users are now able to perform sustained 1 Gbit/s transfers
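For the “set the TCP buffer sizes” point, a minimal sketch (illustrative values: the ~22 Mbyte BDP computed earlier; note the kernel also caps these via its own limits, e.g. net.core.rmem_max / wmem_max on Linux):

  import socket

  BDP = 22 * 1024 * 1024   # target buffer: bandwidth*delay product, bytes

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  # Set before connect() so the window scale is negotiated in the SYN
  s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BDP)
  s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BDP)
  print("send buffer now:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))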
Slide 61
More Information – Some URLs 1
- UKLight web site: http://www.uklight.ac.uk
- MB-NG project web site: http://www.mb-ng.net/
- DataTAG project web site: http://www.datatag.org/
- UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
- Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
- “Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards”, FGCS special issue 2004: http://www.hep.man.ac.uk/~rich/
- TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
- TCP stack comparisons: “Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks”, Journal of Grid Computing 2004
- PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
- Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
Slide 62
More Information – Some URLs 2
- Lectures, tutorials etc. on TCP/IP: www.nv.cc.va.us/home/joney/tcp_ip.htm; www.cs.pdx.edu/~jrb/tcpip.lectures.html; www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS; www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm; www.cis.ohio-state.edu/htbin/rfc/rfc1180.html; www.jbmelectronics.com/tcp.htm
- Encyclopaedia: http://www.freesoft.org/CIE/index.htm
- TCP/IP resources: www.private.org.il/tcpip_rl.html
- Understanding IP addresses: http://www.3com.com/solutions/en_US/ncs/501302.html
- Configuring TCP (RFC 1122): ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
- Assigned protocols, ports etc (RFC 1010): http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols
Slide 63
Any Questions?
Slide 64
Backup Slides
Slide 65
Latency Measurements
- UDP/IP packets sent between back-to-back systems: processed in a similar manner to TCP/IP, but not subject to the flow control & congestion avoidance algorithms; used the UDPmon test program
- Latency: round trip times measured using request-response UDP frames, as a function of frame size
- The slope is given by the data paths traversed: s = Σ_paths 1/(db/dt), i.e. mem–mem copy(s) + PCI + Gig Ethernet + PCI + mem–mem copy(s)
- The intercept indicates processing times + HW latencies
- Histograms of ‘singleton’ measurements tell us about the behaviour of the IP stack, the way the HW operates, and interrupt coalescence
Slide 66
Throughput Measurements
UDP throughput: send a controlled stream of UDP frames spaced at regular intervals (a paced-sender sketch follows below).
[Diagram: the sender zeroes the remote statistics, then sends n-byte data frames at regular intervals (a set number of packets with a fixed wait time), signals the end of the test, and collects the remote statistics: number received, number lost + loss pattern, number out-of-order, CPU load & number of interrupts, and 1-way delay; the times to send and to receive are recorded and the inter-packet time is histogrammed.]
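A stripped-down paced sender in this style (a sketch in the spirit of UDPmon, not the tool itself; the destination address and spacing are placeholders):

  import socket, time

  def send_stream(dst=("192.0.2.1", 5001), n_bytes=1472, n_pkts=10000, spacing_us=12.0):
      # Send n_pkts UDP frames of n_bytes, spaced spacing_us apart
      s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      payload = b"\0" * n_bytes
      gap = spacing_us / 1e6
      t0 = time.perf_counter()
      next_t = t0
      for _ in range(n_pkts):
          s.sendto(payload, dst)
          next_t += gap
          while time.perf_counter() < next_t:  # busy-wait for precise spacing
              pass
      return time.perf_counter() - t0

  # wire rate ~ n_pkts * (n_bytes + 28 + 18) * 8 / elapsed  (UDP/IP + Ethernet overhead)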
Slide 67
PCI Bus & Gigabit Ethernet Activity
PCI activity measured with a logic analyser:
- PCI probe cards in the sending PC
- Gigabit Ethernet fibre probe card
- PCI probe cards in the receiving PC
[Diagram: possible bottlenecks – CPU, memory and chipset in each host, the NICs on their PCI buses, and the Gigabit Ethernet link between them, all feeding the logic analyser display.]
Slide 68
Network switch limits behaviour
- End-to-end UDP packets from udpmon
- Only 700 Mbit/s throughput
- Lots of packet loss
- The packet loss distribution shows the throughput is limited
[Plots: w05gva-gig6_29May04_UDP – received wire rate (Mbit/s) and % packet loss vs spacing between frames (µs) for packet sizes 50–1472 bytes; and 1-way delay (µs) vs packet number at 12 µs spacing.]
Slide 69
“Server Quality” Motherboards
- SuperMicro P4DP8-2G (P4DP6): dual Xeon, 400/533 MHz front side bus
- 6 PCI / PCI-X slots on 4 independent PCI buses: 64-bit 66 MHz PCI, 100 MHz PCI-X, 133 MHz PCI-X
- Dual Gigabit Ethernet
- Adaptec AIC-7899W dual-channel SCSI
- UDMA/100 bus-master EIDE channels: data transfer rates of 100 MB/sec burst
Slide 70
“Server Quality” Motherboards
- Boston/Supermicro H8DAR: two dual-core Opterons
- 200 MHz DDR memory; theory BW: 6.4 Gbit; HyperTransport
- 2 independent PCI buses: 133 MHz PCI-X
- 2 × Gigabit Ethernet; SATA; (PCI-e)
Slide 71
10 Gigabit Ethernet: UDP Throughput
- A 1500-byte MTU gives ~2 Gbit/s; used a 16144-byte MTU, max user length 16080
- DataTAG Supermicro PCs: dual 2.2 GHz Xeon CPU, FSB 400 MHz, PCI-X mmrbc 512 bytes – wire-rate throughput of 2.9 Gbit/s
- CERN OpenLab HP Itanium PCs: dual 1.0 GHz 64-bit Itanium CPU, FSB 400 MHz, PCI-X mmrbc 4096 bytes – wire rate of 5.7 Gbit/s
- SLAC Dell PCs: dual 3.0 GHz Xeon CPU, FSB 533 MHz, PCI-X mmrbc 4096 bytes – wire rate of 5.4 Gbit/s
[Plot: an-al 10GE Xsum 512kbuf MTU16114 27Oct03 – received wire rate (Mbit/s) vs spacing between frames (µs) for packet sizes from 1472 to 16080 bytes.]
Slide 72
10 Gigabit Ethernet: Tuning PCI-X
- 16080-byte packets every 200 µs; Intel PRO/10GbE LR adapter
- PCI-X bus occupancy vs mmrbc (max memory read byte count): measured times, and times based on PCI-X timings from the logic analyser
- Expected throughput ~7 Gbit/s; measured 5.7 Gbit/s at mmrbc 4096 bytes (also measured at 512, 1024 and 2048 bytes)
[Logic analyser trace: CSR access, PCI-X sequence, data transfer, interrupt & CSR update; kernel 2.6.1 #17, HP Itanium, Intel 10GE, Feb04.]
[Plots: PCI-X transfer time (µs) vs max memory read byte count for the HP Itanium and DataTAG Xeon 2.2 GHz hosts, with the measured rate (Gbit/s), the rate from the expected time (Gbit/s), and the max PCI-X throughput.]
Slide 73
Congestion control: ACK clocking
Slide 74
End Hosts & NICs: CERN-nat-Manc
- Use UDP packets to characterise the host, NIC & network: request-response latency, throughput, packet loss, re-ordering
- SuperMicro P4DP8 motherboard: dual Xeon 2.2 GHz CPU; 400 MHz system bus; 64-bit 66 MHz PCI / 133 MHz PCI-X bus
[Plots: pcatb121-nat-gig6_13Aug04 – received wire rate (Mbit/s), % packet loss and number of re-ordered packets vs spacing between frames (µs) for packet sizes 50–1472 bytes; latency histograms N(t) for 256-, 512- and 1400-byte packets over 20900–21500 µs.]
- The network can sustain 1 Gbps of UDP traffic
- The average server can lose smaller packets: the loss is caused by lack of processing power in the PC receiving the traffic
- Out-of-order packets are due to WAN routers; lightpaths look like extended LANs and show no re-ordering
Slide 75
tcpdump / tcptrace
- tcpdump: dump all TCP header information for a specified source/destination – ftp://ftp.ee.lbl.gov/
- tcptrace: format tcpdump output for analysis using xplot – http://www.tcptrace.org/
- NLANR TCP Testrig: a nice wrapper for the tcpdump and tcptrace tools – http://www.ncne.nlanr.net/TCP/testrig/
Sample use:
  tcpdump -s 100 -w /tmp/tcpdump.out host hostname
  tcptrace -Sl /tmp/tcpdump.out
  xplot /tmp/a2b_tsg.xpl
Slide 76
tcptrace and xplot
- The X axis is time; the Y axis is sequence number
- The slope of this curve gives the throughput over time
- The xplot tool makes it easy to zoom in
Slide 77
Zoomed In View
- Green line: ACK values received from the receiver
- Yellow line: tracks the receive window advertised by the receiver
- Green ticks: track the duplicate ACKs received
- Yellow ticks: track window advertisements that were the same as the last advertisement
- White arrows: segments sent
- Red arrows (R): retransmitted segments
Slide 78
TCP Slow Start