Internet data transfer record between CERN and California
Sylvain Ravot (Caltech)
Paolo Moroni (CERN)
12 May 2003, Ravot - Moroni, RIPE-45 meeting - Barcelona
(Slide 2)
Summary
- Internet2 Land Speed Record (LSR) contest
- New LSR
- DataTAG project and network configuration
- Establishing an LSR: hardware and tuning
- Conclusions
- Acknowledgements
(Slide 3)
Internet2 LSR Contest (I)
From http://lsr.internet2.edu:
- "A minimum of 100 megabytes must be transferred a minimum terrestrial distance of 100 kilometers with a minimum of two router hops in each direction between the source node and the destination node across one or more operational and production-oriented high-performance research and education networks."
- "The contest unit of measurement is […] bit-meters/second."
(Slide 4)
Internet2 LSR Contest (II)
- "Instances of all hardware units and software modules used to transfer contest data on the source node, the destination node, the links, and the routers must be offered for commercial sale or as open source software to all U.S. members of the Internet community by their respective vendors or developers prior to or immediately after winning the contest."
- Award classes: single or multiple TCP streams, on top of IPv4 or IPv6
- Generally available networks and equipment, vs. lab prototypes
(Slide 5)
Former LSR
- TCP/IPv4 single stream
- By NIKHEF, Caltech and SLAC
- Established on November 19th 2002
- 10978 km of network: Geneva-Amsterdam-Chicago-Sunnyvale
- 0.93 Gb/sec
- 10136.15 terabit-meters/second
(Slide 6)
Current LSR
- TCP/IPv4 single stream
- By Caltech, CERN, LANL and SLAC, within the DataTAG project framework
- Established on February 27th-28th 2003 by Sylvain Ravot (Caltech) using IPERF
- 10037 km of network: Geneva-Chicago-Sunnyvale (shorter distance than the former LSR)
- 2.38 Gb/sec (sustained: a Terabyte of data moved in an hour)
- 23888 terabit-meters/second
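The contest metric is throughput multiplied by distance. The sketch below recomputes it from the figures quoted on the two record slides; the function names are illustrative only. With the rounded 0.93 Gb/sec figure the former record comes out slightly above the quoted 10136.15 terabit-meters/second, presumably because the quoted rate is rounded.

```python
def lsr_metric_tbm_per_s(throughput_gbps, distance_km):
    """Contest metric in terabit-meters/second (bits/s multiplied by meters, scaled by 1e12)."""
    bits_per_second = throughput_gbps * 1e9
    meters = distance_km * 1e3
    return bits_per_second * meters / 1e12


def terabytes_moved(throughput_gbps, seconds):
    """Data volume in terabytes (10**12 bytes) moved at a sustained rate."""
    return throughput_gbps * 1e9 * seconds / 8 / 1e12


# Figures quoted on the slides.
current = lsr_metric_tbm_per_s(2.38, 10037)   # Geneva-Chicago-Sunnyvale
former = lsr_metric_tbm_per_s(0.93, 10978)    # Geneva-Amsterdam-Chicago-Sunnyvale
per_hour = terabytes_moved(2.38, 3600)

print(f"current LSR: {current:.0f} terabit-meters/second")
print(f"former LSR (from rounded rate): {former:.0f} terabit-meters/second")
print(f"data moved in one hour at 2.38 Gb/sec: {per_hour:.2f} TB")
```

The one-hour figure is consistent with the slide's "a Terabyte of data moved in an hour".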
(Slide 7)
DataTAG project
- Full project title: "Research and technological development for a transatlantic GRID"
- IST project (EU funded), supported by the NSF and the DoE (Caltech): http://www.datatag.org
- Partners: PPARC (UK), INRIA (FR), University of Amsterdam (NL), INFN (IT) and CERN (CH)
- Researchers also from Caltech, Los Alamos, SLAC and Canada
- Test-bed kernel: transatlantic STM-16 (T-Systems) between Geneva (CERN) and Chicago (StarLight), with interconnected workstations at each side
- Test-bed extensions provided by GEANT, SURFnet, VTHD and other partners, in Europe and North America
(Slide 8)
DataTAG as test-bed for LSR
- Research on TCP as part of the DataTAG programme
- The DataTAG test-bed was the main environment for the LSR
- Network extension kindly made available by Level3 Communications, Inc.: Chicago-Sunnyvale STM-64
- Router at Sunnyvale kindly lent by Cisco
- PC 10 GbE interfaces kindly made available as pre-release product by Intel
(Slide 9)
LSR network configuration
[Diagram: sites, links and equipment, listed here as labelled on the slide]
- Sites: CERN, Geneva (Switzerland); StarLight, Chicago (Illinois, USA); Level3 PoP, Sunnyvale (California, USA)
- Links: STM-16 (T-Systems); STM-64 (Level3 loan); 10 GbE
- Routers: Cisco 7606; Cisco 7609; Juniper T640 (TeraGrid); Cisco 12406 (Cisco loan)
- End hosts: two PCs with 10 GbE interfaces (Intel loan)
- Network label: DataTAG network
(Slide 10)
Establishing an LSR: hardware (I)
- No LSR without good hardware
- A lot of bandwidth: minimum 2.5 Gb/sec on the whole path (thanks to Level3 for the STM-64 on loan between Chicago and Sunnyvale)
- Powerful routers (Cisco 7600 and GSR, Juniper T640)
- Powerful Linux PCs on both sides
- Intel 10 GbE interfaces
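To see where the "minimum 2.5 Gb/sec" figure comes from, the following sketch derives the nominal SDH line rates from the STM-1 base rate of 155.52 Mbit/s and picks the slowest segment of the path. The dictionary keys are illustrative labels; the rates are gross line rates, so the usable payload is somewhat lower.

```python
STM1_MBPS = 155.52  # SDH STM-1 line rate in Mbit/s; STM-N is N times this rate


def stm_rate_gbps(n):
    """Nominal line rate of an STM-n circuit in Gbit/s (before SDH overhead)."""
    return n * STM1_MBPS / 1000.0


# Path segments as described on the slides (labels are illustrative).
links_gbps = {
    "STM-16 Geneva-Chicago": stm_rate_gbps(16),
    "STM-64 Chicago-Sunnyvale": stm_rate_gbps(64),
    "10 GbE host interfaces": 10.0,
}

bottleneck = min(links_gbps, key=links_gbps.get)
record_gbps = 2.38  # sustained rate quoted for the current LSR
share = record_gbps / links_gbps[bottleneck]

print(f"bottleneck: {bottleneck} at {links_gbps[bottleneck]:.3f} Gb/sec")
print(f"record as a share of the bottleneck line rate: {share:.1%}")
```

Under these assumptions the transatlantic STM-16 is the limiting segment, and the record rate sits close to its line rate.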
(Slide 11)
Establishing an LSR: hardware (II)
- Linux PC at CERN:
  - Dual Intel® Xeon™ processors, 2.40 GHz with 512K L2 cache
  - SuperMicro P4DP8-G2 motherboard
  - Intel E700 chipset
  - 2 GB RAM, PC2100 ECC Reg. DDR
  - Hard drive: 1 x 140 GB, Maxtor ATA-133
  - On-board Intel 82546EB dual-port Gigabit Ethernet controller
  - 4U rack-mounted server
- Linux PC at Sunnyvale:
  - Dual Intel® Xeon™ processors, 2.40 GHz with 512K L2 cache
  - SuperMicro P4DPE-G2 motherboard
  - 2 GB RAM, PC2100 ECC Reg. DDR
  - 2 x 3ware 7500-8 RAID controllers
  - 16 Western Digital IDE disk drives for RAID and 1 for system
  - 2 x Intel 82550 Fast Ethernet
  - 2 x SysKonnect Gigabit Ethernet card SK-9843 SK-NET GE SX
  - 4U rack-mounted server
  - Power: 480 W to run, 600 W to spin up
(Slide 12)
Establishing an LSR: hardware (III)
- Intel 10 GbE interfaces: Intel Pro/10GbE-LR
- Not yet commercially available when the LSR was set, but announced as commercially available shortly afterwards
(Slide 13)
Establishing an LSR: standard tuning
- MTU set to 9000 bytes
- TCP window size increased from the Linux default of 64K: essential over long distance
- Otherwise a standard Linux kernel (2.4.20)
- Standard tuning is not enough for the LSR: why?
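A rough calculation shows why both settings matter. The sketch below bounds the throughput of a single TCP stream by one window per round trip, and counts packets per second for two MTU sizes. The 180 ms round-trip time is an assumption chosen for illustration (the slides do not state the RTT), "64K" is taken as 64 KiB, and header overhead is ignored.

```python
ASSUMED_RTT_S = 0.180  # assumption: illustrative transatlantic plus cross-US RTT, not from the slides


def window_limited_throughput_mbps(window_bytes, rtt_s):
    """Upper bound on throughput when at most one window is in flight per round trip."""
    return window_bytes * 8 / rtt_s / 1e6


def packets_per_second(rate_bps, packet_bytes):
    """Packets per second needed to carry rate_bps, ignoring header overhead."""
    return rate_bps / (packet_bytes * 8)


default_mbps = window_limited_throughput_mbps(64 * 1024, ASSUMED_RTT_S)
pps_standard = packets_per_second(2.38e9, 1500)
pps_jumbo = packets_per_second(2.38e9, 9000)

print(f"64 KiB window at {ASSUMED_RTT_S * 1000:.0f} ms RTT: about {default_mbps:.1f} Mbit/s")
print(f"packets/s at 2.38 Gb/sec: {pps_standard:.0f} (MTU 1500) vs {pps_jumbo:.0f} (MTU 9000)")
```

With these assumed numbers the default window caps a single stream at a few Mbit/s, and the 9000-byte MTU cuts the per-packet work on the end hosts by a factor of six.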
(Slide 14)
TCP WAN problems
- Responsiveness to packet losses (the time needed to recover from a loss) is proportional to the square of the RTT: R = C * RTT^2 / (2 * MSS), where C is the link capacity and MSS is the maximum segment size. This makes it very difficult to take advantage of the full capacity of a long-distance WAN: not a real problem for standard traffic on a shared link, but a serious penalty for the LSR
- Slow start is "too" slow with default parameters: they are good for standard traffic, but not for the LSR
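The responsiveness formula can be evaluated directly. In the sketch below the RTT values and the 1460-byte and 8960-byte segment sizes are assumptions for illustration; only the formula and the 622 Mbit/s and 2.5 Gb/sec capacities come from the slides.

```python
def responsiveness_s(capacity_bps, rtt_s, mss_bytes=1460):
    """Recovery time R = C * RTT^2 / (2 * MSS), with MSS converted to bits."""
    return capacity_bps * rtt_s ** 2 / (2 * mss_bytes * 8)


# Assumed scenarios: (label, capacity in bit/s, RTT in s, MSS in bytes).
scenarios = [
    ("10 Mbit/s, 180 ms RTT, MSS 1460", 10e6, 0.180, 1460),
    ("622 Mbit/s, 120 ms RTT, MSS 1460", 622e6, 0.120, 1460),
    ("2.5 Gb/sec, 180 ms RTT, MSS 8960", 2.5e9, 0.180, 8960),
]

for label, capacity, rtt, mss in scenarios:
    print(f"{label}: R = {responsiveness_s(capacity, rtt, mss):.0f} s")
```

The quadratic dependence on RTT means that doubling the round-trip time quadruples the recovery time, while a larger MSS (jumbo frames) shortens it proportionally.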
(Slide 15)
Example: recovering from a packet loss
[Figure: TCP throughput CERN-Chicago over the 622 Mbit/s link. Axes: Time (s), 0 to 1600; Throughput (Mbit/s), 0 to 200. Annotation: 6 min]
- TCP reactivity: the time to increase the throughput by 120 Mbit/s is larger than 6 min for a connection between Chicago and CERN
- Packet losses are a disaster for the overall throughput
(Slide 16)
Example: slow start vs. congestion avoidance
[Figure: congestion window (cwnd) over time, marking the slow start phase, SSTHRESH and the congestion avoidance phase. Series: cwnd average of the last 10 samples; cwnd average over the life of the connection to that point]
(Slide 17)
Establishing an LSR: non-standard tuning
- The TCP stack was designed a long time ago and for much slower networks. In the responsiveness formula R = C * RTT^2 / (2 * MSS), a very small C keeps the recovery time short for any terrestrial RTT; on modern, fast WAN links C is large, so the same mechanism is "bad" for TCP performance
- TCP increases its window size until something breaks (packet loss, congestion, ...), then restarts from half of the previous value until it breaks again. This gradual approximation process takes a very long time over long distances and degrades performance
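The halve-then-grow behaviour described above can be illustrated with a deliberately simplified model: the window is counted in segments, it grows by one segment per round trip, and a single loss halves it. Delayed ACKs, slow start and timeouts are ignored, and the function names are illustrative.

```python
def rtts_to_recover(full_window_segments):
    """Round trips needed to grow back to the full window after one loss."""
    cwnd = full_window_segments // 2   # multiplicative decrease on loss
    rtts = 0
    while cwnd < full_window_segments:
        cwnd += 1                      # additive increase: one segment per RTT
        rtts += 1
    return rtts


def utilisation_during_recovery(full_window_segments):
    """Average window during recovery, as a fraction of the full window."""
    windows = range(full_window_segments // 2, full_window_segments)
    return sum(windows) / len(windows) / full_window_segments


for window in (100, 1000, 6000):
    print(
        f"full window {window} segments: {rtts_to_recover(window)} RTTs to recover, "
        f"average utilisation {utilisation_during_recovery(window):.1%}"
    )
```

In this model the recovery takes half a window's worth of round trips, so the larger the window a fast long-distance path needs, the longer each loss costs.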
(Slide 18)
Establishing an LSR: further tuning
- Knowing the available bandwidth a priori, prevent TCP from trying larger windows by restricting the amount of buffer space it may use: without buffers, it won't try to use larger windows and packet losses can be avoided
- The product C*RTT yields the optimal TCP window size for a link of capacity C
- So, allocate just enough buffers to let TCP squeeze the maximum performance from the existing bandwidth, and nothing more
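As a worked instance of the C*RTT rule, the sketch below computes the window for the 2.5 Gb/sec path capacity mentioned in the hardware slides. The 180 ms round-trip time is an assumption for illustration; the slides do not give the measured RTT.

```python
ASSUMED_RTT_S = 0.180  # assumption: illustrative RTT, not taken from the slides


def optimal_window_bytes(capacity_bps, rtt_s):
    """Bandwidth-delay product C * RTT, converted from bits to bytes."""
    return capacity_bps * rtt_s / 8


window = optimal_window_bytes(2.5e9, ASSUMED_RTT_S)
print(f"C*RTT for 2.5 Gb/sec at {ASSUMED_RTT_S * 1000:.0f} ms: "
      f"{window / 1e6:.2f} MB ({window / 2**20:.2f} MiB)")
```

A buffer of roughly this size lets one stream fill the link; a much larger one would let TCP overshoot the capacity and provoke the losses the slide wants to avoid.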
(Slide 19)
Further tuning: Linux implementation
- Tuning TCP buffers (numbers for STM-16):
    echo "4096 87380 128388607" > /proc/sys/net/ipv4/tcp_rmem
    echo "4096 65530 128388607" > /proc/sys/net/ipv4/tcp_wmem
    echo 128388607 > /proc/sys/net/core/wmem_max
    echo 128388607 > /proc/sys/net/core/rmem_max
- Tuning the network device buffers:
    /sbin/ifconfig eth1 txqueuelen 10000
    /sbin/ifconfig eth1 mtu 9000
- Both on sender and receiver
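The three numbers written to tcp_rmem and tcp_wmem are the minimum, default and maximum buffer sizes in bytes. The helper below is a small sketch, not part of the original procedure: it parses such a triple and checks whether a requested window fits under the maximum. The function names are hypothetical, and reading "40M" as 40 MiB is an assumption.

```python
def parse_tcp_mem(value):
    """Split a 'min default max' sysctl string into three integers (bytes)."""
    minimum, default, maximum = (int(field) for field in value.split())
    return minimum, default, maximum


def fits_under_max(requested_bytes, tcp_mem_value):
    """True if the requested window does not exceed the configured maximum."""
    _, _, maximum = parse_tcp_mem(tcp_mem_value)
    return requested_bytes <= maximum


TCP_WMEM = "4096 65530 128388607"   # value from the slide
requested = 40 * 2**20              # assumption: a 40 MiB sender window

_, _, wmem_max = parse_tcp_mem(TCP_WMEM)
print(f"maximum send buffer: {wmem_max / 2**20:.1f} MiB")
print(f"40 MiB window fits: {fits_under_max(requested, TCP_WMEM)}")
```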
(Slide 20)
Even further tuning: Linux implementation
- The switch from slow start mode to congestion avoidance mode is another performance penalty in the sender for the LSR
- On Linux (sender side only):
    sysctl -w net.ipv4.route.flush=1
- This prevents TCP from using any previously cached window value: slow start keeps growing the congestion window exponentially until it reaches congestion avoidance, instead of stopping the exponential growth at half of some previously cached value
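The cost of a stale cached threshold can be estimated with a toy model: the window doubles every round trip below the slow start threshold and grows by one segment per round trip above it. The segment counts are assumed values for illustration, and the model ignores losses and delayed ACKs.

```python
def rtts_to_reach(target_segments, ssthresh_segments, initial_cwnd=1):
    """Round trips for cwnd to reach the target, given a slow start threshold."""
    cwnd = initial_cwnd
    rtts = 0
    while cwnd < target_segments:
        if cwnd < ssthresh_segments:
            # Slow start: exponential growth, capped at the threshold.
            cwnd = min(cwnd * 2, ssthresh_segments)
        else:
            # Congestion avoidance: linear growth, one segment per RTT.
            cwnd += 1
        rtts += 1
    return rtts


TARGET = 6000            # assumed window, in segments, needed to fill the path
NO_CACHED_VALUE = 10**9  # effectively unlimited threshold (cache flushed)
STALE_CACHED = 1000      # assumed low threshold left over from an earlier connection

print(f"threshold flushed: {rtts_to_reach(TARGET, NO_CACHED_VALUE)} RTTs")
print(f"stale cached threshold: {rtts_to_reach(TARGET, STALE_CACHED)} RTTs")
```

Under these assumptions the flushed case reaches the target window in a handful of round trips, while the stale threshold forces thousands of linear steps.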
(Slide 21)
IPERF parameters
- On the sender:
    iperf -c 192.91.239.213 -i 5 -P 3 -w 40M -t 180
- On the receiver:
    iperf-1.6.5 -s -w 128M
- IPERF is available at http://dast.nlanr.net
(Slide 22)
Conclusions: how useful in practice?
- The LSR result cannot be immediately translated into practical general-purpose recommendations: it relies on
  - some a priori knowledge (the physical link speed)
  - dedicated bandwidth
  - ad hoc TCP tuning: good for LSR, not for general-purpose traffic
- Nevertheless, work is ongoing for a more modern TCP stack: the new LSR demonstrates that fast WAN TCP is possible in practice, by tweaking TCP a bit
(Slide 23)
Conclusions: sustained throughput
- As stated by Internet2, the LSR definition makes only very limited provision for sustained throughput (100 megabytes are not much for today's network bandwidth)
- However, apart from the achieved throughput, this LSR shows that high sustained throughput is in principle possible with TCP over long distance
- This is new with respect to former results, where the throughput could be sustained only for 40-60 seconds, before some TCP feedback mechanism kicked in and ruined the performance
(Slide 24)
Other remarks
- A big difference with respect to the past is that the bottleneck for things like the LSR is now in the end hosts: no non-trivial tuning was needed on the network where the LSR was established
- Incidentally, although single-stream, the new LSR was also good enough to establish the new LSR for multiple IPv4 streams
- No TCP packet was lost during the LSR trial window
- Details of the new record are not yet published on http://lsr.internet2.edu
(Slide 25)
Acknowledgements: people
- Sylvain Ravot (working for Caltech at CERN)
- Wu-Chun Feng (LANL)
- Les Cottrell (SLAC)
(Slide 26)
Acknowledgements: industrial partners
Acknowledgements: organisations
Thank you