Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 1 Bandwidth Challenges or "How fast can we...
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 1
Bandwidth Challenges or
"How fast can we really drive a Network?"
Richard Hughes-Jones, The University of Manchester
www.hep.man.ac.uk/~rich/ then "Talks"
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 2
Collaboration at SC|05
[Photos: SCINet; the Caltech booth; the BWC at the SLAC booth; ESLEA, Boston Ltd. & Peta-Cache Sun; StorCloud]
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 3
Bandwidth Challenge wins Hat Trick
The maximum aggregate bandwidth was >151 Gbit/s: 130 DVD movies in a minute, or serving 10,000 MPEG2 HDTV movies in real time
22 10-Gigabit Ethernet waves to the Caltech & SLAC/FERMI booths; in 2 hours transferred 95.37 TByte, in 24 hours moved ~475 TBytes
Showed real-time particle event analysis
SLAC/Fermi/UK booth:
1 x 10 Gbit Ethernet to the UK over NLR & UKLight: transatlantic HEP disk-to-disk, VLBI streaming
2 x 10 Gbit links to SLAC: rootd low-latency file access application for clusters; Fibre Channel StorCloud
4 x 10 Gbit links to Fermi: dcache data transfers
[Plot: traffic in to and out of the booth on the SLAC-ESnet, FermiLab-HOPI, SLAC-ESnet-USN, FNAL-UltraLight and UKLight links; the SC2004 record of 101 Gbit/s marked for comparison]
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 4
ESLEA and UKLight
6 x 1 Gbit transatlantic Ethernet layer-2 paths over UKLight + NLR
Disk-to-disk transfers with bbcp, Seattle to UK: set the TCP buffer and application to give ~850 Mbit/s; one stream of data ran at 840-620 Mbit/s
Streamed UDP VLBI data UK to Seattle at 620 Mbit/s
[Plots: rate (Mbit/s) vs date-time, 16:00-23:00 at SC|05, for hosts sc0501-sc0504 (0-1000 Mbit/s each) and for the aggregate UKLight traffic (0-4500 Mbit/s), including the reverse TCP flow]
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 5
SLAC 10 Gigabit Ethernet
2 Lightpaths: routed over ESnet; layer 2 over Ultra Science Net
6 Sun V20Z systems per λ
dcache remote disk data access: 100 processes per node; each node sends or receives; one data stream runs at 20-30 Mbit/s
Used Neterion NICs & Chelsio TOEs; data also sent to StorCloud using fibre channel links
Traffic on the 10 GE link for 2 nodes: 3-4 Gbit/s per node, 8.5-9 Gbit/s on the trunk
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 6
LightPath Topologies
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 7
Switched LightPaths [1]
Lightpaths are a fixed point-to-point path or circuit
Optical links (with FEC) have a BER of 10^-16, i.e. a packet loss rate of ~10^-12, or 1 loss in about 160 days
In SJ5, Lightpaths are known as Bandwidth Channels
Host-to-host Lightpath: one application; no congestion; advanced TCP stacks for large delay-bandwidth products
Lab-to-lab Lightpaths: many applications share; classic congestion points; TCP stream sharing and recovery; advanced TCP stacks
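The "1 loss in about 160 days" figure can be sanity-checked with a little arithmetic. The line rate here is our assumption (the slide does not state one); a steady 1 Gbit/s of 1500-byte packets is a plausible reading:

```python
# Sanity check of the "1 loss in about 160 days" claim for a lightpath with
# BER 10^-16, i.e. a packet loss rate of ~10^-12.
# Assumption (ours, not the slide's): a steady 1 Gbit/s stream of 1500-byte packets.
PACKET_LOSS_RATE = 1e-12          # one lost packet per 10^12 packets
LINE_RATE_BPS = 1e9               # assumed line rate: 1 Gbit/s
PACKET_BITS = 1500 * 8

packets_per_second = LINE_RATE_BPS / PACKET_BITS            # ~83,000 pkt/s
seconds_per_loss = 1.0 / (PACKET_LOSS_RATE * packets_per_second)
days_per_loss = seconds_per_loss / 86400

print(f"~{days_per_loss:.0f} days between losses")          # ~139 days
```

At 1 Gbit/s this gives ~139 days, the same order as the slide's "about 160 days" (a slightly lower sustained rate would give the exact figure).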
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 8
Switched LightPaths [2]
Some applications suffer when using TCP and may prefer to use UDP, DCCP, XCP …
E.g. with e-VLBI the data wave-front gets distorted and correlation fails
User-controlled Lightpaths: Grid scheduling of CPUs & network; many application flows; no congestion on each path; lightweight framing possible
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 9
Network Transport Layer Issues
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 10
Problem #1
Packet Loss
Is it important?
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 11
TCP (Reno) – Packet loss and Time
TCP takes packet loss as an indication of congestion
Time for TCP to recover its throughput after 1 lost 1500-byte packet is given by:
τ = C · RTT² / (2 · MSS)
At 1 Gbit/s: UK 6 ms → 1.6 s; Europe 25 ms → 26 s; USA 150 ms → 28 min
[Plot: time to recover (s, log scale, 10^-4 to 10^5) vs rtt (0-200 ms) for 10 Mbit, 100 Mbit, 1 Gbit and 10 Gbit links; levels for an rtt of ~200 ms @ 1 Gbit/s and for 2 min are marked]
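The recovery-time formula above is easy to evaluate directly. A minimal sketch (C is the line rate in bit/s, MSS in bits); the values land close to the slide's example figures:

```python
def reno_recovery_time(rate_bps, rtt_s, mss_bytes=1500):
    """Time for TCP Reno to regain full rate after one loss: C*RTT^2/(2*MSS)."""
    mss_bits = mss_bytes * 8
    return rate_bps * rtt_s ** 2 / (2 * mss_bits)

# Example RTTs evaluated at 1 Gbit/s:
for name, rtt in [("UK 6 ms", 0.006), ("Europe 25 ms", 0.025),
                  ("USA ~200 ms", 0.200)]:
    print(f"{name}: {reno_recovery_time(1e9, rtt):.1f} s")
```

The quadratic dependence on RTT is the point: at ~200 ms and 1 Gbit/s the recovery time is roughly half an hour, which is why a single loss is so damaging on fast long-distance paths.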
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 12
Packet Loss and new TCP Stacks
TCP Response Function: throughput vs loss rate; further to the right means faster recovery
UKLight London-Chicago-London, rtt 177 ms; packets dropped in the 2.6.6 kernel
Agreement with theory is good
Some new stacks are good at high loss rates
[Plots (sculcc1-chi-2, iperf, 13 Jan 05): TCP achievable throughput (Mbit/s) vs packet drop rate (1 in n, n = 100 to 10^8), on log and linear throughput scales, for standard TCP (A0 1500), HSTCP, Scalable, HTCP, BIC-TCP, Westwood and Vegas, with theory curves for standard and Scalable TCP]
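The "theory" curve for standard TCP is the well-known response function (Mathis et al.): throughput ≈ (MSS/RTT)·sqrt(3/2)/sqrt(p). A minimal sketch, using the slide's 177 ms rtt:

```python
import math

def reno_throughput_mbps(loss_prob, rtt_s=0.177, mss_bytes=1500):
    """Mathis response function for standard TCP:
    throughput ~ (MSS/RTT) * sqrt(3/2) / sqrt(p), returned in Mbit/s."""
    mss_bits = mss_bytes * 8
    return (mss_bits / rtt_s) * math.sqrt(1.5 / loss_prob) / 1e6

# Throughput vs drop rate "1 in n", as on the x-axis of the plots
for n in (10**4, 10**6, 10**8):
    print(f"drop 1 in {n}: {reno_throughput_mbps(1 / n):.1f} Mbit/s")
```

At a drop rate of 1 in 10^6 this gives only ~83 Mbit/s on the 177 ms path, which is why the new stacks, whose response functions sit further to the right, matter so much at Gbit rates.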
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 13
High Throughput Demonstrations
Geneva, rtt 128 ms
[Diagram: dual Xeon 2.2 GHz hosts man03 and lon01, 1 GEth links to Cisco 7609s on the 2.5 Gbit SDH MB-NG core, and via Cisco GSRs through Chicago to Geneva]
Send data with TCP; drop packets; monitor TCP with Web100
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 14
TCP Throughput – DataTAG
Different TCP stacks tested on the DataTAG network; rtt 128 ms; drop 1 in 10^6
High-Speed: rapid recovery
Scalable: very fast recovery
Standard: recovery would take ~20 mins
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 15
Problem #2
Is TCP fair?
Look at Round Trip Times & Max Transfer Unit
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 16
MTU and Fairness
Two TCP streams share a 1 Gb/s bottleneck, RTT = 117 ms
MTU = 3000 bytes: avg. throughput over a period of 7000 s = 243 Mb/s
MTU = 9000 bytes: avg. throughput over a period of 7000 s = 464 Mb/s
Link utilization: 70.7%
[Diagram: Host #1 and Host #2 at CERN (GVA), 1 GE each into a GbE switch, POS 2.5 Gbps to Starlight (Chi), 1 GE bottleneck to the receiving hosts]
[Plot: throughput (Mbps) vs time (s) for the two streams, with the average over the life of each connection]
Sylvain Ravot, DataTAG 2003
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 17
RTT and Fairness
Two TCP streams share a 1 Gb/s bottleneck; MTU = 9000 bytes
CERN <-> Sunnyvale, RTT = 181 ms: avg. throughput over a period of 7000 s = 202 Mb/s
CERN <-> Starlight, RTT = 117 ms: avg. throughput over a period of 7000 s = 514 Mb/s
Link utilization: 71.6%
[Diagram: hosts at CERN (GVA) into a GbE switch, POS 2.5 Gb/s to Starlight (Chi), then POS 10 Gb/s / 10GE on to Sunnyvale; 1 GE bottleneck]
[Plot: throughput (Mbps) vs time (s) for the two streams, with the average over the life of each connection]
Sylvain Ravot, DataTAG 2003
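The RTT bias in these measurements follows directly from AIMD: each flow adds one MSS to its window per RTT, so the short-RTT flow grows its rate faster and keeps a larger share after every synchronised loss. A toy fluid simulation (ours, not the DataTAG measurement code) of two Reno flows on a shared 1 Gb/s bottleneck reproduces the asymmetry:

```python
# Toy AIMD (Reno) model of two flows sharing a 1 Gb/s bottleneck, with the
# RTTs from the slide.  Each flow adds 1 MSS to cwnd per RTT; both halve
# cwnd whenever the summed rate exceeds the link (synchronised-loss
# assumption).  This is an illustration, not the measurement setup.
BOTTLENECK_BPS = 1e9
MSS_BITS = 9000 * 8                     # 9000-byte MTU, as on the slide
flows = {"RTT=117ms": {"rtt": 0.117, "cwnd": 10.0},
         "RTT=181ms": {"rtt": 0.181, "cwnd": 10.0}}

dt, duration = 0.05, 7000.0
totals = {k: 0.0 for k in flows}
t = 0.0
while t < duration:
    rates = {k: f["cwnd"] * MSS_BITS / f["rtt"] for k, f in flows.items()}
    if sum(rates.values()) > BOTTLENECK_BPS:    # loss epoch: both flows halve
        for f in flows.values():
            f["cwnd"] /= 2
    for k, f in flows.items():
        f["cwnd"] += dt / f["rtt"]              # additive increase: 1 MSS per RTT
        totals[k] += rates[k] * dt
    t += dt

for k, total in totals.items():
    print(f"{k}: avg {total / duration / 1e6:.0f} Mb/s")
```

In steady state the model gives the short-RTT flow roughly (181/117)² ≈ 2.4 times the bandwidth of the other, the same asymmetry as the measured 514 vs 202 Mb/s.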
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 18
Problem #n
Do TCP Flows Share the Bandwidth?
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 19
Test of TCP Sharing: Methodology (1 Gbit/s)
Chose 3 paths from SLAC (California): Caltech (10 ms), Univ. Florida (80 ms), CERN (180 ms)
Used iperf/TCP and UDT/UDP to generate traffic
Each run was 16 minutes, in 7 regions
[Diagram: iperf or UDT traffic plus 1/s ICMP/ping probes from SLAC to CERN across the TCP/UDP bottleneck; regions of 2 min and 4 min]
Les Cottrell & RHJ, PFLDnet 2005
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester20
Low performance on fast long distance paths AIMD (add a=1 pkt to cwnd / RTT, decrease cwnd by factor b=0.5 in congestion) Net effect: recovers slowly, does not effectively use available bandwidth, so poor
throughput Unequal sharing
TCP Reno single stream
Congestion has a dramatic effect
Recovery is slow
Increase recovery rate
SLAC to CERN
RTT increases when achieves best throughput
Les Cottrell & RHJ PFLDnet 2005
Remaining flows do not take up slack when flow removed
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 21
Hamilton TCP: one of the best performers
Throughput is high; big effects on RTT when it achieves best throughput; flows share equally
SLAC-CERN: appears to need >1 flow to achieve best throughput
Two flows share equally; >2 flows appears less stable
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 22
Problem #n+1
To SACK or not to SACK?
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 23
The SACK Algorithm
SACK rationale: non-contiguous blocks of data can be ACKed; the sender retransmits just the lost packets; this helps when multiple packets are lost in one TCP window
But SACK processing is inefficient for large bandwidth-delay products: the sender's write queue (a linked list) is walked for each SACK block, to mark lost packets and to retransmit
Processing takes so long that the input queue becomes full and timeouts occur
[Plots: standard SACKs vs updated SACKs, HS-TCP at rtt 150 ms; Dell 1650 2.8 GHz, PCI-X 133 MHz, Intel Pro/1000]
Doug Leith, Yee-Ting Li
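The scale of that list walk is set by the bandwidth-delay product in packets: that is how many write-queue entries one SACK may force the sender to traverse. A quick sketch of the numbers:

```python
# Window sizes that make linked-list SACK processing hurt: the
# bandwidth-delay product in packets is the length of the retransmission
# queue a single SACK block may force the sender to walk.
def bdp_packets(rate_bps, rtt_s, mss_bytes=1500):
    """Outstanding packets needed to fill the pipe."""
    return int(rate_bps * rtt_s / (mss_bytes * 8))

for rate_gbps in (1, 10):
    w = bdp_packets(rate_gbps * 1e9, 0.150)
    print(f"{rate_gbps} Gbit/s, rtt 150 ms: ~{w} packets in flight per walk")
```

At 10 Gbit/s and 150 ms that is ~125,000 entries per walk, repeated for every arriving ACK, which is why the processing falls behind and timeouts follow.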
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 24
What does the User & Application make of this?
The view from the Application
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester25
SC2004: Disk-Disk bbftp bbftp file transfer program uses TCP/IP UKLight: Path:- London-Chicago-London; PCs:- Supermicro +3Ware RAID0 MTU 1500 bytes; Socket size 22 Mbytes; rtt 177ms; SACK off Move a 2 Gbyte file Web100 plots:
Standard TCP Average 825 Mbit/s (bbcp: 670 Mbit/s)
Scalable TCP Average 875 Mbit/s (bbcp: 701 Mbit/s
~4.5s of overhead)
Disk-TCP-Disk at 1Gbit/sis here!
0
500
1000
1500
2000
2500
0 5000 10000 15000 20000
time msT
CP
Ach
ive M
bit
/s
050000001000000015000000200000002500000030000000350000004000000045000000
Cw
nd
InstaneousBW
AveBW
CurCwnd (Value)
0
500
1000
1500
2000
2500
0 5000 10000 15000 20000
time ms
TC
PA
ch
ive M
bit
/s
050000001000000015000000200000002500000030000000350000004000000045000000
Cw
nd
InstaneousBWAveBWCurCwnd (Value)
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 26
SC|05 HEP: Moving data with bbcp
What is the end-host doing with your network protocol? Look at the PCI-X
3Ware 9000 controller RAID0; 1 Gbit Ethernet link; 2.4 GHz dual Xeon; ~660 Mbit/s
[Bus traces: the PCI-X bus with the RAID controller reads from disk for 44 ms every 100 ms; the PCI-X bus with the Ethernet NIC writes to the network for 72 ms]
Power is needed in the end hosts, and careful application design
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester27
VLBI: TCP Stack & CPU Load Real User problem! End host TCP flow at 960 Mbit/s with rtt 1 ms falls to 770 Mbit/s when rtt 15 ms
mk5-606-g7_10Dec05
0.0010.0020.0030.0040.0050.0060.0070.0080.0090.00
100.00
0 2 4 6 8 10 12 14 16 18 20nice large value - low priority
% C
PU
mo
de
se
nd
kernel
user
nice
idle
no CPU load
0
200
400
600
800
1000
0 2 4 6 8 10 12 14 16 18 20nice large value - low priority
Thro
ughput
Mbit/s
no CPU load
1.2GHz PIII TCP iperf rtt 1 ms 960 Mbit/s
94.7% kernel mode idle 1.5 % TCP iperf rtt 15 ms 777 Mbit/s
96.3% kernel mode idle 0.05 % CPULoad with nice priority
Throughput falls as priorityincreases
No Loss No Timeouts
Not enough CPU powermk5-606-g7_17Jan05
0.0010.0020.0030.0040.0050.0060.0070.0080.0090.00
100.00
0 2 4 6 8 10 12 14 16 18 20nice large value - low priority
% C
PU
mo
de
se
nd
kernel
user
nice
idle
no CPU load
0
200
400
600
800
1000
0 2 4 6 8 10 12 14 16 18 20nice large value - low priority
Thro
ughput
Mbit/s
no CPU load
2.8 GHz Xeon rtt 1 ms TCP iperf 916 Mbit/s
43% kernel mode idle 55% CPULoad with nice priority
Throughput constant as priority increases
No Loss No Timeouts
Kernel mode includes TCP stackand Ethernet driver
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 28
ATLAS Remote Computing: Application Protocol
Event request: the EFD requests an event from the SFI; the SFI replies with the event (~2 Mbytes)
Processing of the event
Return of the computation: the EF asks the SFO for buffer space; the SFO sends OK; the EF transfers the results of the computation
tcpmon, an instrumented TCP request-response program, emulates the Event Filter EFD-to-SFI communication
[Diagram: time sequence between the Event Filter Daemon (EFD), SFI and SFO: request event, send event data, process event, request buffer, send OK, send processed event; plus a request-response time histogram]
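The request-response pattern being emulated is simple to sketch. The following is our illustration on localhost, not the actual ATLAS tcpmon tool: a client sends a 64-byte request, the server returns a ~2 Mbyte "event" (as in the EFD-to-SFI exchange), and each exchange is timed:

```python
# Minimal request-response timing sketch in the spirit of tcpmon
# (hypothetical localhost demo, not the ATLAS instrumentation).
import socket, threading, time

EVENT_SIZE = 2 * 1024 * 1024            # ~2 Mbyte "event"

def recv_exact(conn, n):
    """Read exactly n bytes (or fewer only on EOF)."""
    data = b""
    while len(data) < n:
        chunk = conn.recv(n - len(data))
        if not chunk:
            break
        data += chunk
    return data

def server(listener):
    conn, _ = listener.accept()
    with conn:
        while len(recv_exact(conn, 64)) == 64:   # full 64-byte request
            conn.sendall(b"x" * EVENT_SIZE)      # reply with the event

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=server, args=(listener,), daemon=True).start()

client = socket.create_connection(listener.getsockname())
times = []
for _ in range(3):
    start = time.perf_counter()
    client.sendall(b"r" * 64)
    event = recv_exact(client, EVENT_SIZE)
    times.append(time.perf_counter() - start)
client.close()
print([f"{t * 1e3:.1f} ms" for t in times])
```

Run over a real 20 ms path instead of localhost, the per-exchange times expose exactly the slow-start and cwnd-reset effects shown on the next two slides.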
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 29
tcpmon: TCP Activity Manc-CERN Req-Resp
Web100 hooks for TCP status; round trip time 20 ms; 64-byte request (green), 1 Mbyte response (blue)
TCP is in slow start: the 1st event takes 19 rtt, or ~380 ms
The TCP congestion window gets re-set on each request: the TCP stack follows RFC 2581 & RFC 2861, reducing cwnd after inactivity
Even after 10 s, each response takes 13 rtt, or ~260 ms
Transfer achievable throughput: 120 Mbit/s
Event rate very low; the application is not happy!
[Web100 plots: DataBytesOut/DataBytesIn and CurCwnd vs time (ms), and TCP achieved throughput (Mbit/s) vs time]
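The cost of that cwnd reset can be estimated by counting slow-start rounds. A simplified sketch (initial cwnd of 2 segments, window doubling each rtt, ignoring delayed ACKs, setup and the request leg, so the measured counts on the slide are larger):

```python
import math

def slowstart_rtts(response_bytes, mss=1460, initial_cwnd=2):
    """Round trips to deliver a response in pure slow start (cwnd doubles
    each rtt; ignores delayed ACKs, connection setup and the request leg)."""
    segments = math.ceil(response_bytes / mss)
    cwnd, sent, rtts = initial_cwnd, 0, 0
    while sent < segments:
        sent += cwnd
        cwnd *= 2
        rtts += 1
    return rtts

# Even this lower bound for a 1 Mbyte response is many rtts; at 20 ms per
# rtt it dominates the exchange, and the RFC 2861 cwnd reset makes every
# post-idle request pay the price again.
print(slowstart_rtts(1_000_000))   # 9
```

This is why keeping cwnd open between requests (next slide) changes the response time from ~13 rtt to ~2 rtt.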
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 30
tcpmon: TCP Activity Manc-CERN Req-Resp, no cwnd reduction
Round trip time 20 ms; 64-byte request (green), 1 Mbyte response (blue)
TCP starts in slow start: the 1st event takes 19 rtt, or ~380 ms
The TCP congestion window now grows nicely: responses take 3, then 2, round trips
After ~1.5 s a response takes 2 rtt; rate ~10/s (with a 50 ms wait)
Transfer achievable throughput grows to 800 Mbit/s
Data is transferred WHEN the application requires it
[Web100 plots: DataBytesOut/DataBytesIn, packets in/out and CurCwnd vs time (ms), and TCP achieved throughput (Mbit/s) vs time]
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 31
HEP: Service Challenge 4
Objective: demo 1 Gbit/s aggregate bandwidth between RAL and 4 Tier 2 sites
RAL has SuperJANET4 and UKLight links; RAL capped firewall traffic at 800 Mbit/s
SuperJANET sites: Glasgow, Manchester, Oxford, QMUL
UKLight site: Lancaster
Many concurrent transfers from RAL to each of the Tier 2 sites
Achieved ~700 Mbit/s over UKLight and a peak of 680 Mbit/s over SJ4
[Diagram: RAL Tier 1 and Tier 2 site network: 5510/5530 switch stacks, Router A and the UKLight router; CPU + disk farms, ADS caches and Oracle RACs; 10 Gb/s and N x 1 Gb/s internal links; 4 x 1 Gb/s to CERN; 1 Gb/s to Lancaster; firewall with 1 Gb/s to SJ4]
Applications are able to sustain high rates
SuperJANET5, UKLight & the new access links are very timely
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 32
Summary & Conclusions
Well, you CAN fill the links at 1 and 10 Gbit/s, but it's not THAT simple
Packet loss is a killer for TCP: check campus links & equipment and the access links to backbones; users need to collaborate with the campus network teams and the Dante PERT
New stacks are stable and give better response & performance: still need to set the TCP buffer sizes! Check other kernel settings, e.g. the window-scale maximum, and watch for "TCP stack implementation enhancements"
TCP tries to be fair: a large MTU has an advantage; short distances (small RTT) have an advantage
TCP does not share bandwidth well with other streams
The end hosts themselves: plenty of CPU power is required for the TCP/IP stack as well as the application; packets can be lost in the IP stack through lack of processing power / CPU scheduling; the interaction between hardware, protocol processing and the disk sub-system is complex
Application architecture & implementation are also important: the TCP protocol dynamics strongly influence the behaviour of the application
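"Still need to set the TCP buffer sizes" means, at application level, sizing SO_SNDBUF/SO_RCVBUF to at least the bandwidth-delay product before connecting. A minimal sketch:

```python
import socket

def bdp_bytes(rate_bps, rtt_s):
    """Bandwidth-delay product: the socket buffer needed to fill the pipe."""
    return int(rate_bps * rtt_s / 8)

# 1 Gbit/s over the 177 ms London-Chicago-London path needs ~22 Mbytes,
# exactly the socket size quoted on the SC2004 bbftp slide.
buf = bdp_bytes(1e9, 0.177)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Buffers must be set before connect()/listen() so the TCP window scale is
# negotiated; the kernel may still cap them (net.core.[rw]mem_max on Linux).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf)
print(buf)
sock.close()
```

If the kernel cap is lower than the BDP, the advertised window, not the network, becomes the bottleneck, whichever TCP stack is in use.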
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 33
More Information: Some URLs
UKLight web site: http://www.uklight.ac.uk
ESLEA web site: http://www.eslea.uklight.ac.uk
MB-NG project web site: http://www.mb-ng.net/
DataTAG project web site: http://www.datatag.org/
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
"Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS special issue, 2004: http://www.hep.man.ac.uk/~rich/
TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing, 2004
PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 34
Any Questions?
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 35
Backup Slides
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 36
More Information: Some URLs 2
Lectures, tutorials etc. on TCP/IP:
www.nv.cc.va.us/home/joney/tcp_ip.htm
www.cs.pdx.edu/~jrb/tcpip.lectures.html
www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
www.jbmelectronics.com/tcp.htm
Encyclopaedia: http://www.freesoft.org/CIE/index.htm
TCP/IP resources: www.private.org.il/tcpip_rl.html
Understanding IP addresses: http://www.3com.com/solutions/en_US/ncs/501302.html
Configuring TCP (RFC 1122): ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
Assigned protocols, ports etc. (RFC 1010): http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 37
Packet Loss with new TCP Stacks
TCP Response Function: throughput vs loss rate; further to the right means faster recovery
Packets dropped in the kernel; MB-NG rtt 6 ms, DataTAG rtt 120 ms
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester38
Drop 1 in 25,000 rtt 6.2 ms Recover in 1.6 s
High Performance TCP – MB-NG
Standard HighSpeed Scalable
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 39
FAST
As well as packet loss, FAST uses RTT to detect congestion
RTT is very stable: σ(RTT) ~ 9 ms vs 37±0.14 ms for the others
SLAC-CERN: big drops in throughput, which take several seconds to recover from
The 2nd flow never gets an equal share of the bandwidth
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 40
SACK …
Look into what's happening at the algorithmic level with Web100:
Strange hiccups in cwnd; the only correlation is with SACK arrivals
Scalable TCP on MB-NG with 200 Mbit/s CBR background
Yee-Ting Li
Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester 41
10 Gigabit Ethernet: Tuning PCI-X
16080-byte packets every 200 µs; Intel PRO/10GbE LR adapter; PCI-X bus occupancy vs mmrbc
Measured times, and times based on PCI-X timings from the logic analyser
Expected throughput ~7 Gbit/s; measured 5.7 Gbit/s
[Plots (kernel 2.6.1 #17, HP Itanium, Intel 10GE, Feb 04): PCI-X transfer time (µs) and transfer rate (Gbit/s) vs max memory read byte count (mmrbc = 512, 1024, 2048, 4096 bytes); measured vs expected, against the PCI-X maximum throughput; 5.7 Gbit/s reached at mmrbc 4096]
[Logic analyser trace: CSR access, PCI-X sequence, data transfer, interrupt & CSR update]