Transport Level Protocol Performance Evaluation for Bulk Data Transfers
Matei Ripeanu
The University of Chicago
http://www.cs.uchicago.edu/~matei/

Abstract: Before developing new protocols targeted at bulk data transfers, the achievable performance and limitations of the broadly used TCP protocol should be carefully investigated. Our first goal is to explore TCP's bulk-transfer throughput as a function of network path properties, number of concurrent flows, loss rates, competing traffic, etc. We use analytical models, simulations, and real-world experiments. The second objective is to repeat this evaluation for some of TCP's replacement candidates (e.g., NETBLT). This should allow an informed decision on whether (or not) to put effort into developing and/or using new protocols specialized for bulk transfers.

Application requirements (GriPhyN):
• Efficient management of 10s to 100s of PetaBytes (PB) of data; many PBs of new raw data per year.
• Granularity: file sizes of 10 MB to 1 GB.
• Large pipes: OC3 and up, high latencies.
• Efficient bulk data transfers.
• Graceful sharing with other applications.
• Projects: CMS, ATLAS, LIGO, SDSS.
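A back-of-the-envelope check helps explain why OC3-and-up links are required. The minimal Python sketch below converts a yearly data volume into the sustained bandwidth it implies; the per-year volumes are assumed round numbers for illustration, not project figures.

```python
# Back-of-the-envelope check: sustained bandwidth implied by a yearly data volume.
# The volumes below are assumed round numbers for illustration.

SECONDS_PER_YEAR = 365 * 24 * 3600

def sustained_mbps(petabytes_per_year: float) -> float:
    """Average rate (Mbps) needed to move the given volume in one year."""
    bits = petabytes_per_year * 1e15 * 8
    return bits / SECONDS_PER_YEAR / 1e6

LINKS_MBPS = {"T3": 43.2, "OC3": 155.0, "OC12": 622.0}

for pb in (1, 10, 100):
    need = sustained_mbps(pb)
    fits = [name for name, rate in LINKS_MBPS.items() if rate > need]
    print(f"{pb:4d} PB/year -> {need:8.1f} Mbps sustained; links that can carry it: {fits}")
```

Even 1 PB/year corresponds to roughly 250 Mbps sustained, which already exceeds a T3 and consumes most of an OC3.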

[Figure: Stable-state throughput as % of bottleneck link rate (RTT=80 ms, MSS=1460 bytes) vs. loss indication rate (log scale, 1e-8 to 1e-2); curves for a T3 link (43.2 Mbps), an OC3 link (155 Mbps), and an OC12 link (622 Mbps).]

(Rough) analytical stable-state throughput estimates (based on [Math96]):

$$ \text{Throughput} \;\approx\;
\begin{cases}
\dfrac{MSS}{RTT} \cdot \dfrac{C}{\sqrt{p}}, & \text{for } p \ge \dfrac{8}{3\,W_{max}^{2}} \quad \text{(loss-limited)} \\[1.5ex]
\dfrac{MSS}{RTT} \cdot W_{max}, & \text{for } p < \dfrac{8}{3\,W_{max}^{2}} \quad \text{(window-limited)}
\end{cases} $$

where $MSS$ is the segment size, $RTT$ the round-trip time, $p$ the loss indication rate, $W_{max}$ the maximum window size in segments, and $C$ a constant of order 1.
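To make the estimate concrete, here is a small Python sketch that evaluates the two regimes for the RTT=80 ms, MSS=1460 bytes configuration plotted above. The value of C (taken as sqrt(3/2), as in the Mathis et al. model) and the choice of window cap are assumptions for illustration, not parameters stated on the poster.

```python
import math

# Rough stable-state TCP throughput estimate (Mathis-style model).
# C ~ sqrt(3/2) and the window cap below are illustrative assumptions.
C = math.sqrt(3.0 / 2.0)

def tcp_throughput_bps(mss_bytes, rtt_s, loss_rate, w_max_segments):
    """Estimated stable-state throughput in bits/s."""
    if loss_rate >= 8.0 / (3.0 * w_max_segments ** 2):
        # Loss-limited regime: ~ (MSS/RTT) * C / sqrt(p)
        segments_per_rtt = C / math.sqrt(loss_rate)
    else:
        # Window-limited regime: the window reaches its cap
        segments_per_rtt = w_max_segments
    return segments_per_rtt * mss_bytes * 8 / rtt_s

links_mbps = {"T3": 43.2, "OC3": 155.0, "OC12": 622.0}
rtt, mss = 0.080, 1460
for name, rate in links_mbps.items():
    # Assume the window is sized to each link's bandwidth-delay product.
    w_max = rate * 1e6 * rtt / (mss * 8)
    for p in (1e-7, 1e-5, 1e-3):
        tput = tcp_throughput_bps(mss, rtt, p, w_max)
        pct = min(100.0, 100.0 * tput / (rate * 1e6))
        print(f"{name:5s} p={p:.0e}: ~{pct:5.1f}% of link rate")
```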

Main inefficiencies TCP is blamed for:
Overhead. However, less than 15% of the time is spent in TCP processing proper.
Flow control. Claim: a rate-based protocol would be faster. However, there is no proof that this is better than (self-)ACK-clocking.
Congestion control:
• Underlying problem: the lower layers do not give explicit congestion feedback, so TCP assumes any packet loss is a congestion signal.
• Not scalable.

Questions: Is TCP appropriate/usable? What about rate-based protocols?

Want to optimize: link utilization and per-file transfer delay, while maintaining “fair” sharing.

TCP Refresher:

[Figure: congestion window vs. time - slow start (exponential growth), congestion avoidance (linear growth), fast retransmit. Packet loss is discovered either through the fast-recovery mechanism or through a timeout.]
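The refresher can be made concrete with a toy congestion-window trace. The Python sketch below uses an assumed loss schedule purely for illustration; it reproduces the qualitative behavior: exponential growth in slow start, linear growth in congestion avoidance, halving on a fast retransmit, and a reset to one segment on a timeout.

```python
# Toy TCP congestion-window trace (units: segments per RTT, one step per RTT).
# The loss events below are an assumed schedule for illustration only.

def cwnd_trace(rtts=60, ssthresh=32, fast_retransmit_at=(25,), timeout_at=(45,)):
    cwnd, trace = 1.0, []
    for t in range(rtts):
        trace.append(cwnd)
        if t in timeout_at:
            # Timeout: back to slow start from one segment
            ssthresh, cwnd = max(cwnd / 2, 2), 1.0
        elif t in fast_retransmit_at:
            # Fast retransmit / fast recovery: halve the window
            ssthresh = max(cwnd / 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd *= 2          # slow start: exponential growth
        else:
            cwnd += 1          # congestion avoidance: linear growth
    return trace

for t, w in enumerate(cwnd_trace()):
    print(t, round(w, 1))
```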

Simulations (using the ns simulator). Simulation topology: 1 Gbps, 1 ms RTT access links; bottleneck link: OC3 (35 ms) or OC12 (45 ms).

Significant throughput improvements can be achieved just by tuning the end systems and the network path: set proper window sizes, disable delayed ACKs, use SACK and ECN, use jumbo frames, etc.
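As one concrete example of end-system tuning, the sketch below (Python; the link rate and RTT are assumed example values, not measurements) sizes the TCP socket buffers to the path's bandwidth-delay product, which is what "set proper window sizes" amounts to in practice.

```python
import socket

# Size TCP socket buffers to the bandwidth-delay product (BDP) of the path.
# Link rate and RTT below are assumed example values.
LINK_RATE_BPS = 155_000_000      # e.g. an OC3 bottleneck
RTT_SECONDS = 0.080

bdp_bytes = int(LINK_RATE_BPS * RTT_SECONDS / 8)   # ~1.55 MB for OC3 / 80 ms

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Ask the kernel for send/receive buffers of at least one BDP so a single
# flow can keep the pipe full (the OS may clamp these to its own limits).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)

print("requested buffer:", bdp_bytes, "bytes")
print("granted sndbuf:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
```

Setting the buffers before the connection is established matters, since TCP negotiates its window-scale option during the handshake.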

For high link loss rates, striping is a legitimate and effective solution.

[Figure: 100 MB transfer time (sec, log scale) vs. link loss rate (log scale) - OC3 link, 80 ms RTT, MSS=1460 initially. Curves: MSS=1460 with delayed ACKs and an oversized window; MSS=1460 with delayed ACKs and a properly sized window; MSS=1460 with a properly sized window; MSS=9000; MSS=9000 with FACK; MSS=9000 with FACK and 5 flows; ideal.]

[Figure: 1 GB transfer time (sec, log scale) vs. link loss rate (log scale) - OC12 link, 100 ms RTT, MSS=1460 initially. Curves: MSS=1460 with delayed ACKs; MSS=1460 with a properly sized window; MSS=9000; FACK; 5 flows; 25 flows; ideal.]

[Figures: 0.5 GB striped transfer, OC3 link (155 Mbps), RTT = 80 ms, MSS = 9000, using up to 1000 flows, for loss rate = 0 and loss rate = 0.1%: (a) per-flow transfer time (sec) vs. number of parallel flows (stripes) used, with curves for 10 to 1000 flows; (b) packets dropped (left scale) and transfer-time standard deviation (right scale) vs. number of parallel flows used.]

TCP striping issues: widespread usage exposes scaling problems in TCP's congestion control mechanism:
• Unfair allocation: a small number of flows grabs almost all the available bandwidth.
• Reduced efficiency: a large number of packets are dropped.
• Rule of thumb: keep fewer flows in the system than the ‘pipe size’ expressed in packets (see the sketch below).

Not ‘TCP unfriendly’ as long as link loss rates are high.
Even high link loss rates do not break the unfairness.
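A quick way to apply this rule of thumb is the minimal Python sketch below, using the parameters of the striping experiment above (OC3, 80 ms RTT, MSS = 9000).

```python
# Rule-of-thumb check: 'pipe size' in packets = bandwidth * RTT / segment size.
# Parameters match the striping experiment: OC3, 80 ms RTT, MSS = 9000 bytes.
LINK_RATE_BPS = 155_000_000
RTT_SECONDS = 0.080
MSS_BYTES = 9000

pipe_size_packets = LINK_RATE_BPS * RTT_SECONDS / (MSS_BYTES * 8)
print(f"pipe size ~ {pipe_size_packets:.0f} packets")
# ~172 packets here, so running hundreds of parallel flows over this path
# already exceeds the rule of thumb.
```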

Conclusions: TCP can work well with careful end-host and network tuning. For fair sharing with other users, we need mechanisms that provide congestion feedback and distinguish genuine link losses from congestion indications. In addition, admission mechanisms based on the number of parallel flows might be beneficial.

GridFTP and iperf Performance (between LBNL and ANL via ESnet)

[Figure: achieved bandwidth (Mbps) vs. number of TCP streams (0 to 35), for GridFTP and iperf. OC12, ANL to LBNL (56 ms), Linux boxes. Courtesy MCS/ANL.]

Striping: widely used (browsers, FTP, etc.); good practical results; not ‘TCP friendly’!
• RFC 2140 / Ensemble TCP: share information and congestion management among parallel flows.
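To illustrate what striping looks like at the application level, here is a minimal sketch (Python; the server name, port, file name, and the "name offset length" request convention are all hypothetical, not the protocol of any real tool) that splits a transfer into N byte ranges fetched over parallel TCP connections.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

# Hypothetical striped-transfer client. SERVER, PORT, FILE_NAME and the
# "name offset length\n" request convention are illustrative assumptions.
SERVER, PORT = "data.example.org", 5000
FILE_NAME, FILE_SIZE = "dataset.bin", 1_000_000_000
NUM_STREAMS = 8   # rule of thumb: keep this well below the pipe size in packets

def fetch_range(offset: int, length: int) -> bytes:
    """Fetch one byte range of the file over its own TCP connection."""
    with socket.create_connection((SERVER, PORT)) as sock:
        sock.sendall(f"{FILE_NAME} {offset} {length}\n".encode())
        chunks, remaining = [], length
        while remaining > 0:
            data = sock.recv(min(65536, remaining))
            if not data:
                raise IOError("connection closed early")
            chunks.append(data)
            remaining -= len(data)
    return b"".join(chunks)

def striped_download() -> bytes:
    stripe = (FILE_SIZE + NUM_STREAMS - 1) // NUM_STREAMS
    ranges = [(i * stripe, min(stripe, FILE_SIZE - i * stripe))
              for i in range(NUM_STREAMS)]
    with ThreadPoolExecutor(max_workers=NUM_STREAMS) as pool:
        parts = list(pool.map(lambda r: fetch_range(*r), ranges))
    return b"".join(parts)
```

Each stream here runs its own congestion control; RFC 2140 / Ensemble TCP instead shares state across the parallel connections so that the ensemble behaves more like a single flow.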


Future work:
• What are optimal buffer sizes for bulk transfers?
• Can we use ECN and large buffers to reliably detect congestion without using dropped packets as a congestion indicator?
• Assuming the link loss rate pattern is known, can it be used to reliably detect congestion and improve throughput?
