
For FAST Meeting, July 2

Towards Gigabit

David Wei, Netlab@Caltech

Potential Problems

- Hardware / Driver / OS
- Protocol Stack Overhead
- Scalability of the protocol specification
- TCP Stability / Utilization (new congestion control algorithms)
- Related Experiments & Measurements

Hardware / Drivers / OS

- NIC Driver
- Device Management (Interrupts)
- Redundant Copies
- Device Polling (http://info.iet.unipi.it/~luigi/polling/)
- Zero-Copy TCP ...

www.cs.duke.edu/ari/publications/talks/freebsdcon

Device Polling

Current process for a NIC driver in FreeBSD:
1. A packet arrives at the NIC.
2. The NIC raises a hardware interrupt.
3. The CPU jumps to the interrupt handler for that NIC.
4. The MAC layer reads data from the NIC into a queue.
5. Upper layers process the data in the queue (at lower priority).

Drawback: the CPU takes an interrupt for every packet, causing
frequent context switches; interrupts become very frequent for
high-speed devices.

Live-lock: the CPU is too busy servicing NIC interrupts to process
the data already in the queue.

Device Polling

Device polling:
- Polling: the CPU checks the device when it has time.
- Scheduling: the user specifies a time ratio for the CPU to split
  between device work and non-device processing (toy sketch below).

Advantages:
- Balances device service against non-device processing.
- Improves performance for fast devices.
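A toy sketch of the budgeted polling idea (hypothetical queues and
budget split, not the FreeBSD implementation):

    # Minimal sketch of budgeted device polling (hypothetical, not the
    # FreeBSD implementation). Each timer tick, the CPU spends at most
    # `device_ratio` of its budget draining the NIC, then returns to
    # upper-layer work, avoiding interrupt live-lock.

    def poll_tick(nic_queue, app_queue, budget=100, device_ratio=0.5):
        device_budget = int(budget * device_ratio)
        # Drain at most device_budget packets from the NIC this tick.
        for _ in range(device_budget):
            if not nic_queue:
                break
            app_queue.append(nic_queue.pop(0))
        # The remaining budget goes to upper-layer processing.
        processed = 0
        while app_queue and processed < budget - device_budget:
            app_queue.pop(0)   # stand-in for real protocol processing
            processed += 1
        return processed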

Protocol Stack Overhead

Per-packet overhead:
- Ethernet header / checksum
- IP header / checksum
- TCP header / checksum
- Copying / interrupt processing

Solution: increase the packet size (illustration below).
Optimal packet size = min{packet size along the path}
(fragmentation results in low performance too).
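A quick, illustrative check of why larger packets help, assuming 40
bytes of TCP+IP headers per packet (Ethernet framing and per-packet
CPU costs would add to this):

    # Illustrative per-packet header overhead for common MTUs.
    HEADERS = 40  # bytes of TCP (20) + IP (20) headers, no options

    for mtu in (576, 1500, 9000, 65535):
        payload = mtu - HEADERS
        efficiency = payload / mtu
        print(f"MTU {mtu:>5}: payload {payload:>5} bytes, "
              f"efficiency {efficiency:.1%}")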

Path MTU Discovery (RFC 1191)

Current method:
- "Don't Fragment" bit (router: drop or fragment; host: test and
  enforce)
- MTU = min{576, first-hop MTU}
- MSS = MTU - 40
- MTU <= 65535 (architecture limit)
- MSS <= 65495 (IP sign-bit bugs...)
- Drawback: usually too small

Path MTU Discovery

How to discover the PMTU?

Current:
- Search (proportional decrease / binary search)
- Update (periodically increase, resetting to the MTU of the first
  hop)

Proposed:
- Search/update using typical MTU values
- Routers: suggest an MTU in the "Datagram Too Big" message that
  reports the dropped DF packet

Path MTU Discovery: Implementation

Host:
- Packetization layer (TCP / connection over UDP): sets DF and the
  packet size
- IP: stores the PMTU for each known path (routing table)
- ICMP: "Datagram Too Big" message

Router:
- Sends an ICMP message when the datagram is too big.

Implementation problems: RFC 2923. (A sketch of the search logic
follows below.)
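A minimal sketch of the search, assuming a hypothetical probe(size)
that sends a DF-marked packet of that size and reports whether it
crossed the path (in practice, by watching for ICMP "Datagram Too
Big" messages); the candidate values follow RFC 1191's plateau table:

    # Hypothetical PMTU search by bisection over plateau values, in
    # the spirit of RFC 1191. probe(size) is assumed to return True
    # if a DF-marked packet of `size` bytes reached the far end.

    COMMON_MTUS = [68, 296, 508, 1006, 1492, 2002, 4352, 8166,
                   17914, 32000, 65535]  # RFC 1191 plateau values

    def discover_pmtu(probe, first_hop_mtu):
        candidates = [m for m in COMMON_MTUS if m <= first_hop_mtu]
        lo, hi = 0, len(candidates) - 1
        best = candidates[0]
        while lo <= hi:
            mid = (lo + hi) // 2
            if probe(candidates[mid]):   # got through: try larger
                best = candidates[mid]
                lo = mid + 1
            else:                        # dropped: go smaller
                hi = mid - 1
        return best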

Scalability of Protocol Specifications

- Window size space (<= 64 KB)
- Sequence number space (wrap-around, <= 2 GB)
- Inadequate frequency of RTT sampling (1 sample per window)

Sequence Number Space

[Figure: animation of sequence-number wrap-around. A sender cycles
through sequence numbers 1, 2, 3, 4, 5, ... while ACKs return; once
the space wraps past 0, a delayed old segment is indistinguishable
from a new one. A segment is accepted only when its
delay <= Max Segment Life (MSL).]

Sequence Number Space

- MSL (Max Segment Life) > variance of IP delay
- MSL < 8 * |sequence number space| / bandwidth
- |SN space| = 2^31 bytes = 2 GB; bandwidth = 1 Gbps => MSL <= 16 sec
- Hence variance of IP delay <= 16 sec
- Current TCP assumes MSL = 3 min: not scalable with bandwidth growth
  (worked check below)

TCP Extensions (RFC 1323)

- Window scaling: a scale factor S for the 16-bit window, negotiated
  in the SYN: Win = [Win] * 2^S
- RTT measurement: a timestamp on each packet (generated by the
  sender, echoed by the receiver)
- PAWS (Protect Against Wrapped Sequence numbers): use the timestamp
  to extend the sequence space (so the timestamp clock must tick
  neither too fast nor too slow: 1 ms ~ 1 sec)
- Header prediction: simplifies processing

(A worked window-scale example follows below.)
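For scale, an illustrative computation of the window a gigabit path
actually needs and the RFC 1323 scale factor that provides it (the
80 ms RTT is an assumption, matching the Net100 trace later):

    # Window needed for a high bandwidth-delay-product path, and the
    # RFC 1323 scale factor that covers it. Numbers are illustrative.
    import math

    bandwidth_bps = 1e9      # 1 Gbps
    rtt_s = 0.080            # 80 ms

    bdp_bytes = bandwidth_bps * rtt_s / 8
    print(f"BDP = {bdp_bytes / 2**20:.1f} MB")      # ~9.5 MB

    # Unscaled TCP windows max out at 64 KB; find S such that
    # 65535 * 2^S covers the BDP (RFC 1323 allows S <= 14).
    scale = math.ceil(math.log2(bdp_bytes / 65535))
    print(f"Required window scale factor: {scale}")  # 8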

High Speed TCP

Floyd '02. Goals:
- Achieve a large window size under realistic loss rates (use the
  current window size in the AIMD parameters)
- High speed in a single connection (10 Gbps)
- Easy to achieve a high sending rate for a given loss rate; how to
  remain TCP-friendly?
- Incrementally deployable (no router support required)

High Speed TCP

Problem in steady state:
- The TCP response function means a large congestion window requires
  a very low loss rate.

Problem in recovery:
- Congestion avoidance takes too long to recover (consecutive
  timeouts); see the worked example below.
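A rough, illustrative computation of the recovery problem (standard
TCP's congestion avoidance adds one segment per RTT):

    # Approximate time for standard TCP to regrow half its window via
    # congestion avoidance (+1 segment per RTT). Numbers illustrative.
    cwnd_segments = 1000     # window before the loss, in segments
    rtt_s = 0.087            # 87 ms RTT (as in the Net100 trace later)

    # After a loss, cwnd is halved, then grows by 1 segment per RTT.
    recovery_rtts = cwnd_segments - cwnd_segments // 2
    print(f"Recovery takes ~{recovery_rtts} RTTs "
          f"= {recovery_rtts * rtt_s:.0f} s")   # ~500 RTTs, ~44 s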

Consecutive Time-out

[Figure: animation of consecutive timeouts. A timeout on segment 1
sets ssthresh = cwnd/2 and slow start restarts with cwnd = 1; the
retransmission times out again, dropping ssthresh to 2 with cwnd
back at 1; once cwnd reaches 2, the connection enters congestion
avoidance with a tiny window.]

High Speed TCP

Change the TCP response function:
- When p is high (above the maxP corresponding to the default cwnd
  size W): behave as standard TCP.
- When p is low (cwnd >= W): use a(w), b(w) instead of the constants
  a, b when adjusting cwnd.
- For a given loss rate P1 and desired window size W1 at P1: derive
  a(w) and b(w), keeping linearity on a log-log scale
  (delta log w proportional to delta log p).

Per-ACK and per-drop updates (sketched in code below):

    ACK:  w <- w + a(w)/w
    Drop: w <- w - b(w)*w
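A compact sketch of this scheme in Python. The anchor points
(w = 38 at p = 1e-3, w = 83000 at p = 1e-7) and the b(w) range of
0.5 down to 0.1 are the commonly cited defaults from Floyd's proposal
(later RFC 3649); treat this as an illustration, not a reference
implementation:

    # HighSpeed-TCP-style window-dependent AIMD.
    import math

    LOW_W, LOW_P = 38, 1e-3        # below this, act as standard TCP
    HIGH_W, HIGH_P = 83000, 1e-7   # desired window W1 at loss rate P1
    B_LOW, B_HIGH = 0.5, 0.1

    def _loglerp(w, y0, y1):
        # Linear interpolation in log(w) between LOW_W and HIGH_W.
        t = (math.log(w) - math.log(LOW_W)) / \
            (math.log(HIGH_W) - math.log(LOW_W))
        return y0 + t * (y1 - y0)

    def b(w):
        return B_LOW if w <= LOW_W else _loglerp(w, B_LOW, B_HIGH)

    def p_of_w(w):
        # Target response function, log-log linear through the anchors.
        if w <= LOW_W:
            return LOW_P
        return math.exp(_loglerp(w, math.log(LOW_P), math.log(HIGH_P)))

    def a(w):
        if w <= LOW_W:
            return 1.0
        # a(w) chosen so the average window at loss rate p_of_w(w) is w.
        return w * w * p_of_w(w) * 2.0 * b(w) / (2.0 - b(w))

    def on_ack(w):   return w + a(w) / w   # per-ACK increase
    def on_drop(w):  return w - b(w) * w   # decrease on loss

At w = 83000 this yields a(w) of about 72 and b(w) = 0.1, matching the
moderate decrease (rather than halving) described next.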

Change TCP Function

Standard TCP response function (on a log-log scale):

    log w = -(log p)/2 + (log 1.5)/2,  i.e.  w = sqrt(1.5/p)

[Figure: log w vs. log p. The standard TCP line passes through
(log P, log W); HighSpeed TCP pivots the line above that point so
that it also passes through the target (log P1, log W1).]
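Plugging numbers into w = sqrt(1.5/p) shows the steady-state problem:
a 10 Gbps window needs an unrealistically low loss rate. Illustrative
arithmetic:

    # Loss rate that standard TCP's response function demands for a
    # given window. ~83000 segments of 1500 bytes over a 100 ms RTT
    # is roughly 10 Gbps.
    def loss_for_window(w_segments):
        return 1.5 / (w_segments ** 2)

    w = 83000
    print(f"w = {w} segments needs p <= {loss_for_window(w):.1e}")
    # ~2.2e-10: far below realistic loss rates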

Expectations

- Achieve a large window under realistic loss rates
- Relative fairness between standard TCP and HighSpeed TCP
  (acquired bandwidth proportional to cwnd)
- Moderate decrease instead of halving the window when congestion is
  detected (0.33 at cwnd = 1000)
- Pre-computed lookup table to implement a(w) and b(w)

Slow Start

Modification of Slow Start:

- Problem: doubling cwnd every RTT is too aggressive for a large
  cwnd.
- Proposal: limit the growth of cwnd per RTT during Slow Start.

[Figure: sending rate vs. time; the rate keeps doubling until a loss
occurs.]

Limited Slow Start

For each ACK (sketched below):
- cwnd <= max_ssthresh: delta cwnd = MSS (standard TCP Slow Start)
- cwnd > max_ssthresh:  delta cwnd = 0.5 * max_ssthresh / cwnd
  (at most 0.5 * max_ssthresh per RTT)

[Figure: sending rate vs. time; growth turns linear above
max_ssthresh instead of continuing to double.]
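The per-ACK rule above in a few lines of Python (working in units of
segments, so MSS = 1):

    # Per-ACK cwnd update for Limited Slow Start.
    def limited_slow_start_ack(cwnd, max_ssthresh):
        if cwnd <= max_ssthresh:
            return cwnd + 1                      # standard slow start
        return cwnd + 0.5 * max_ssthresh / cwnd  # capped growth

    # One RTT delivers roughly cwnd ACKs, so above the threshold cwnd
    # grows by at most 0.5 * max_ssthresh per RTT instead of doubling.
    cwnd = 1.0
    for _ in range(2000):
        cwnd = limited_slow_start_ack(cwnd, max_ssthresh=100)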

Related Projects

- Cray Research ('92); CASA Testbed ('94)
- Duke ('99)
- Pittsburgh Supercomputing Center
- Portland State Univ. ('00)
- Internet2 ('01)
- Web100
- Net100 (built on Web100)

Cray Research '92

TCP/IP performance at Cray Research (Dave Borman)

Configuration:
- HIPPI between two dedicated Y-MPs with Model E IOS and Unicos 8.0
- Memory-to-memory transfer

Results:
- Direct channel-to-channel: MTU 64K - 781 Mbps
- Through a HIPPI switch:
  - MTU 33K - 416 Mbps
  - MTU 49K - 525 Mbps
  - MTU 64K - 605 Mbps

CASA Testbed '94

Applied Network Research of the San Diego Supercomputer Center + UCSD

- Goal: delay and loss characteristics of a HIPPI-based gigabit
  testbed
- Link feature: blocking (HIPPI); tradeoff between high loss rate
  and high delay
- Conclusion: avoiding packet loss is more important than reducing
  delay
- Performance (delay * bandwidth = 2 MB; RFC 1323 on; Cray machines):
  500 Mbps sustained TCP throughput (TTCP/Netperf)

Trapeze/IP (Duke)

Goals:
- What optimization is most useful to reduce host overheads for fast
  TCP?
- How fast does TCP really go, and at what cost?

Approaches:
- Zero-copy
- Checksum offloading

Result: >900 Mbps for MTU > 8K

Trapeze/IP (Duke)

Zero-copy

www.cs.duke.edu/ari/publications/talks/freebsdcon

[Several further image-only slides from the same freebsdcon talk.]

Enabling High Performance Data Transfers on Hosts

By the Pittsburgh Supercomputing Center:

- Enable RFC 1191 MTU discovery
- Enable RFC 1323 large windows
- OS kernel: large enough socket buffers
- Application: set its send and receive socket buffer sizes
  (example below)
- Detailed methods to tune various OSes
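The application-side advice amounts to two setsockopt calls made
before connecting; a minimal sketch, with an illustrative 4 MB buffer
(in practice, size buffers to the path's bandwidth-delay product):

    # An application requesting large socket buffers, per the PSC
    # tuning advice. The 4 MB figure is illustrative.
    import socket

    BUF = 4 * 1024 * 1024  # 4 MB

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF)
    # Buffers must be set before connect()/listen() so that window
    # scaling is negotiated accordingly in the SYN.
    # sock.connect(("example.org", 5001))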

PSU Experiment

Goals:
- Round-trip delay and TCP throughput with different window sizes
- Influence of different devices (Cisco 3508/3524/5500) and
  different NICs

Environment:
- OS: FreeBSD 4.0/4.1 (without RFC 1323?), Linux, Solaris
- WAN: 155 Mbps OC-3 over SONET MAN
- Measurement tools: ping + TTCP

PSU Experiment

- "Smaller" switches and low-level routers can easily muck things up.
- Bugs in Linux 2.2 kernels.
- Different NICs have different performance.
- A fast PCI bus (64 bits * 66 MHz) is necessary.
- Switch MTU size can make a difference (giant packets are better).
- Bigger TCP window sizes can help, but there seems to be a knee
  around 4 MB that is not remarked upon in the literature.

Internet2 Experiment

Goal: a single TCP connection at 700-800 Mbps over the WAN; relations
among window size, MTU, and throughput.

Setup:
- OS: FreeBSD 4.3-RELEASE
- Architecture: 64-bit, 66 MHz PCI + ...
- Configuration: sendspace = recvspace = 1024000
- Topology: direct connection (back-to-back) and WAN
- WAN: symmetric path: host1-Abilene-host2
- Measurement: ping + Iperf

Internet2 Experiment

Back-to-back: no loss; found some bug in FreeBSD 4.3.

Throughput (Mbps) vs. window size and MTU:

    Window    4KB MTU    8KB MTU
    512K      690        855-986
    1M        658        986
    2M        562        986
    4M        217        987
    8M        93         987
    16M       86         985

WAN: <= 200 Mbps; asymmetry in different directions (cache of
MTU...).

Web100

- Goal: make it easy for non-experts to achieve high bandwidth
- Method: get more information out of TCP
- Software:
  - Measurement: embedded into the kernel TCP
  - Application layer: diagnostics / auto-tuning
- Proposal: RFC 2012 (MIB)

Net100

- Built on Web100
- Auto-tunes parameters for non-experts
- Network-aware OS
- Bulk file transport for ORNL
- Implementation of Floyd's High Speed TCP

Floyd's TCP Slow Start on Net100

www.csm.ornl.gov/~dunigan/net100/floyd.html

[Figure: cwnd trace captured via Web100. RTT: 80 ms; sndwnd: 1 MB;
rcvwnd: 2 MB.]

Floyd's TCP AIMD on Net100

www.csm.ornl.gov/~dunigan/net100/floyd.html

[Figure: cwnd trace. RTT: 87 ms; window: 1000 segments;
max_ssthresh: 100 segments; slow start: 1.8 s; MD at cwnd 1000:
*0.33 per timeout; AI at cwnd 700: +8 per RTT. Old TCP: 45 s
recovery.]

Trend (Mathis, Oct 2001)

TCP over a long path:

    Year    Wizard      Non-Wizard    Ratio
    1988    1 Mbps      300 kbps      3:1
    1991    10 Mbps
    1995    100 Mbps
    1999    1 Gbps      3 Mbps        300:1

Related Tools

Measurement:
- Iperf
- tcpdump
- Web100

Emulation:
- Dummynet

NLANR Iperf

Features:
- Sends data from user space
- Supports IPv4/IPv6
- Supports TCP/UDP/multicast...
- Similar software: Auto Tuning Enabled FTP Client/Server

Concern:
- Preemption by other processes in gigabit tests?
  (observed in the Internet2 experiment)

Dummynet

- Now embedded in FreeBSD
- Delay: adds delay at the IP layer
- Loss: random loss at the IP layer

Concerns:
- Overhead
- Pattern of packet loss (toy model below)
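Part of the "pattern of packet loss" concern is that dummynet drops
packets independently at random; a toy model of such a pipe, purely
for illustration (the real dummynet lives in the FreeBSD kernel and
is configured via ipfw):

    # Toy model of a dummynet-style pipe: fixed one-way delay plus
    # independent (Bernoulli) random loss per packet.
    import random
    from collections import deque

    class Pipe:
        def __init__(self, delay_ticks, loss_prob):
            self.delay = delay_ticks
            self.plr = loss_prob
            self.in_flight = deque()   # (deliver_at_tick, packet)

        def send(self, packet, now):
            if random.random() >= self.plr:    # survives the pipe
                self.in_flight.append((now + self.delay, packet))

        def deliverable(self, now):
            out = []
            while self.in_flight and self.in_flight[0][0] <= now:
                out.append(self.in_flight.popleft()[1])
            return out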

Current Status in Netlab@Caltech

100 Mbps testbed in netlab:

[Diagram: two hosts, each with a TCP/IP stack and driver, one running
IP + dummynet; the hosts are connected by UTP cable through a 100M
hub, with a monitor attached to the hub.]

Next Step

1 Gbps testbed in the lab:

[Diagram: the same two-host setup (TCP/IP stacks and drivers, one
with IP + dummynet), now connected through splitters, with a monitor
tapping the link via the splitters.]

Q&A