For FAST Meeting July.2 Towards Gigabit David Wei Netlab@Caltech.
For FAST Meeting July.2
Towards Gigabit
David Wei, Netlab@Caltech
Potential Problems
- Hardware / Driver / OS
- Protocol Stack Overhead
- Scalability of the protocol specification
- TCP Stability / Utilization (New Congestion Control Algorithm)
- Related Experiments & Measurements
Hardware / Drivers / OS
- NIC Driver
- Device Management (Interrupts)
- Redundant Copies
- Device Polling (http://info.iet.unipi.it/~luigi/polling/)
- Zero-Copy TCP ...
(www.cs.duke.edu/ari/publications/talks/freebsdcon)
Device Polling
Current process for a NIC driver in FreeBSD:
1. Packet arrives at the NIC.
2. NIC raises a hardware interrupt.
3. CPU jumps to the interrupt handler for that NIC.
4. The MAC-layer process reads data from the NIC into a queue.
5. Upper layers process the data in the queue (lower priority).
Drawback: the CPU checks the NIC for every packet -- context switching; frequent interruption for high-speed devices.
Live-lock: the CPU is too busy servicing NIC interrupts to process the data in the queue.
Device Polling
- Polling: the CPU checks the device when it has time.
- Scheduling: the user specifies a time ratio for the CPU to split between device work and non-device processing.
Advantages:
- Balances device service against non-device processing.
- Improves performance for fast devices.
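The polling idea above can be sketched as a toy model (illustrative Python, not the FreeBSD driver code; the `budget` parameter stands in for the user-specified time ratio):

```python
from collections import deque

def poll_nic(rx_queue, upper_layer, budget):
    """Drain at most `budget` packets from the NIC receive queue,
    then return control so non-device processing gets CPU time."""
    handled = 0
    while rx_queue and handled < budget:
        upper_layer.append(rx_queue.popleft())
        handled += 1
    return handled

rx = deque(range(10))   # 10 packets waiting on the NIC
app = []
served = poll_nic(rx, app, budget=4)   # one polling visit
```

Because the loop stops at the budget, a packet burst can never monopolize the CPU the way per-packet interrupts can, which is exactly the live-lock the slide describes.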
Protocol Stack Overhead
Per-packet overhead:
- Ethernet header / checksum
- IP header / checksum
- TCP header / checksum
- Copying / interrupt processing
Solution: increase the packet size.
- Optimal packet size = min{packet sizes along the path} (fragmentation results in low performance too).
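The per-packet overhead argument in numbers -- a minimal sketch assuming plain 14-byte Ethernet, 20-byte IP, and 20-byte TCP headers (no options; copy/interrupt costs ignored):

```python
def payload_fraction(mtu, eth_hdr=14, ip_hdr=20, tcp_hdr=20):
    """Fraction of each on-the-wire packet that is application payload.
    The MTU covers the IP and TCP headers plus payload; the Ethernet
    header sits outside it."""
    payload = mtu - ip_hdr - tcp_hdr
    return payload / (mtu + eth_hdr)

small = payload_fraction(576)    # the old default MTU
jumbo = payload_fraction(9000)   # a jumbo-frame MTU
```

Larger packets also amortize the fixed per-packet interrupt and copy work, which is where most of the real savings come from.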
Path MTU Discovery (RFC 1191)
Current method: "Don't Fragment" bit (router: drop/fragment; host: test/enforce)
- MTU = min{576, first-hop MTU}
- MSS = MTU - 40
- MTU <= 65535 (architecture)
- MSS <= 65495 (IP sign-bit bugs...)
Drawback: usually too small.
Path MTU Discovery
How to discover the PMTU?
Current:
- Search (proportional decrease / binary)
- Update (periodically increase; reset to the MTU of the first hop)
Proposed:
- Search/update using typical MTU values
- Routers: include a suggested MTU in the "Datagram Too Big" message that reports the DF packet drop.
Path MTU Discovery: Implementation
Host:
- Packetization layer (TCP / connection over UDP): sets DF and the packet size
- IP: stores the PMTU for each known path (routing table)
- ICMP: "Datagram Too Big" message
Router:
- Sends an ICMP packet when a datagram is too big.
Implementation problems: RFC 2923
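The "typical MTU values" search could be sketched as follows; the plateau list follows RFC 1191's common-MTU table, and `discover_pmtu` is a hypothetical loop standing in for real probing with DF packets and "Datagram Too Big" replies:

```python
# Common link MTUs (the plateau table suggested in RFC 1191).
PLATEAUS = [65535, 32000, 17914, 8166, 4352, 2002, 1492, 1006, 508, 296, 68]

def next_pmtu_guess(failed_mtu):
    """Largest plateau strictly below the MTU that was just dropped."""
    for p in PLATEAUS:
        if p < failed_mtu:
            return p
    return PLATEAUS[-1]

def discover_pmtu(path_min_mtu, first_hop_mtu):
    """Step down the plateau table until a DF packet would fit.
    `path_min_mtu` plays the role of the real path's bottleneck MTU."""
    guess = first_hop_mtu
    while guess > path_min_mtu:   # a router would answer "Datagram Too Big"
        guess = next_pmtu_guess(guess)
    return guess

pmtu = discover_pmtu(path_min_mtu=1500, first_hop_mtu=9000)
```

Stepping through a short plateau table converges in a handful of probes, versus the many round trips a fine-grained binary search can cost.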
Scalability of Protocol Specifications
- Window size space (<= 64 KB)
- Sequence number space (wrapping; <= 2 GB)
- Inadequate frequency of RTT sampling (1 sample per window)
Sequence Number Space
[Figure: sequence numbers 1..7 on a circle; after ACKs for 6 and 7 the space wraps back to 0, so a delayed old segment is ambiguous. It is accepted only when its delay <= Max Segment Life.]
Sequence Number Space
- MSL (Max Segment Life) > variance of IP delay
- MSL < sequence number space / bandwidth
Sequence Number Space
- MSL > variance of IP delay
- MSL < 8 * |sequence number space| / bandwidth
- |SN space| = 2^31 = 2 GB; bandwidth = 1 Gbps
- MSL <= 16 sec, so variance of IP delay <= 16 sec
- Current TCP: 3 min.
- Not scalable with bandwidth growth
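The bound above as a worked calculation (the slide rounds 8 * 2^31 / 10^9 ≈ 17.2 s down to 16 s):

```python
SEQ_SPACE_BYTES = 2 ** 31    # usable sequence space: "2 GB" on the slide
BANDWIDTH_BPS = 1e9          # 1 Gbit/s sender

# Time for a 1 Gbit/s sender to wrap the sequence space; the MSL (and the
# variance of IP delay) must stay below this bound.
wrap_seconds = 8 * SEQ_SPACE_BYTES / BANDWIDTH_BPS

# TCP's traditional MSL assumption is an order of magnitude above it:
current_msl_seconds = 3 * 60
```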
TCP Extensions (RFC 1323)
- Window scaling: a scale factor S carried in the SYN; the 16-bit window field is interpreted as Win * 2^S.
- RTT measurement: a timestamp in each packet (generated by the sender, echoed by the receiver).
- PAWS (Protect Against Wrapped Sequence numbers): use timestamps to extend the sequence space. (So the timestamp clock should tick neither too fast nor too slow: 1 ms ~ 1 sec per tick.)
- Header prediction: simplify processing of the common case.
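Window scaling in concrete terms -- a small sketch computing the minimal RFC 1323 scale factor for a given bandwidth-delay product (the 1 Gbit/s, 80 ms example is illustrative):

```python
def min_scale_factor(target_window_bytes):
    """Smallest shift S (0..14) such that 65535 << S covers the target."""
    s = 0
    while (65535 << s) < target_window_bytes and s < 14:
        s += 1
    return s

bdp = int(1e9 / 8 * 0.080)       # 1 Gbit/s at 80 ms RTT: 10,000,000 bytes
scale = min_scale_factor(bdp)    # advertised once, in the SYN
effective_window = 65535 << scale
```

Without the option the window tops out at 64 KB, which at 80 ms RTT caps a single connection near 6.5 Mbps regardless of link speed.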
High Speed TCP
Floyd '02. Goals:
- Achieve a large window size under realistic loss rates (use the current window size in the AIMD parameters).
- High speed in a single connection (10 Gbps).
- Easy to achieve a high sending rate for a given loss rate. How to achieve TCP-friendliness?
- Incrementally deployable (no router support required).
High Speed TCP
Problem in steady state:
- TCP response function: w = (1.5/p)^(1/2), so a large congestion window requires a very low loss rate.
Problem in recovery:
- Congestion avoidance takes too long to recover (consecutive timeouts).
Consecutive Time-out
[Figure, built over two slides: each timeout sets ssthresh = cwnd/2 and restarts slow start at cwnd = 1; after repeated timeouts ssthresh falls to 2, so slow start ends at cwnd = 2 and recovery continues in slow congestion avoidance.]
High Speed TCP
Change the TCP response function:
- p high (above the maxP corresponding to the default cwnd size W): standard TCP.
- p low (cwnd >= W): use a(w), b(w) instead of the constants a, b in the adjustment of cwnd:
  ACK:  w <- w + a(w)/w
  Drop: w <- w - b(w)*w
- For a given loss rate P1 and desired window size W1 at P1: derive a(w) and b(w), keeping linearity on a log-log scale (Δ log W ∝ Δ log P).
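The two update rules can be sketched directly; the a(w), b(w) values below the threshold are standard TCP's (a = 1, b = 0.5), while the values above it are made-up placeholders standing in for the fitted tables:

```python
LOW_WINDOW = 38   # illustrative threshold W, in segments

def a(w):
    return 1.0 if w <= LOW_WINDOW else 2.0    # hypothetical a(w) > 1

def b(w):
    return 0.5 if w <= LOW_WINDOW else 0.33   # gentler decrease at large w

def on_ack(cwnd):
    return cwnd + a(cwnd) / cwnd    # w <- w + a(w)/w per ACK

def on_drop(cwnd):
    return cwnd - b(cwnd) * cwnd    # w <- w - b(w)*w per drop
```

Below the threshold the rules reduce exactly to standard TCP, which is what keeps the scheme TCP-friendly at ordinary window sizes.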
Change TCP Function
Standard TCP: log w = -(log p)/2 + (log 1.5)/2
[Figure, built over four slides: the standard response function on log-log axes -- a line of slope -1/2 with intercept (log 1.5)/2; the standard operating point (log P, log W); and the target point (log P1, log W1) through which the modified response line must pass.]
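The construction in the plots, in numbers. The target point (window 83000 segments at loss rate 10^-7) follows Floyd's illustrative figures and should be treated as an example; the pivot point sits on the standard response line:

```python
import math

def standard_w(p):
    """Standard TCP response: log w = -(log p)/2 + (log 1.5)/2."""
    return math.sqrt(1.5 / p)

W = 38                    # default window below which standard TCP is kept
P = 1.5 / W ** 2          # loss rate putting standard TCP at W
W1, P1 = 83000, 1e-7      # desired operating point in the fast regime

# Slope of the new log-log line through (log P, log W) and (log P1, log W1):
slope = (math.log(W1) - math.log(W)) / (math.log(P1) - math.log(P))

def highspeed_w(p):
    """Modified response function: linear on a log-log plot."""
    return math.exp(math.log(W) + slope * (math.log(p) - math.log(P)))
```

The fitted slope is shallower than -1/2, so at low loss rates the modified function sustains a far larger window than standard TCP's sqrt(1.5/p).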
Expectations
- Achieve a large window under a realistic loss rate.
- Relative fairness between standard TCP and High Speed TCP (acquired bandwidth ∝ cwnd).
- Moderate decrease instead of halving the window when congestion is detected (factor 0.33 at cwnd = 1000).
- Pre-computed look-up table to implement a(w) and b(w).
Slow Start
Modification of slow start:
- Problem: doubling cwnd each RTT is too aggressive for large cwnd.
- Proposal: limit Δcwnd per RTT during slow start.
[Figure: sending rate vs. time, with a loss marked where doubling overshoots.]
Limited Slow Start
For each ACK:
- cwnd <= max_ssthresh: Δcwnd = MSS (standard TCP slow start)
- cwnd > max_ssthresh: Δcwnd = 0.5 * max_ssthresh / cwnd (at most 0.5 * max_ssthresh per RTT)
[Figure: rate vs. time; growth flattens once cwnd passes max_ssthresh.]
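The per-ACK rule above, written out (a sketch, with cwnd counted in MSS-sized segments):

```python
def limited_ss_increment(cwnd, max_ssthresh, mss=1):
    """cwnd increment per ACK under Limited Slow Start."""
    if cwnd <= max_ssthresh:
        return mss                        # classic slow start: double per RTT
    return 0.5 * max_ssthresh / cwnd      # <= 0.5*max_ssthresh growth per RTT

below = limited_ss_increment(50, 100)     # still classic slow start
above = limited_ss_increment(200, 100)    # capped growth
```

Since a window's worth of ACKs arrives each RTT, the capped branch adds at most cwnd * 0.5 * max_ssthresh / cwnd = 0.5 * max_ssthresh per RTT, as the slide states.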
Related Projects
- Cray Research ('92); CASA Testbed ('94)
- Duke ('99)
- Pittsburgh Supercomputing Center
- Portland State Univ. ('00)
- Internet2 ('01)
- Web100
- Net100 (built on Web100)
Cray Research '92
TCP/IP Performance at Cray Research (Dave Borman)
Configuration:
- HIPPI between two dedicated Y-MPs with Model E IOS and Unicos 8.0
- Memory-to-memory transfer
Results:
- Direct channel-to-channel: MTU 64K -- 781 Mbps
- Through a HIPPI switch: MTU 33K -- 416 Mbps; MTU 49K -- 525 Mbps; MTU 64K -- 605 Mbps
CASA Testbed '94
Applied Network Research of the San Diego Supercomputer Center + UCSD
- Goal: delay and loss characteristics of a HIPPI-based gigabit testbed
- Link feature: blocking (HIPPI); tradeoff between high loss rate and high delay
- Conclusion: avoiding packet loss is more important than reducing delay
- Performance (delay * bandwidth = 2 MB; RFC 1323 on; Cray machines): 500 Mbps sustained TCP throughput (TTCP/Netperf)
Trapeze/IP (Duke)
Goals:
- What optimization is most useful to reduce host overheads for fast TCP?
- How fast does TCP really go, and at what cost?
Approaches:
- Zero-copy
- Checksum offloading
Result: >900 Mbps for MTU > 8K
Trapeze/IP (Duke)
Zero-copy
[Figures from www.cs.duke.edu/ari/publications/talks/freebsdcon]
Enabling High Performance Data Transfers on Hosts
By the Pittsburgh Supercomputing Center
- Enable RFC 1191 MTU discovery
- Enable RFC 1323 large windows
- OS kernel: large enough socket buffers
- Application: set its send and receive socket buffer sizes
Detailed methods for tuning various OSes.
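The application-side step can be sketched with the standard sockets API (the 4 MB request is illustrative; kernels may clamp or adjust the value, so it is read back rather than assumed):

```python
import socket

def set_socket_buffers(sock, nbytes):
    """Request large send/receive buffers; return what the kernel granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, nbytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    return (sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
snd, rcv = set_socket_buffers(s, 4 * 1024 * 1024)
s.close()
```

The buffers must be set before the connection is established, since RFC 1323 window scaling is negotiated in the SYN.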
PSU Experiment
Goals:
- Round-trip delay and TCP throughput with different window sizes
- Influence of different devices (Cisco 3508/3524/5500) and different NICs
Environment:
- OS: FreeBSD 4.0/4.1 (without 1323?), Linux, Solaris
- WAN: 155 Mbps OC-3 over a SONET MAN
- Measurement tools: ping + TTCP
PSU Experiment
- "Smaller" switches and low-end routers can easily muck things up.
- Bugs in Linux 2.2 kernels.
- Different NICs have different performance.
- A fast PCI bus (64-bit * 66 MHz) is necessary.
- Switch MTU size can make a difference (giant packets are better).
- Bigger TCP window sizes can help, but there seems to be a knee around 4 MB that is not remarked upon in the literature.
Internet2 Experiment
Goal: a single TCP connection at 700-800 Mbps over the WAN; relations among window size, MTU, and throughput.
- OS: FreeBSD 4.3-RELEASE
- Architecture: 64-bit / 66 MHz PCI + ...
- Configuration: sendspace = recvspace = 1024000
- Setup: direct connection (back-to-back) and WAN
- WAN: symmetric path: host1 - Abilene - host2
- Measurement: ping + Iperf
Internet2 Experiment
Back-to-back: no loss; found some bug in FreeBSD 4.3.
Throughput (Mbps):

Window   4 KB MTU   8 KB MTU
512K     690        855-986
1M       658        986
2M       562        986
4M       217        987
8M       93         987
16M      86         985

WAN: <= 200 Mbps; asymmetry in different directions (cache of MTU...).
Web100
- Goal: make it easy for non-experts to achieve high bandwidth.
- Method: get more information out of TCP.
- Software: measurement embedded in the kernel TCP; application layer: diagnostics / auto-tuning.
- Proposal: RFC 2012 (TCP MIB).
Net100
- Built on Web100
- Auto-tunes parameters for non-experts
- Network-aware OS
- Bulk file transport for ORNL
- Implementation of Floyd's High Speed TCP
Floyd's TCP Slow Start on Net100
www.csm.ornl.gov/~dunigan/net100/floyd.html
[Plot: cwnd over time (via Web100); RTT 80 ms, 1 MB send window, 2 MB receive window.]
Floyd's TCP AIMD on Net100
www.csm.ornl.gov/~dunigan/net100/floyd.html
[Plot annotations: RTT 87 ms; window 1000 segments; max_ss 100 segments; slow start takes 1.8 sec; multiplicative decrease at cwnd 1000: x0.33 per timeout; additive increase at cwnd 700: +8 per RTT; old TCP: 45 sec recovery.]
Trend (Mathis, Oct 2001)
TCP over long paths:

Year   Wizard     Non-Wizard   Ratio
1988   1 Mbps     300 kbps     3:1
1991   10 Mbps
1995   100 Mbps
1999   1 Gbps     3 Mbps       300:1
Related Tools
Measurement:
- Iperf
- tcpdump
- Web100
Emulation:
- Dummynet
NLANR Iperf
Features:
- Sends data from user space
- Supports IPv4/IPv6
- Supports TCP/UDP/multicast...
- Similar software: auto-tuning-enabled FTP client/server
Concern:
- Preemption by other processes in gigabit tests? (Observed in the Internet2 experiment.)
Dummynet
- Now embedded in FreeBSD
- Delay: added at the IP layer
- Loss: random loss at the IP layer
Concerns:
- Overhead
- Pattern of packet loss
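A typical dummynet setup on FreeBSD goes through ipfw; this is a sketch of the usual two-step configuration, and the bandwidth, delay, and loss numbers are illustrative, not taken from the slides:

```sh
# Send all outgoing IP traffic through dummynet pipe 1,
# then shape the pipe to 100 Mbit/s with 25 ms delay and 0.1% random loss.
ipfw add pipe 1 ip from any to any out
ipfw pipe 1 config bw 100Mbit/s delay 25ms plr 0.001
```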
Current Status in Netlab@Caltech
100 Mbps testbed in netlab:
[Diagram: two hosts -- one running TCP/IP, the other TCP/IP + dummynet -- connected by UTP cables through a 100M hub, with a monitor attached to the hub.]
Next Step
1 Gbps testbed in the lab:
[Diagram: the same two hosts (one running TCP/IP + dummynet) connected directly, with splitters tapping each direction of the link into a monitor.]