Network-aware OS
DOE/MICS Project Review
August 18, 2003
Tom Dunigan thd@ornl.gov
Matt Mathis mathis@psc.edu
Brian Tierney bltierney@lbl.gov
U.S. Department of Energy Office of Science LBNL/ORNL/PSC
Roadmap
• Motivation & Background
• Net100 project components
  – Web100
  – network probes & sensors
  – protocol analysis and tuning
• Results
  – TCP tuning daemon
  – Tuning experiments
• Ongoing & future research
www.net100.org
DOE-funded project (Office of Science)
• $2.6M, 3 yrs beginning 9/01
• LBNL, ORNL, PSC, NCAR
Net100 project objectives (network-aware operating systems):
• measure, understand, and improve end-to-end network/application performance
• tune network protocols and applications (grid and bulk transfer)
• emphasis: TCP bulk transfer over high delay/bandwidth nets
Motivation
• Poor network application performance
  – High-bandwidth paths, but apps are slow
  – Is it the application? OS? network? … Yes
  – Often need a network “wizard”
• Changing: bandwidths
  – 9.6 Kbs … 1.5 Mbs … 45 … 100 … 1000 … ? Gbs
• Unchanging: TCP
  – speed of light (RTT)
  – packet size (MSS/MTU) still 1500 bytes
  – TCP congestion control
• TCP is lossy by design!
  – 2x overshoot at startup, sawtooth
  – Recovery rate proportional to MSS/RTT^2
  – recovery after a loss can be very slow on today’s high delay/bandwidth links -- unacceptable on tomorrow’s links:
    • 10 Gbs cross country: recovery time > 1 hr.!!
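A rough check of the recovery-time claim can be sketched as follows. It assumes textbook AIMD (cwnd halved on loss, then +1 packet per RTT) and a cross-country RTT of 100 ms; the RTT value and the function name are illustrative assumptions, not figures from the slides.

```python
def recovery_time_s(bandwidth_bps, rtt_s, mss_bytes=1500):
    """Seconds for textbook TCP to regain full rate after a single loss."""
    # Window (in packets) needed to fill the path: bandwidth * RTT / packet size.
    window_pkts = bandwidth_bps * rtt_s / (mss_bytes * 8)
    # After a loss cwnd is halved, then grows by one packet per RTT,
    # so it takes window/2 round trips to climb back.
    rtts_needed = window_pkts / 2
    return rtts_needed * rtt_s

# Assumed cross-country path: 10 Gb/s, 100 ms RTT, 1500-byte packets.
t = recovery_time_s(10e9, 0.100)
print(f"{t / 60:.0f} minutes")  # prints "69 minutes"
```

Because the window scales with bandwidth × RTT and each step costs one RTT, the recovery time grows with RTT^2, which matches the MSS/RTT^2 recovery rate in the bullet above.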
[Figure: ORNL to NERSC ftp over GigE/OC12 (600 Mbs), 80 ms RTT. Instantaneous and average bandwidth vs. time; annotations: early startup losses, linear recovery at 0.5 Mb/s, 8 Mbs, 40 seconds.]
TCP 101
• adaptable and fair
• flow-controlled by sender/receiver buffer sizes
• self-clocking with positive ACKs of in-sequence data
• sensitive to packet size (MTU) and RTT
• slow start -- +1 packet for each packet ACK’d (exponential)
• congestion window (cwnd) -- max packets that can be in flight
• packet loss: 3 dup ACKs or timeout (AIMD)
  – cut cwnd in half (Multiplicative Decrease)
  – add 1 packet to cwnd per RTT (Additive Increase)
• Workarounds:
  – parallel streams
  – non-TCP (UDP) applications
  – Net100 (no changes to applications)
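The slow-start and AIMD rules listed above can be illustrated with a minimal per-RTT simulation. This is a textbook sketch, not the behavior of any particular kernel; the function name, the initial ssthresh, and the loss schedule are assumptions chosen for illustration.

```python
def simulate_cwnd(rtts, ssthresh, loss_at):
    """Return cwnd (in packets) at the start of each RTT."""
    cwnd = 1
    trace = []
    for t in range(rtts):
        trace.append(cwnd)
        if t in loss_at:
            # Multiplicative decrease: halve cwnd on loss.
            cwnd = max(cwnd // 2, 1)
            ssthresh = cwnd
        elif cwnd < ssthresh:
            # Slow start: +1 packet per ACK, i.e. cwnd doubles each RTT.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Additive increase: +1 packet per RTT.
            cwnd += 1
    return trace

trace = simulate_cwnd(rtts=12, ssthresh=16, loss_at={6})
print(trace)  # [1, 2, 4, 8, 16, 17, 18, 9, 10, 11, 12, 13]
```

The trace shows the sawtooth: exponential growth to ssthresh, linear growth afterwards, and a halving at the loss in RTT 6 followed by slow linear recovery.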
Net100 components
• Web100 Linux kernel (NSF)
  – instrumented TCP stack (IETF MIB draft)
• Path characterization
  – Network Tool Analysis Framework (NTAF)
  – both active and passive measurement tools
  – database of measurements
• TCP protocol analysis and tuning
  – simulation/emulation
    • ns
    • TCP-over-UDP (atou)
    • NISTNet
  – kernel tuning extensions
  – tuning daemon
Web100
• NSF funded (PSC/NCAR/NCSA) web100.org
• Modified Linux kernel
  – instrumented kernel to read/set TCP variables for a specific flow
  – readable: RTT, counts (bytes, pkts, retransmits, dups), state (SACKs, windowscale, cwnd, ssthresh)
  – settable: buffer sizes
  – 100+ TCP variables (IETF MIB) (/proc/web100/)
• GUI to display/modify a flow’s TCP variables in real time
• API for network-aware applications or tuning daemon
• Net100 extensions:
  – additional tuning variables and algorithms
  – event notification
  – Java bandwidth tester http://firebird.ccs.ornl.gov:7123
Network Tool Analysis Framework (NTAF)
• Configure and launch network tools
  – measure bandwidth/latency (iperf, pchar, pipechar)
  – augment tools to report Web100 data
• Collect and transform tool results
  – use Netlogger to transform results to a common format
• Save results for short-term auto-tuning and archive for later analysis
  – compare predicted to actual performance
  – measure effectiveness of tools and auto-tuning
  – provide data that can be used to predict future performance
  – invaluable for comparing tools (pathload/pchar/netest)
Net100 hosts at: LBNL, ORNL, PSC, NCAR, NERSC, SLAC, UT, CERN, Amsterdam, ANL
TCP flow visualization
- Web interface for data archive and visualization
Monitoring Tool Comparison
TCP tuning
• “enable” high speed
  – need buffer = bandwidth*RTT - autotune
    ORNL/NERSC (80 ms, OC12) need 6 MB
  – faster slow-start
• avoid losses
  – modified slow-start
  – reduce bursts
  – anticipate loss (ECN, Vegas?)
  – reorder threshold
• speed recovery
  – bigger MTU or “virtual MSS”
  – modified AIMD (0.5,1) (Floyd, Kelly)
  – delayed ACKs, initial window, slow-start increment
• avoid congestion collapse, be fair (?) … intranets, QoS
• Net100: ns simulation, NISTNet emulation, “almost TCP over UDP” (atou), WAD/Internet
ns simulation: 500 Mbs link, 80 ms RTT. Packet loss early in slow start. Standard TCP with delayed ACKs takes 10 minutes to recover!
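The buffer rule on this slide (buffer = bandwidth*RTT) can be checked with a few lines of arithmetic. The sketch below assumes the nominal OC-12 line rate of 622.08 Mb/s; the slide only says "OC12", so that constant and the function name are assumptions.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to keep the path full."""
    return bandwidth_bps * rtt_s / 8

OC12_BPS = 622.08e6  # assumed nominal OC-12 line rate
need = bdp_bytes(OC12_BPS, 0.080)  # ORNL/NERSC path, 80 ms RTT
print(f"{need / 1e6:.1f} MB")  # prints "6.2 MB"
```

A socket buffer smaller than this caps throughput at buffer/RTT regardless of link speed, which is why the slide lists buffer sizing first.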
TCP Tuning Daemon
• Work-around Daemon (WAD)
  – tune unknowing sender/receiver at startup and/or during flow
  – Web100 kernel extensions
    • pre-set windowscale to allow dynamic tuning
    • uses netlink to alert daemon of socket open/close (or poll)
    • besides existing Web100 buffer tuning, new tuning parameters and algorithms
    • knobs to disable Linux 2.4 caching, burst mgt., and sendstall
  – config file with static tuning data
    • mode specifies dynamic tuning (AIMD options, NTAF buffer size, concurrent streams)
  – daemon periodically polls NTAF for fresh tuning data
  – can do out-of-kernel tuning (e.g., Floyd)
  – written in C (also Python version)
WAD config file
[bob]
src_addr: 0.0.0.0
src_port: 0
dst_addr: 10.5.128.74
dst_port: 0
mode: 1
sndbuf: 2000000
rcvbuf: 100000
wadai: 6
wadmd: 0.3
maxssth: 100
divide: 1
reorder: 9
sendstall: 0
delack: 0
floyd: 1
kellyai: 0
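The config entry above uses an INI-like "[section]" and "key: value" layout, so it can be read with Python's standard configparser. This is only a sketch of how such a file could be parsed; it is not the WAD's own parser, and the helper name load_wad_config is hypothetical.

```python
import configparser

# Config text copied from the slide.
WAD_CONFIG = """
[bob]
src_addr: 0.0.0.0
src_port: 0
dst_addr: 10.5.128.74
dst_port: 0
mode: 1
sndbuf: 2000000
rcvbuf: 100000
wadai: 6
wadmd: 0.3
maxssth: 100
divide: 1
reorder: 9
sendstall: 0
delack: 0
floyd: 1
kellyai: 0
"""

def load_wad_config(text):
    """Parse WAD-style config text (hypothetical helper, not part of the WAD)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser

cfg = load_wad_config(WAD_CONFIG)
bob = cfg["bob"]
print(bob["dst_addr"], bob.getint("sndbuf"), bob.getfloat("wadmd"))
```

Each section names one tuning entry; a daemon could match a new flow's destination address against dst_addr and apply the listed values.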
Experimental results
• Evaluating the tuning daemon in the wild
  – emphasis: bulk transfers over high delay/bandwidth nets (Internet2, ESnet)
  – tests over: 10GigE/OC192, OC48, OC12, OC3, ATM/VBR, GigE, FDDI, 100/10T, cable, ISDN, wireless (802.11b), dialup
  – tests over NISTNet testbed (speed, loss, delay)
• Various TCP tuning options
  – buffer tuning (static and dynamic/NTAF)
  – AIMD mods (including Floyd, Kelly, static, virtual MSS, and autotuning)
  – slow-start mods
  – parallel streams vs single tuned
[Figure label: NISTNet host]
Buffer tuning
Classic buffer tuning
• network-challenged app. gets 10 Mbs
• same app., WAD/NTAF tuned buffer gets 143 Mbs
Autotuning buffers (kernel)
• Linux 2.4, Feng’s Dynamic Right Sizing
• Net100 autotuning
  • receiver estimates RTT
  • receiver advertises window 2 times data recv’d in RTT
  • buffer size grows dynamically to 2x bandwidth*RTT
  • separate application buffers from kernel buffers
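The receiver-side autotuning rule above (advertise twice the data received in the last RTT) can be sketched as a per-RTT loop. This is a simplified model, not the Net100 kernel code: it assumes the sender always fills the advertised window up to the path's bandwidth*RTT, and the initial window, buffer cap, and function name are illustrative assumptions.

```python
def autotune_receiver(bdp_bytes, initial_window=65536,
                      max_buffer=16 * 2**20, rounds=20):
    """Return the advertised window (bytes) after each RTT."""
    window = initial_window
    history = []
    for _ in range(rounds):
        # Data arriving in one RTT is limited by the window and by the path.
        received = min(window, bdp_bytes)
        # Advertise 2x what arrived in the last RTT; never shrink, never exceed the cap.
        window = min(max(window, 2 * received), max_buffer)
        history.append(window)
    return history

# Assumed path: about 6.22 MB bandwidth*RTT (OC12, 80 ms).
hist = autotune_receiver(6_220_000)
print(hist[-1])  # prints 12440000, i.e. 2x bandwidth*RTT
```

The window doubles each RTT while the receiver is the bottleneck, then settles at twice the bandwidth*RTT product once the path itself limits the data rate.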
ORNL to PSC, OC192, 30 ms RTT
ORNL to PSC, OC12, 80ms RTT
Speeding recovery
Amsterdam-Chicago GigE via 10GigE, 100 ms RTT
UDP burst
Selectable TCP AIMD algorithms:
• Floyd HS TCP: as cwnd grows, increase AI and decrease MD; do the reverse when cwnd shrinks
• Kelly scalable TCP: use MD of 1/8 instead of 1/2 and add a percentage of cwnd (e.g., 1%) each RTT
Virtual MSS
• tune TCP’s additive increase (WAD_AI)
• add k segments per RTT during recovery
• k=6 like GigE jumbo frame, but:
  • interrupt rate not reduced
  • doesn’t do k segments for initial window
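To compare how the AIMD options above affect recovery, the sketch below counts the RTTs needed to return to the pre-loss window after a single loss. The 4000-packet window is an assumed value (roughly a 600 Mbs, 80 ms path with 1500-byte packets), the function name is hypothetical, and Floyd's HS TCP is omitted because its parameters depend on a cwnd-indexed table not given on the slide.

```python
def rtts_to_recover(cwnd, md, increase):
    """RTTs to climb back to cwnd after one loss.

    md: fraction of cwnd removed on loss; increase: function cwnd -> packets added per RTT.
    """
    w = cwnd * (1 - md)
    rtts = 0
    while w < cwnd:
        w += increase(w)
        rtts += 1
    return rtts

W = 4000  # assumed pre-loss window in packets

standard = rtts_to_recover(W, 0.5, lambda w: 1)             # AIMD (0.5, 1)
virtual_mss = rtts_to_recover(W, 0.5, lambda w: 6)          # virtual MSS, k = 6
scalable = rtts_to_recover(W, 0.125, lambda w: 0.01 * w)    # MD 1/8, +1% of cwnd per RTT

print(standard, virtual_mss, scalable)  # prints "2000 334 14"
```

At 80 ms per RTT these counts correspond to roughly 160 s, 27 s, and about 1 s, which shows why both a smaller decrease and a larger increase matter on long, fast paths.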
WAD tuning
Modified slow-start and AI
• often losses in slow-start
• WAD tuned Floyd slow-start and fixed AI (6)
WAD-tuned AIMD and slow-start
• parallel streams AIMD (1/(2k),k)
  • exploit TCP’s fairness
• WAD-tuned single stream (0.125,4)
• WAD-tuned single stream (0.125,4) + Floyd slow-start
ORNL to NERSC, OC12, 80 ms RTT
ORNL to CERN, OC12, 150ms RTT
Workaround: parallel streams
• Takes advantage of TCP’s fairness
• Faster startup, k buffers
• faster recovery
  – often only 1 stream loses a packet
  – MD: 1/(2k) rather than 1/2
  – AI: k times faster linear phase
• BUT
  – requires rewrite of applications
  – how many streams? Buffer size?
• GridFTP, bbftp, psocket lib
[Figure: Alice and Bob sharing a link; clever Alice uses 3 streams. Bad girl ...]
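The aggregate behavior of k parallel streams described above can be written down directly. The sketch assumes k equal standard-TCP streams, a loss that hits only one of them, and an idealized bottleneck shared equally per flow; the function names and the 4000-packet total window are illustrative assumptions.

```python
import math

def aggregate_aimd(k):
    """Effective (MD, AI) of k equal parallel streams when one stream sees a loss."""
    md = 0.5 / k  # one of k streams halves its window: aggregate drops by 1/(2k)
    ai = 1.0 * k  # every stream adds one packet per RTT
    return md, ai

def rtts_to_recover(total_cwnd, k):
    """RTTs for the aggregate window to return to total_cwnd after one loss."""
    md, ai = aggregate_aimd(k)
    return math.ceil(total_cwnd * md / ai)

def fair_share(my_streams, other_streams):
    """Fraction of a bottleneck obtained if every stream gets an equal share."""
    return my_streams / (my_streams + other_streams)

print(aggregate_aimd(4))         # prints "(0.125, 4.0)"
print(rtts_to_recover(4000, 1))  # prints 2000
print(rtts_to_recover(4000, 4))  # prints 125
print(fair_share(3, 1))          # prints 0.75
```

With k = 4 the aggregate acts like a single stream tuned to (0.125, 4), and three streams against one take three quarters of the link, which is the fairness concern raised by the Alice and Bob figure.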
GridFTP tuning
Can tuned single stream compete with parallel streams?
Mostly not with “equivalence” tuning, but sometimes…. Parallel streams have slow-start advantage.
WAD can divide buffer among concurrent flows: fairer/faster? Tests inconclusive so far…. Testing on real Internet is problematic.
Is there a “congestion metric”? Per unit of time?

Flow      Mbs  congestion  re-xmits
untuned    28       4         30
tuned      74       5        295
parallel   52      30        401

untuned    25       7         25
tuned      67       2        420
parallel   88      17        440

Data/plots from Web100 tracer
Buffers: 64K I/O, 4MB TCP
Ongoing Net100 research
– more user-friendly WAD
– invited to submit Web100/Net100 mods to Linux 2.6
– port of Web100 to FreeBSD (Web100 team)
  • base for AIX, SGI, Solaris, OSF
– port to ORNL Cray X1
  • Linux network front-end
  • added Net100 kernel, 4x improvement in wide-area TCP!
– TCP Vegas
  • Vegas avoids loss (if RTT increasing, Vegas backs off)
  • can be configured to compete with standard TCP (Feng)
  • CalTech’s FAST
– comparison with other “work arounds”
  • parallel streams
  • non-TCP (SABUL, FOBS, TSUNAMI, RBUDP, SCTP)
– additional accelerants
  • slow-start initial/increment
  • reorder resilience
  • delayed ACKs
TCP tuning for other OS’s
Reorder threshold
• seeing more out of order packets (future: multipath?)
• WAD tune a bigger reorder threshold for path
  • 40x improvement!
• Linux 2.4 does a good job already
  • adjusts and caches reorder threshold
  • “undo” congestion avoidance
Delayed ACKs
• WAD could turn off delayed ACKs: 2x improvement in recovery rate and slow-start
• Linux 2.4 already turns off delayed ACKs for initial slow-start
ns simulation: 500 Mbs link, 80 ms RTT. Packet loss early in slow-start. Standard TCP with delayed ACKs takes 10 minutes to recover! NOTE: aggressive static AIMD (Floyd pre-tune)
LBL to ORNL (using our TCP-over-UDP): dup3 case had 289 retransmits, but all were unneeded!
Planned Net100 research
– improve ease of use (WAD WAND)
– analyze effectiveness/fairness of current tuning options
  • simulation
  • emulation
  • on the net (systematic tests)
– NTAF probes -- characterizing a path to tune a flow
  • integration with SCNM
  • monitoring applications with Web100
  • latest probe tools
– additional tuning algorithms
  • identify non-congestive loss, ECN?
  • tuning for dedicated path (lambda/10GigE)
– parallel/multipath selection/tuning
– WAD-to-WAD tuning
– WAD caching
– SGI/Linux
– jumbo frame experiments… the quest for bigger and bigger MTUs
Interactions
• Scientific applications
  – SciDAC supernova and global climate
  – Data grids (CERN, SLAC)
  – Radio telescopes (MIT)
• Middleware
  – Globus/gridFTP
  – HSI/HPSS
• Network measurement
  – Internet2 end-to-end
  – Pinger (Cottrell)
  – Claffy/Dovrolis pathload
  – netest (Guojun)
  – SCNM
• Protocol research
  – Dynamic Right-Sizing (Feng)
  – HS TCP (Floyd)
  – Scalable TCP (Kelly)
  – TCP Vegas (Feng, Low)
  – Tsunami/SABUL/FOBS/RBUDP
  – parallel streams (Hacker)
• OS vendors
  – Linux
  – IBM AIX/Linux
  – Cray X1
• Talks/papers/software: www.net100.org
Summary
• Novel approaches
  – non-invasive dynamic tuning of legacy applications
  – out-of-kernel tuning
  – using TCP to tune TCP
  – tuning on a per flow/destination basis, based on recent path metrics or policy (QoS)
• Effective evaluation framework
  – protocol analysis and tuning
  – network/application/OS debugging
  – path characterization tools, archive, and visualization tools
• Performance improvements
  – WAD tuned:
    • buffers 10x
    • AIMD 2x to 10x
    • delayed ACK 2x
    • slowstart 3x
    • reorder 40x
• Timely -- needed for science on today’s and tomorrow’s networks