1
Evaluating NGI Performance
Matt Mathis
[email protected]
2
Evaluating NGI Performance
• How well is the NGI being used?
• Where can we do better?
3
Outline
• Why is this such a hard problem?
– Architectural reasons
– Scale
• A systematic approach
4
TCP/IP Layering
• The good news:
– TCP/IP hides the details of the network from users and applications
– This is largely responsible for the explosive growth of the Internet
5
TCP/IP Layering
• The bad news:
– All bugs and inefficiencies are hidden from users, applications and network administrators
– The only legal symptoms for any problem anywhere are connection failures or less-than-expected performance
6
Six performance problems
• IP path
– Packet routing, round trip time
– Packet reordering
– Packet losses, congestion, lame HW
• Host or end-system
– MSS negotiation, MTU discovery
– TCP sender or receiver buffer space
– Inefficient applications
7
Layering obscures problems
• Consider: trying to fix the weakest link of an invisible chain
• Typical users, system and network administrators routinely fail to “tune” their own systems
• In the future, WEB100 will help…
8
NGI Measurement Challenges
• The NGI is so large and complex that you cannot observe all of it directly.
• We want to assess both network and end-system problems
– The problems mask each other
– The users & admins can’t even diagnose their own problems
9
The Strategy
• Decouple paths from end-systems
– Test some paths using well understood end-systems
– Collect packet traces and algorithmically characterize performance problems
10
Performance is minimum of:
• TCP bulk transport (path limitation): Rate = (MSS / RTT) · (C / √p), with C ≈ 0.7
• Sender or receiver TCP buffer space: Rate = Size / RTT
• Application, CPU or other I/O limit
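The "minimum of" rule on this slide can be sketched directly in code. This is an illustrative sketch only: the constant C ≈ 0.7 is taken from the slide (it depends on ACK strategy and loss pattern), and the example MSS, RTT, loss rate, and window values are assumed, not from the talk.

```python
from math import sqrt

# Model constant from the slide (assumption; depends on ACK strategy).
C = 0.7

def path_limit_bps(mss_bytes, rtt_s, p):
    """Bulk-transport (path) limit: Rate = (MSS/RTT) * (C/sqrt(p))."""
    return (mss_bytes * 8 / rtt_s) * (C / sqrt(p))

def buffer_limit_bps(window_bytes, rtt_s):
    """TCP buffer-space limit: Rate = Size / RTT."""
    return window_bytes * 8 / rtt_s

def predicted_rate_bps(mss_bytes, rtt_s, p, window_bytes):
    # Application/CPU/I-O limits are not modeled here.
    return min(path_limit_bps(mss_bytes, rtt_s, p),
               buffer_limit_bps(window_bytes, rtt_s))

# Example: 1460-byte MSS, 70 ms RTT, 1 loss per 10^4 packets, 64 kB window.
rate = predicted_rate_bps(1460, 0.070, 1e-4, 65536)
print(f"{rate / 1e6:.1f} Mbit/s")  # buffer-limited in this example
```

With these numbers the path would support about 11.7 Mbit/s, but the 64 kB window caps the flow near 7.5 Mbit/s, so the buffer is the binding limit.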
11
Packet trace instrumentation
• Independent measures of model: Rate = (MSS / RTT) · (C / √p)
– Data rate, MSS, RTT and p
– Measure independent distributions for each
• Detect end system limitations
– Whenever the model does not fit
12
The Experiments
• Actively test a (small) collection of paths with carefully tuned systems
• Passively trace and diagnose all traffic at a small number of points to observe large collections of paths and end systems.
• [Wanted] Passively observe flow statistics for many NGI paths to take a complete census of all end systems capable of high data rates.
13
Active Path Testing
• Use uniform test systems
– Mostly Hans Werner Braun’s AMP systems
– Well tuned systems and application
– Known TCP properties
• Star topology from PSC for initial tests
– Evolve to multi-star and sparse mesh
• Use passive instrumentation
14
Typical (Active) Data
• 83 paths measured
• For the moment assume:
– All host problems have been eliminated
– All bottlenecks are due to the path
• Use traces to measure path properties
– Rate, MSS, and RTT
– Estimate window sizes and loss interval
• Sample has target selection bias
15
Data Rate
[Figure: CDF of data rate, 0–80 Mbit/s]
16
Data Rate Observations
• Only one path performed well (74 Mbit/s)
• About 15% of the paths beat 100 MB / 30 s (27 Mbit/s)
• About half of the paths were below old Ethernet rates (10 Mbit/s)
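The 27 Mbit/s figure quoted for the 100 MB / 30 s goal is a straightforward unit conversion; a quick worked check (assuming decimal megabytes, as the rounding suggests):

```python
# 100 MB transferred in 30 s, expressed as a bit rate.
goal_bps = 100e6 * 8 / 30
print(f"{goal_bps / 1e6:.0f} Mbit/s")  # ≈ 27 Mbit/s, matching the slide
```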
17
Round Trip Times
[Figure: CDF of round trip times, 0–200 ms]
18
RTT Observations
• About 25% of the RTTs are too high (PSC to San Diego is ~70 ms)
– Many reflect routing problems
– At least a few are queuing (traffic) related
19
Loss Interval (1/p)
[Figure: CDF of loss interval, 10¹–10⁶ packets between losses]
20
Loss Interval Observations
• Only a few paths do very well
– Some low-loss paths have high delay
• Only paths with fewer than 10 losses per million packets are OK
• Finding packet losses at this level can be difficult
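Inverting the bulk-transfer model shows why loss rates this low matter: solving Rate = (MSS/RTT)·(C/√p) for p gives the loss interval a path must sustain for a target rate. The function below is an illustrative sketch; C ≈ 0.7 and the example path parameters are assumptions, not figures from the talk.

```python
from math import sqrt

C = 0.7  # model constant from the earlier slide (assumption)

def required_loss_interval(target_bps, mss_bytes, rtt_s):
    """Packets between losses needed to sustain target_bps,
    from the inverted model: p = (C * MSS * 8 / (RTT * Rate))^2."""
    p = (C * mss_bytes * 8 / (rtt_s * target_bps)) ** 2
    return 1 / p

# Example: 100 Mbit/s with a 1460-byte MSS over a 70 ms path.
print(f"{required_loss_interval(100e6, 1460, 0.070):,.0f} packets between losses")
```

For these example numbers the path must lose at most roughly one packet in 700,000, which is why measuring losses at this level is hard: a test flow must move hundreds of thousands of packets just to observe a single loss.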
21
Passive trace diagnosis
• Trace Analysis and Automatic Diagnosis (TAAD)
• Passively observe user traffic to measure the network
• These are very early results
22
Example Passive Data
• Traffic is through the Pittsburgh GigaPoP
• Collected with MCI/NLANR/CAIDA OC3mon and CoralReef software
• This data set is mostly commodity traffic
• Future data sets will be self-weighted NGI samples
23
Observed and Predicted Window
• Window can be observed by looking at TCP retransmissions
• Window can be predicted from the observed interval between losses
• If they agree, the flow is path limited
– The bulk performance model fits the data
• If they don’t, the flow is end-system limited
– Observed window is probably due to buffer limits but may be due to other bottlenecks
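The agree/disagree test above can be sketched as a small classifier. This is a hypothetical illustration in the spirit of TAAD, not its actual algorithm: the window is counted in packets, C ≈ 0.7 is the model constant assumed earlier, and the 25% agreement tolerance is an arbitrary choice for the sketch.

```python
from math import sqrt

C = 0.7  # model constant (assumption, as on the model slide)

def predicted_window_pkts(p):
    """Window (in packets) the model predicts from loss rate p: W = C/sqrt(p)."""
    return C / sqrt(p)

def classify_flow(observed_window_pkts, loss_interval_pkts, tol=0.25):
    """If the observed window matches the window predicted from the
    observed loss interval, the flow is path limited; otherwise some
    end-system limit (often buffer space) is holding it back."""
    predicted = predicted_window_pkts(1 / loss_interval_pkts)
    if abs(observed_window_pkts - predicted) <= tol * predicted:
        return "path limited"
    return "end-system limited"

# A flow seeing 1 loss per 10,000 packets should run a ~70-packet window;
# one stuck at 6 packets (roughly an 8 kB buffer with a 1460-byte MSS) does not fit.
print(classify_flow(70, 10_000))  # path limited
print(classify_flow(6, 10_000))   # end-system limited
```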
24
25
Window Sizes
[Figure: Histogram of observed vs. predicted window sizes (kBytes), population counts]
26
Window Sizes
[Figure: CDF of observed vs. predicted window sizes (kBytes)]
27
Observations
• 60% of the commodity flows are path limited with window sizes smaller than 5kBytes
• Huge discontinuity at 8kBytes reflects common default buffer limits
• About 15% of the flows are affected by this limit
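The 8 kByte discontinuity caps throughput at Size/RTT no matter how clean the path is; a quick sketch of that ceiling at a few illustrative RTTs (the RTT values are assumed, not from the data set):

```python
# Throughput ceiling imposed by a fixed socket buffer: Rate = Size / RTT.
def window_ceiling_bps(window_bytes, rtt_s):
    return window_bytes * 8 / rtt_s

for rtt_ms in (10, 70, 200):
    bps = window_ceiling_bps(8192, rtt_ms / 1000)
    print(f"RTT {rtt_ms:3d} ms -> at most {bps / 1e6:.2f} Mbit/s")
```

Even on a 10 ms path an 8 kB default buffer caps a flow below 7 Mbit/s, and at cross-country RTTs the ceiling drops under 1 Mbit/s, regardless of link capacity.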
28
Need NGI host census
• Populations of end systems which have reached significant performance plateaus
• Have solved “all” performance problems
• Confirm other distributions
• Best collected within the network itself
29
Conclusion
• TCP/IP layering confounds diagnosis
– Especially with multiple problems
• Many pervasive network and host problems
– Multiple problems seem to be the norm
• Better diagnosis requires better visibility
– Ergo WEB100