1
Evaluating NGI Performance
Matt Mathis
[email protected]
2
Evaluating NGI Performance
• How well is the NGI being used?
• Where can we do better?
3
Outline
• Why is this such a hard problem?
– Architectural reasons
– Scale
• A systematic approach
4
TCP/IP Layering
• The good news:
– TCP/IP hides the details of the network from users and applications
– This is largely responsible for the explosive growth of the Internet
5
TCP/IP Layering
• The bad news:
– All bugs and inefficiencies are hidden from users, applications and network administrators
– The only legal symptoms for any problem anywhere are connection failures or less-than-expected performance
6
Six performance problems
• IP path
– Packet routing, round trip time
– Packet reordering
– Packet losses, congestion, lame HW
• Host or end-system
– MSS negotiation, MTU discovery
– TCP sender or receiver buffer space
– Inefficient applications
7
Layering obscures problems
• Consider: trying to fix the weakest link of an invisible chain
• Typical users, system and network administrators routinely fail to “tune” their own systems
• In the future, WEB100 will help…
8
NGI Measurement Challenges
• The NGI is so large and complex that you cannot observe all of it directly.
• We want to assess both network and end-system problems
– The problems mask each other
– The users & admins can’t even diagnose their own problems
9
The Strategy
• Decouple paths from end-systems
– Test some paths using well understood end-systems
– Collect packet traces and algorithmically characterize performance problems
10
Performance is minimum of:
• TCP bulk transport (path limitation): Rate = (MSS / RTT) · (C / √p), with C ≈ 0.7
• Sender or receiver TCP buffer space: Rate = Size / RTT
• Application, CPU or other I/O limit
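The "minimum of" rule on this slide can be sketched directly in code. This is an illustrative sketch only: the constant C ≈ 0.7 is taken from the slide (it depends on ACK strategy and loss pattern), and the example MSS, RTT, loss rate, and window values are assumed, not from the talk.

```python
from math import sqrt

# Model constant from the slide (assumption; depends on ACK strategy).
C = 0.7

def path_limit_bps(mss_bytes, rtt_s, p):
    """Bulk-transport (path) limit: Rate = (MSS/RTT) * (C/sqrt(p))."""
    return (mss_bytes * 8 / rtt_s) * (C / sqrt(p))

def buffer_limit_bps(window_bytes, rtt_s):
    """TCP buffer-space limit: Rate = Size / RTT."""
    return window_bytes * 8 / rtt_s

def predicted_rate_bps(mss_bytes, rtt_s, p, window_bytes):
    # Application/CPU/I-O limits are not modeled here.
    return min(path_limit_bps(mss_bytes, rtt_s, p),
               buffer_limit_bps(window_bytes, rtt_s))

# Example: 1460-byte MSS, 70 ms RTT, 1 loss per 10^4 packets, 64 kB window.
rate = predicted_rate_bps(1460, 0.070, 1e-4, 65536)
print(f"{rate / 1e6:.1f} Mbit/s")  # buffer-limited in this example
```

With these numbers the path would support about 11.7 Mbit/s, but the 64 kB window caps the flow near 7.5 Mbit/s, so the buffer is the binding limit.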
11
Packet trace instrumentation
• Independent measures of model: Rate = (MSS / RTT) · (C / √p)
– Data rate, MSS, RTT and p
– Measure independent distributions for each
• Detect end system limitations
– Whenever the model does not fit
12
The Experiments
• Actively test a (small) collection of paths with carefully tuned systems
• Passively trace and diagnose all traffic at a small number of points to observe large collections of paths and end systems.
• [Wanted] Passively observe flow statistics for many NGI paths to take a complete census of all end systems capable of high data rates.
13
Active Path Testing
• Use uniform test systems
– Mostly Hans Werner Braun’s AMP systems
– Well tuned systems and application
– Known TCP properties
• Star topology from PSC for initial tests
– Evolve to multi-star and sparse mesh
• Use passive instrumentation
14
Typical (Active) Data
• 83 paths measured
• For the moment assume:
– All host problems have been eliminated
– All bottlenecks are due to the path
• Use traces to measure path properties
– Rate, MSS, and RTT
– Estimate window sizes and loss interval
• Sample has target selection bias
15
Data Rate
[Figure: CDF of data rate, 0–80 Mbit/s]
16
Data Rate Observations
• Only one path performed well (74 Mbit/s)
• About 15% of the paths beat 100 MB / 30 s (27 Mbit/s)
• About half of the paths were below old Ethernet rates (10 Mbit/s)
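The 27 Mbit/s figure quoted for the 100 MB / 30 s goal is a straightforward unit conversion; a quick worked check (assuming decimal megabytes, as the rounding suggests):

```python
# 100 MB transferred in 30 s, expressed as a bit rate.
goal_bps = 100e6 * 8 / 30
print(f"{goal_bps / 1e6:.0f} Mbit/s")  # ≈ 27 Mbit/s, matching the slide
```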
17
Round Trip Times
[Figure: CDF of round trip times, 0–200 ms]
18
RTT Observations
• About 25% of the RTTs are too high (PSC to San Diego is ~70 ms)
– Many reflect routing problems
– At least a few are queuing (traffic) related
19
Loss Interval (1/p)
[Figure: CDF of loss interval, 10¹–10⁶ packets between losses]
20
Loss Interval Observations
• Only a few paths do very well
– Some low-loss paths have high delay
• Only paths with fewer than 10 losses per million packets are OK
• Finding packet losses at this level can be difficult
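Inverting the bulk-transfer model shows why loss rates this low matter: solving Rate = (MSS/RTT)·(C/√p) for p gives the loss interval a path must sustain for a target rate. The function below is an illustrative sketch; C ≈ 0.7 and the example path parameters are assumptions, not figures from the talk.

```python
from math import sqrt

C = 0.7  # model constant from the earlier slide (assumption)

def required_loss_interval(target_bps, mss_bytes, rtt_s):
    """Packets between losses needed to sustain target_bps,
    from the inverted model: p = (C * MSS * 8 / (RTT * Rate))^2."""
    p = (C * mss_bytes * 8 / (rtt_s * target_bps)) ** 2
    return 1 / p

# Example: 100 Mbit/s with a 1460-byte MSS over a 70 ms path.
print(f"{required_loss_interval(100e6, 1460, 0.070):,.0f} packets between losses")
```

For these example numbers the path must lose at most roughly one packet in 700,000, which is why measuring losses at this level is hard: a test flow must move hundreds of thousands of packets just to observe a single loss.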
21
Passive trace diagnosis
• Trace Analysis and Automatic Diagnosis (TAAD)
• Passively observe user traffic to measure the network
• These are very early results
22
Example Passive Data
• Traffic is through the Pittsburgh GigaPoP
• Collected with MCI/NLANR/CAIDA OC3mon and CoralReef software
• This data set is mostly commodity traffic
• Future data sets will be self-weighted NGI samples
23
Observed and Predicted Window
• Window can be observed by looking at TCP retransmissions
• Window can be predicted from the observed interval between losses
• If they agree, the flow is path limited
– The bulk performance model fits the data
• If they don’t, the flow is end-system limited
– Observed window is probably due to buffer limits but may be due to other bottlenecks
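The agree/disagree test above can be sketched as a small classifier. This is a hypothetical illustration in the spirit of TAAD, not its actual algorithm: the window is counted in packets, C ≈ 0.7 is the model constant assumed earlier, and the 25% agreement tolerance is an arbitrary choice for the sketch.

```python
from math import sqrt

C = 0.7  # model constant (assumption, as on the model slide)

def predicted_window_pkts(p):
    """Window (in packets) the model predicts from loss rate p: W = C/sqrt(p)."""
    return C / sqrt(p)

def classify_flow(observed_window_pkts, loss_interval_pkts, tol=0.25):
    """If the observed window matches the window predicted from the
    observed loss interval, the flow is path limited; otherwise some
    end-system limit (often buffer space) is holding it back."""
    predicted = predicted_window_pkts(1 / loss_interval_pkts)
    if abs(observed_window_pkts - predicted) <= tol * predicted:
        return "path limited"
    return "end-system limited"

# A flow seeing 1 loss per 10,000 packets should run a ~70-packet window;
# one stuck at 6 packets (roughly an 8 kB buffer with a 1460-byte MSS) does not fit.
print(classify_flow(70, 10_000))  # path limited
print(classify_flow(6, 10_000))   # end-system limited
```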
24
25
Window Sizes
[Figure: Histogram of observed vs. predicted window sizes (kBytes), population counts]
26
Window Sizes
[Figure: CDF of observed vs. predicted window sizes (kBytes)]
27
Observations
• 60% of the commodity flows are path limited with window sizes smaller than 5kBytes
• Huge discontinuity at 8kBytes reflects common default buffer limits
• About 15% of the flows are affected by this limit
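The 8 kByte discontinuity caps throughput at Size/RTT no matter how clean the path is; a quick sketch of that ceiling at a few illustrative RTTs (the RTT values are assumed, not from the data set):

```python
# Throughput ceiling imposed by a fixed socket buffer: Rate = Size / RTT.
def window_ceiling_bps(window_bytes, rtt_s):
    return window_bytes * 8 / rtt_s

for rtt_ms in (10, 70, 200):
    bps = window_ceiling_bps(8192, rtt_ms / 1000)
    print(f"RTT {rtt_ms:3d} ms -> at most {bps / 1e6:.2f} Mbit/s")
```

Even on a 10 ms path an 8 kB default buffer caps a flow below 7 Mbit/s, and at cross-country RTTs the ceiling drops under 1 Mbit/s, regardless of link capacity.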
28
Need NGI host census
• Populations of end systems which have reached significant performance plateaus
• Have solved “all” performance problems
• Confirm other distributions
• Best collected within the network itself
29
Conclusion
• TCP/IP layering confounds diagnosis
– Especially with multiple problems
• Many pervasive network and host problems
– Multiple problems seem to be the norm
• Better diagnosis requires better visibility
– Ergo WEB100