1

Correlating Internet Performance & Route Changes to Assist in Trouble-shooting from an End-user Perspective

Les Cottrell, Connie Logg, Jiri Navratil, SLAC

Passive and Active Monitoring Workshop, Antibes, Juan-les-Pins, France, April 19-20, 2004

www.slac.stanford.edu/grp/scs/net/talk03/pam04.ppt

Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), also supported by IUPAP


2

Outline

Set of integrated measurement tools to aid in troubleshooting for the end “user”

• Traceroute measurements/analysis

• Topology visualization

• Lightweight bandwidth estimation

• Overall visualization

• Level change anomaly automated detection

• Correlation of performance & route changes

3

Traceroute measurement

• Every 10 minutes for each host
  – Run standard traceroute: 2 sec timeout, 1 query/hop, <= 30 hops
    • For some hosts use ICMP traceroute
      – End host responds (7/40)
      – Intermediate host responds (1/40)
    • Two cases UDP probes better than ICMP
    • One case neither ICMP nor UDP probes help
  – Both forward & reverse (use ssh for reverse route)
    • Need ssh access to remote host for reverse traceroute
      – Else no reverse route (not a disaster)
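The probe settings above map onto common traceroute options. The following is a minimal sketch, not the tool the authors ran: the function names are hypothetical, and it assumes a traceroute binary that accepts `-w` (wait in seconds), `-q` (queries per hop), `-m` (max hops) and `-I` (ICMP probes, which may need elevated privileges on some systems).

```python
import subprocess


def build_traceroute_cmd(host, use_icmp=False, timeout_s=2, queries=1, max_hops=30):
    """Build the argument list for one traceroute run (hypothetical helper)."""
    cmd = ["traceroute",
           "-w", str(timeout_s),   # seconds to wait for each reply
           "-q", str(queries),     # probes sent per hop
           "-m", str(max_hops)]    # give up after this many hops
    if use_icmp:
        cmd.append("-I")           # ICMP echo probes instead of the UDP default
    cmd.append(host)
    return cmd


def run_traceroute(host, **options):
    """Run traceroute and return its raw text output for later comparison."""
    result = subprocess.run(build_traceroute_cmd(host, **options),
                            capture_output=True, text=True, check=False)
    return result.stdout
```

A scheduler such as cron could call `run_traceroute` every 10 minutes per host; the reverse direction would run the same command on the remote host over ssh.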

4

Significant changes

• Compare current and previous traceroutes:
  – If traceroute reports “unknown host” => unknown (!)
  – Else for each hop/node
    • If both current & previous hops have valid IP addresses
      – If different, i.e. some kind of route change has occurred
        » If IPs same for first 3 octets => same subnet/colo (:)
        » Else if IPs in same AS => same AS (a)
        » Else significant change => assign unique route number
          If only one hop different => color route # orange, else color route # red
      – Elseif 30 hops => no route change but last hop unreachable (|)
        » If last hop not pingable => color red (|)
      – Else => no route change (●)
    • Elseif one or both IPs are “*” (i.e. router does not respond & traceroute reports “*”) => route change unclear (*)
• If “Icmp checksum is wrong” color character orange
• If significant bandwidth change color cell
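The per-hop comparison can be sketched as below. This is an illustration under assumptions, not the authors' code: `classify_hop`, `route_color` and the `as_of` mapping (IP address to AS number) are hypothetical names, "." stands in for the filled dot, "#" marks a significant change that would receive a route number, and the 30-hop/unreachable and checksum cases are left out.

```python
def classify_hop(cur, prev, as_of=None):
    """Return a one-character code comparing one hop of two traceroutes."""
    as_of = as_of or {}                 # hypothetical IP -> AS number lookup
    if cur == "*" or prev == "*":
        return "*"                      # router silent: route change unclear
    if cur == prev:
        return "."                      # no route change at this hop
    if cur.split(".")[:3] == prev.split(".")[:3]:
        return ":"                      # first 3 octets match: same subnet/colo
    if cur in as_of and as_of.get(cur) == as_of.get(prev):
        return "a"                      # different subnet but same AS
    return "#"                          # significant change


def classify_route(cur_hops, prev_hops, as_of=None):
    """Classify hop by hop; zip() stops at the shorter route."""
    return [classify_hop(c, p, as_of) for c, p in zip(cur_hops, prev_hops)]


def route_color(codes):
    """One significant hop => orange, more than one => red, none => no color."""
    significant = codes.count("#")
    if significant == 0:
        return None
    return "orange" if significant == 1 else "red"
```

For example, `classify_route(["192.0.2.1", "198.51.100.99", "*"], ["192.0.2.1", "198.51.100.7", "203.0.113.9"])` yields one code per hop, which a table renderer could then collapse into a single cell.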

5

Route table

• Compact so can see many routes at once

History navigation

Multiple route changes (due to GEANT), later restored to original route

Available bandwidth

Raw traceroute logs for debugging

Textual summary of traceroutes for email to ISP

Description of route numbers with date last seen

User readable (web table) routes for this host for this day

Route # at start of day, gives idea of route stability

Mouseover for hops & RTT

6

Another example

TCP probe type

Host not pingable

Intermediate router does not respond

ICMP checksum error

Level change

Get AS information for routes

7

Topology

• Choose times and hosts and submit request
• Nodes colored by ISP
• Mouseover shows node names
• Click on node to see subroutes
• Click on end node to see its path back
• Also can get raw traceroutes with ASes

[Figure: topology map. Node labels: SLAC, ESnet, GEANT, JAnet, CESnet, IN2P3, CLRC, DL. Callouts: Alternate route. Axis: Hour of day]

8

Available bandwidth

• Uses ABwE/Abing (packet pair dispersion)
  – Needs server at remote end or ssh to launch server
  – Fast (< 1 sec)
  – Lightweight
    • < 40 packets for both forward & reverse estimates (5800 Bytes)
  – Uses min delay for capacity
  – Inter packet dispersion for cross-traffic
  – Available BW = Capacity (min RTT) - Cross-traffic (var)
    • Good agreement with other methods
    • Even if poor absolute agreement (25% cases) can spot changes
  – Also provides RTT
• Make measurements to about 60 hosts at 5 minute intervals (deployed in IEPM, MonALISA, PlanetLab)
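The relation Available BW = Capacity - Cross-traffic can be illustrated with a simplified packet-pair calculation. This is a sketch under stated assumptions, not the ABwE/Abing implementation: it assumes capacity is taken from the smallest observed inter-packet dispersion and that the cross-traffic share equals the fraction of the mean dispersion in excess of that minimum. The function name and inputs are hypothetical.

```python
def estimate_available_bw(dispersions_s, packet_bits):
    """Return (capacity, cross_traffic, available) in bits/s.

    dispersions_s: arrival-time gaps (seconds) between the two packets of
                   each probe pair, as measured at the receiver.
    packet_bits:   size of one probe packet in bits.
    """
    if not dispersions_s or min(dispersions_s) <= 0:
        raise ValueError("need at least one positive dispersion sample")
    d_min = min(dispersions_s)                      # least-queued pair
    d_mean = sum(dispersions_s) / len(dispersions_s)
    capacity = packet_bits / d_min                  # bottleneck capacity estimate
    # Assumed model: extra spacing beyond d_min is caused by cross-traffic.
    cross_traffic = capacity * (d_mean - d_min) / d_mean
    return capacity, cross_traffic, capacity - cross_traffic
```

With 1500-byte probes (12000 bits) and dispersions of 12, 12 and 24 microseconds, this model gives a capacity near 1 Gbit/s with a quarter of it consumed by cross-traffic.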

9

Available Bandwidth

• From SLAC to Caltech Mar 19, 2004

[Plot labels: Dynamic bandwidth capacity (DBC); Cross-traffic; Available bandwidth = DBC - X-traffic; Iperf]

10

Achievable throughput & file transfer

• IEPM-BW
  – High impact (iperf, bbftp, GridFTP …) measurements at 90 ± 15 min intervals

[Plot labels: Select focal area; Fwd route change; Rev route change; Min RTT; Iperf; bbftp; iperf1; abing]

11

Put it all together

• Two examples
  – Agreement of iperf & abing
  – Route changes and available bandwidth

12

[Plot: ABwE and Iperf, 28 days bandwidth history]

28 days bandwidth history. During this time we can see several different situations caused by different routing from SLAC to Caltech:

• Drop to 100 Mbits/s caused by routing (BGP) errors
• Drop to 622 Mbits/s path
• Back to new CENIC path
• New CENIC path 1000 Mbits/s
• Reverse routing changes
• Forward routing changes

Scatter plot graphs of Iperf versus ABwE on different paths (range 20–800 Mbits/s) showing agreement of the two methods (28 days history)

[Other plot labels: RTT; Bbftp; Iperf 1 stream]

13

Changes in network topology (BGP) can result in dramatic changes in performance

• Snapshot of traceroute summary table (axes: hour, remote host)
• Samples of traceroute trees generated from the table
• ABwE measurement one/minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am (y-axis: Mbits/s)
  – Dynamic BW capacity (DBC)
  – Cross-traffic (XT)
  – Available BW = (DBC - XT)
• Drop in performance (from original path SLAC-CENIC-Caltech to SLAC-ESnet-LosNettos (100 Mbps)-Caltech)
• Back to original path
• Changes detected by IEPM-Iperf and ABwE
• ESnet-LosNettos segment in the path (100 Mbits/s)

Notes:
1. Caltech misrouted via Los-Nettos 100 Mbps commercial net 14:00-17:00
2. ESnet/GEANT working on routes from 2:00 to 14:00
3. A previous occurrence went un-noticed for 2 months
4. Next step is to auto detect and notify

14

Automatic Step change Detection

• Too many graphs to review each morning!
• Motivated by drop in bandwidth between SLAC & Caltech
  – Started late August 2003
  – Reduced achievable throughput by factor of 5
  – Not noticed until October 2003
  – Caused by faulty routing over commercial network
  – After notifying ISP, it was fixed in 4 hours!
  – See http://www.slac.stanford.edu/grp/scs/net/case/caltech/ for details

[Plot: SLAC to Caltech achievable throughput, April to November 2003; label: Started]

15

Automatic available bandwidth step change detection

• Still developing, evolving from earlier work:
  – Arithmetic weighted moving averages
  – NLANR work, see http://byerley.cs.waikato.ac.nz/~tonym/papers/event.pdf
• Roughly speaking:
  – Has a history buffer to describe past behavior
    • History buffer duration currently 600 mins
  – Plus a trigger buffer of data suggesting a change
    • Trigger buffer duration (evaluating typically 10-60 mins) indicates how long the change has to occur for
  – History mean (μ) and std. dev. (σ) used by trigger selector
    • If new_value outside μ ± sensitivity*σ add to trigger buffer
    • If new_value outside μ ± 2*sensitivity*σ then also an outlier (don’t add to stats)
    • Else goes in history buffer
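The trigger selector can be sketched as a three-way classification. This is a minimal illustration, assuming the band is the history mean plus or minus sensitivity times the history standard deviation (the symbols were lost from the slide text, so that reading is an assumption); `classify_value` is a hypothetical name.

```python
def classify_value(new_value, hist_mean, hist_std, sensitivity=2.0):
    """Decide which buffer a new measurement belongs to."""
    deviation = abs(new_value - hist_mean)
    if deviation > 2 * sensitivity * hist_std:
        return "trigger+outlier"   # suggests a change, but excluded from stats
    if deviation > sensitivity * hist_std:
        return "trigger"           # suggests a change
    return "history"               # ordinary value, describes past behavior
```

With a history mean of 100, a standard deviation of 5 and sensitivity 2, values within 90 to 110 go to the history buffer, values beyond that go to the trigger buffer, and values outside 80 to 120 are also treated as outliers.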

16

Algorithm

• If this is a trigger value, compare with μ and save direction of change
• If this is a trigger and the direction has changed, reset trigger buffer
  – Move trigger data to history buffer, recalculate stats, clear trigger buffer
• If trigger buffer full, calculate trigger mean μ_t and std. dev. σ_t
  – If (μ - μ_t)/μ > threshold then raise an alert & reset trigger buffer
  – Else remove oldest value from trigger buffer
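The steps above can be sketched as a small detector class. This is an illustration under assumptions rather than the authors' implementation: buffer sizes are counted in samples instead of minutes, the relative difference between history mean and trigger mean is compared in absolute value against the threshold, the separate outlier rule is omitted, and the class and method names are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class StepDetector:
    def __init__(self, history_len=120, trigger_len=6, sensitivity=2.0, threshold=0.4):
        self.history = deque(maxlen=history_len)  # describes past behavior
        self.trigger = []                         # values suggesting a change
        self.trigger_len = trigger_len
        self.sensitivity = sensitivity
        self.threshold = threshold
        self.direction = 0                        # +1 up, -1 down, 0 unset

    def _reset_trigger(self):
        # "Reset" = move trigger data into history, then clear the trigger buffer.
        self.history.extend(self.trigger)
        self.trigger = []

    def add(self, value):
        """Feed one measurement; return True when a step change alert fires."""
        if len(self.history) < 2:                 # not enough data for stats yet
            self.history.append(value)
            return False
        mu, sigma = mean(self.history), pstdev(self.history)
        deviation = value - mu
        if abs(deviation) <= self.sensitivity * sigma:
            self.history.append(value)            # ordinary value
            return False
        direction = 1 if deviation > 0 else -1
        if self.trigger and direction != self.direction:
            self._reset_trigger()                 # direction flipped
        self.direction = direction
        self.trigger.append(value)
        if len(self.trigger) < self.trigger_len:
            return False
        mu_t = mean(self.trigger)
        if mu != 0 and abs(mu - mu_t) / abs(mu) > self.threshold:
            self._reset_trigger()
            return True                           # sustained level change
        self.trigger.pop(0)                       # drop oldest trigger value
        return False
```

Feeding a steady series around 100 Mbits/s followed by several consecutive readings near 50 Mbits/s makes `add` return True once the trigger buffer fills, while a single dip or a dip followed by a spike does not alert.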

17

Examples

SLAC to Caltech available bandwidth April 6-8, 2004

[Plot labels: Alerts; Route change]

History duration: 600 mins, trigger duration: 30 mins, threshold: 40%, sensitivity: 2

With trigger duration: 60 only see one alert, with trigger duration: 10 catch alerts

SLAC to NIKHEF (Amsterdam)

[Plot labels: AvailBW (Mbit/s); Route changes SLAC - NIKHEF; Unreachable]

18

BW vs Route changes

• Route & throughput changes from 11/28/03 thru 2/2/04
  – Most (80%) route changes do not result in throughput change
  – About half throughput changes are due to route changes

Location (# nodes)   # route chgs   # with thru inc.   # with thru decr.   # thru chgs   # thru with rte   # thru chg w/o rte
Europe (8)           370            2                  4                   10            6                 4
Canada & US (21)     1206           24                 25                  71            49                22
Japan (13)           142            2                  2                   9             4                 5

19

More Information

• ABwE:
  – http://moat.nlanr.net/PAM2003/PAM2003papers/3781.pdf
• IEPM
  – http://www-iepm.slac.stanford.edu/
  – http://moat.nlanr.net/PAM2003/PAM2003papers/3768.pdf
• Traceroute examples:
  – www.slac.stanford.edu/comp/net/iepmlite/tracesummaries/today.html
• Step change analysis
  – http://byerley.cs.waikato.ac.nz/~tonym/papers/event.pdf