MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance


Page 1: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

MAGGIE

Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Warren Matthews

Stanford Linear Accelerator Center (SLAC)

Page 2: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Abstract

The ambitious distributed computing goals of data-intensive science require careful study of end-to-end performance across the networks involved.

Since 1995, the Internet End-to-end Performance Monitoring (IEPM) group at the Stanford Linear Accelerator Center (SLAC) has been tracking connectivity between High Energy and Nuclear Physics (HENP) laboratories and their collaborating universities and institutes around the world.

In this talk, results from measurements will be presented and long-term trends will be discussed. In particular, the development of a large end-to-end performance monitoring infrastructure involving automatic trouble detection and notification will be featured.

Page 3: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Overview

• Motivation for MAGGIE
• High Performance Networks
• Network Monitoring
  – Results
  – Publishing
  – Trouble-shooting and Fault Finding

Page 6: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Motivation

• High Energy and Nuclear Physics (HENP)
• BaBar database contains ~1.5 billion particle physics events, over 750 TB
• Increasing at 100 events per second (8 MBps)
• 100s of TB exported to BaBar centers and 100s of TB of Monte Carlo simulations imported
• LHC will be an order of magnitude larger
• Future of HENP is a distributed data grid

Page 7: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

More Motivation

• Also other data-intensive science
  • Astronomy, genetics
• Other demanding applications
  • High-res medical scans
  • Video-on-demand
• Other fields
  • Digital Divide
  • Malaria centers in Africa, SARS, AIDS

Page 8: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

High Performance Networks

• SLAC has 2 x OC12 (622 Mbps) connections, to the Energy Sciences Network (ESnet) and the California Research and Education Network (CalREN)
• ESnet provides connectivity to labs and to commercial and international networks
• CalREN provides connectivity to UC sites and Abilene
• High-capacity, well-engineered networks
• Bandwidth is required but not sufficient

Page 9: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

[Figure: image taken from the ESnet web site]

Page 10: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Abilene Backbone

• PDF map on the Internet2 web site

[Figure: image taken from the Internet2 web site]

Page 11: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Monitoring Projects (1/2)

• Active (and over-active)
  • PingER/HEP (SLAC, FNAL)
  • PingER/eJDS (SLAC, ICTP)
  • AMP and AMP-IPv6 (NLANR)
  • RIPE-TT (RIPE)
  • Surveyor (Internet2, Wisconsin)
  • NASA
  • IEPM-BW (SLAC, FNAL)
  • NIMI (ICIR, PSC)
  • MAGGIE (ICIR, PSC, SLAC, LBL, ANL)

Page 12: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Monitoring Projects (2/2)

• Passive
  • Netflow (Cisco, IETF)
  • SCNM (LBNL)
  • IPEX (XIWT, Telcordia)
  • NetPhysics
  • Also home-grown systems

Page 13: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

End-to-end Monitoring

• In reality most projects measure end-to-end performance
  • End-host effects are unavoidable
• Internet2 End-to-end Performance Initiative
  • Most useful to users
• Performance Evaluation System (PIPES)
• MAGGIE

Page 14: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

MAGGIE

[Diagram: MAGGIE components. Box labels: NIMI (security and scheduling); IEPM-BW (measurement engine); Publishing; Fault Finding; Analysis Engine; Other tools; NMWG; AMP; RIPE. Site and project labels: SLAC, FNAL, PSC, ICIR, LBNL, ANL, SciDAC, UCL.]

Page 15: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

IEPM-BW

• SLAC package for monitoring and analysis
• Currently 10 monitoring sites
  • SLAC, FNAL, GATech (SOX), INFN (Milan), NIKHEF, APAN (Japan)
  • UMich, Internet2 (Michigan), UManchester, UCL (both UK)
• 2-36 targets

Page 16: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

[Figure: map of IEPM-BW monitoring sites and targets. Network labels: ESnet, CalREN, Abilene, JAnet, APAN, Geant, CESnet. Node and site labels include SNV, CHI, NY, HSTN, SEA, ATL, CLV, IPLS, SLAC, Stanford, NERSC, LANL, JLAB, TRIUMF, KEK, FNAL, ANL, NIKHEF, CERN, IN2P3, CALTECH, SDSC, BNL, ORNL, RAL, UCL, UManc, DL, NNW, Rice, UTDallas, NCSA, UMich, I2, SOX, UFL, RIKEN, INFN-Roma, INFN-Milan, EDG, PPDG/GriPhyN. Legend: Monitoring Site.]

Page 17: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Measurement Engine

• Ping, Traceroute
• Iperf, Bbftp, Bbcp (mem and disk)
• Abwe
• Gridftp, UDPmon
• Web100
• Passive (netflow)

Page 18: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

PingER project has been tracking ping times to HEP collaborators since early 1995

Page 20: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

[Figure: Throughput from SLAC to RAL between May 2002 and February 2003. Series: iperf, bbcpmem, bbcpdisk, bbftp. x-axis: date, 5/13/2002 to 2/17/2003; y-axis: 0 to 250000 (units not shown).]

Page 21: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

[Figure: Available Bandwidth Estimate between SLAC and Caltech in February 2003. x-axis: time, 2/25/2003 0:00 to 2/27/2003 0:00; y-axis: bandwidth in Mbps, 0 to 200.]

Page 22: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Traffic

Typically, Internet traffic is 70% HTTP

Page 23: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Conclusions from IEPM-BW

• Bbftp vs bbcp => Implementation
• Iperf vs bbftp => Disk, CPU
• Packet loss < 0.1%
• TCP/IP parameters must be tuned
• Web100
• FAST, Tsunami
• LSR

Page 24: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Publishing

• Usual method is on the web
  • Too much to review frequently
• Also time delay
  • Want to resolve problems before users complain
• Alarm system based on Web Services
  • GGF NMWG/OGSA

Page 25: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Demo

• Web service is fully described by WSDL
  – http://www-iepm.slac.stanford.edu/tools/soap/MAGGIE.html
• Path.delay.oneWay (Demo)

Page 26: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Troubleshooting

• RIPE-TT Testbox Alarm
• AMP Automatic Event Detection
• Our approach is based on diurnal changes

Page 27: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Diurnal Changes (1/2)

• Parameterize performance in terms of hour and variability within that hourly bin
  – Median and standard deviation of measurements on Monday 7pm-8pm
• AMP uses mean and variance
• RIPE-TT uses rolling average and breaks day into 4
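The hourly binning described on this slide can be sketched as follows. This is a minimal illustration, not the IEPM code: the function names, the choice of population standard deviation, and the sample throughput values are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median, pstdev


def bin_key(ts):
    """Return the (weekday, hour) bin; Monday is 0, so Monday 7pm-8pm is (0, 19)."""
    return (ts.weekday(), ts.hour)


def build_profile(samples):
    """Map each (weekday, hour) bin to (median, standard deviation).

    `samples` is an iterable of (datetime, value) pairs. The population
    standard deviation is used here; the slides do not say which form applies.
    """
    bins = defaultdict(list)
    for ts, value in samples:
        bins[bin_key(ts)].append(value)
    return {key: (median(vals), pstdev(vals)) for key, vals in bins.items()}


# Made-up throughput values for three consecutive Mondays, 7pm-8pm.
history = [
    (datetime(2003, 4, 7, 19, 5), 160.0),
    (datetime(2003, 4, 14, 19, 20), 170.0),
    (datetime(2003, 4, 21, 19, 40), 180.0),
]
profile = build_profile(history)
print(profile[(0, 19)])
```

Keeping one entry per weekday and hour gives at most 168 bins per monitored path, so the historical profile stays small even for long measurement histories.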

Page 28: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Diurnal Changes (2/2)

• Measurements can be classified in terms of how they differ from historical value
  – “Concerned” if latest measurement is more than 1 s.d. from median
  – “Alarmed” if latest measurement is more than 2 s.d. from median
• Recent problems are flagged due to difference from historical value
• Compare to measurement in previous bin (e.g. Monday 6pm-7pm) to reduce false-positives
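The two thresholds above can be sketched as a small classifier. This is an illustration under stated assumptions: the function name is hypothetical, the distance from the median is treated as symmetric because the slide does not say whether only drops count, and the previous-bin comparison is left out because the slide does not specify how it is applied.

```python
def classify(value, med, sd):
    """Classify a measurement by its distance from the bin's historical median.

    More than 2 s.d. away is "Alarmed", more than 1 s.d. is "Concerned",
    anything closer is "Within boundaries". Distance is measured in either
    direction, which is an assumption of this sketch.
    """
    deviation = abs(value - med)
    if deviation > 2 * sd:
        return "Alarmed"
    if deviation > sd:
        return "Concerned"
    return "Within boundaries"


# Made-up historical values for one hourly bin: median 170, s.d. 40.
for latest in (169.57, 120.0, 60.0):
    print(latest, classify(latest, 170.0, 40.0))
```

A production version would probably ignore measurements that are better than the historical value, since only degraded performance needs attention.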

Page 29: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Limitations

• Could be over an hour before alarm is generated
• Need more frequent but sufficiently low impact measurements to allow finer grained troubleshooting
• Migrating to ABWE

Page 30: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Trouble Detection

$ tail maggie.log
04/28/2003 14:58:47 (1:14) gnt4 0.51 Alarm (AThresh=38.33)
04/28/2003 16:25:45 (1:16) gnt4 3.83 Concern (CThresh=87.08)
04/28/2003 17:55:21 (1:17) gnt4 169.57 Within boundaries

Columns: Date and Time, Bin, Node, Throughput (iperf), Status

• Only write to the log if an alarm is triggered
• Keep writing to the log until alarm is cleared
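For readers who want to post-process such a log, a parser might look like the sketch below. The line format is inferred only from the three sample lines on this slide, and the field names in the returned dictionary are assumptions made for the example; the real maggie.log may contain other variants.

```python
import re
from datetime import datetime

# Pattern inferred from the sample lines: date, time, (day:hour) bin,
# node name, throughput, then a free-text status.
LINE_RE = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\((?P<day>\d+):(?P<hour>\d+)\) (?P<node>\S+) "
    r"(?P<throughput>\d+(?:\.\d+)?) (?P<status>.+)"
)
THRESH_RE = re.compile(r"\((?:A|C)Thresh=([\d.]+)\)$")


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LINE_RE.fullmatch(line.strip())
    if m is None:
        return None
    status = m["status"]
    threshold = None
    t = THRESH_RE.search(status)
    if t:
        # Split "Alarm (AThresh=38.33)" into the status word and the threshold.
        threshold = float(t.group(1))
        status = status[: t.start()].strip()
    return {
        "when": datetime.strptime(f"{m['date']} {m['time']}", "%m/%d/%Y %H:%M:%S"),
        "bin": (int(m["day"]), int(m["hour"])),
        "node": m["node"],
        "throughput": float(m["throughput"]),
        "status": status,
        "threshold": threshold,
    }


record = parse_line("04/28/2003 14:58:47 (1:14) gnt4 0.51 Alarm (AThresh=38.33)")
print(record["node"], record["status"], record["threshold"])
```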

Page 31: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Trouble Status

• Tempted to make color-coded web page
• All the hard work still left to do
• Use knowledge to see common point of failure
• Production table would be very large
• Instead figure out where to flag

Page 32: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Net Rat

• Inform on possible problem locations
  – Starting point for human intervention
• No measurement is ‘authoritative’
  – Cannot even believe a measurement
  – Multiple tools and multiple measurement points: cross reference
  – Trigger further measurements (NIMI)

Page 33: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Net Rat Methodology (1/4)

• If last measurement was within 1 s.d.
  • Mark each hop as good
  • Hop.performance = good
• If last measurement was a “Concern”
  • Mark each hop as acceptable
• If last measurement was an “Alarm”
  • Mark each hop as poor

Page 34: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Net Rat Methodology (2/4)

• Measurement generates an alarm
• Set each hop.performance = poor

Page 35: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Net Rat Methodology (3/4)

• Other measurements from same site do not generate alarms
• Set each hop.performance = good
• Immediately rules out a problem in the local LAN or host machine

Page 36: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Net Rat Methodology (4/4)

• Different site monitors same target
• No alarm is generated
• Set each hop.performance = good
• Pinpointed possible problem in intermediate network
  – Of course it couldn’t be that simple
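The four methodology slides can be read as a hop-marking procedure, sketched below. This is one possible interpretation, not the Net Rat implementation: the rule that a hop keeps the best rating seen on any path through it is an assumption drawn from slides 3/4 and 4/4, and every hop name in the example is hypothetical.

```python
# Status of the last measurement on a path mapped to a rating for its hops.
RATING = {"Within boundaries": "good", "Concern": "acceptable", "Alarm": "poor"}
ORDER = {"poor": 0, "acceptable": 1, "good": 2}


def rate_hops(measurements):
    """Return {hop: rating} for a list of (hops, status) measurements.

    Each hop on a path gets the rating implied by that path's status.
    When several paths share a hop, the best rating wins (an assumption).
    """
    ratings = {}
    for hops, status in measurements:
        rating = RATING[status]
        for hop in hops:
            if hop not in ratings or ORDER[rating] > ORDER[ratings[hop]]:
                ratings[hop] = rating
    return ratings


def suspects(ratings):
    """Hops still rated poor after cross-referencing all measurements."""
    return sorted(hop for hop, rating in ratings.items() if rating == "poor")


measurements = [
    # (2/4) A measurement to the target raises an alarm.
    (["mon-host", "mon-border", "backbone-1", "transit-x", "target-lan", "target"], "Alarm"),
    # (3/4) Another measurement from the same site is fine.
    (["mon-host", "mon-border", "backbone-1", "other-site"], "Within boundaries"),
    # (4/4) A different site reaches the same target without an alarm.
    (["mon2-host", "backbone-2", "target-lan", "target"], "Within boundaries"),
]
print(suspects(rate_hops(measurements)))
```

As the last bullet warns, real paths are less tidy: routes change between measurements and a single tool can mislead, so the output is only a starting point for human intervention.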

Page 37: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Arena

• Report findings to informant database
  • Internet2 Arena database
  • PingER Nodes database
  • PIPES Culprit/Contact Database

Page 38: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Toward a Monitoring Infrastructure

• Certainly the need
  – DOE Science Community
  – Grid
  – Troubleshooting / E2Epi
• Many of the ingredients
  – Many monitoring projects
  – Many tools
  – PIPES
  – MAGGIE

Page 39: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Summary

“It is widely believed that a ubiquitous monitoring infrastructure is required.”

Page 40: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Links

• IEPM-BW
• ESnet
• ABwE
• AMP
• NIMI
• RIPE-TT
• E2E PI
• SLAC Web Services
• GGF NMWG
• Arena
• AMP TroubleShooting

Page 41: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Credits

• Les Cottrell
• Connie Logg, Jerrod Williams
• Jiri Navratil
• Fabrizio Coccetti
• Brian Tierney
• Frank Nagy, Maxim Grigoriev
• Eric Boyd, Jeff Boote
• Vern Paxson, Andy Adams
• Iosif Legrand
• Jim Ferguson, Steve Englehart
• Local admins and other volunteers
• DoE/MICS

Page 42: MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance

Output from the demo on slide 25

% ./soap_client.pl
ripe-tt
20030628215739.911553978920.075347

#!/usr/bin/perl

use strict;
use warnings;
use SOAP::Lite;

# Build the client from the published WSDL, then request Path.delay.oneWay
# for the tt81.ripe.net:tt28.ripe.net pair of RIPE test boxes.
my $answer = SOAP::Lite
    -> service('http://www-iepm.slac.stanford.edu/tools/soap/wsdl/profile_07.wsdl')
    -> pathDelayOneWay("tt81.ripe.net:tt28.ripe.net", "");

# Print the tool name, the measurement time and the delay value, one per line.
print $answer->{NetworkTestTool}->{toolName}, "\n";
print $answer->{NetworkTestInfo}->{time}, "\n";
print $answer->{NetworkPathDelayStatistics}->{value}, "\n";