perfSONAR · Measurement Measuring network performance and monitoring network components are a...
Agenda • Motivation • What is perfSONAR? • Suggested Deployment for Campus/Regional
• Networks are an essential part of data-intensive science – Connect data sources to data analysis – Connect collaborators to each other – Enable machine-consumable interfaces to data and analysis resources (e.g. portals), automation, scale
• Performance is critical – Exponential data growth – Constant human factors – Technology changes/improvements/paradigm shifts – Data movement and data analysis must keep up
• Effective use of wide area (long-haul) networks by scientists has historically been difficult
Motivation
Measurement
Measuring network performance and monitoring network components are a critical part of any high-performance network deployed today. In-depth network measurement and monitoring services are key components that give researchers and engineers views into application performance and help them troubleshoot network problems.
Network Monitoring • All networks do some form of monitoring.
• Addresses needs of local staff for understanding state of the network o Would this information be useful to external users? o Can these tools function on a multi-domain basis?
• Beyond passive methods, there are active tools. o E.g. often we want a ‘throughput’ number. Can we automate that idea?
o Wouldn’t it be nice to get some sort of plot of performance over the course of a day? Week? Year? Multiple endpoints?
• Where is the “Measurement Middleware”? Something to allow for the easy exchange of metrics that are collected locally, on a global scale?
Soft Failures • Soft failures are where basic connectivity functions, but high performance is not possible.
• TCP was intentionally designed to hide all transmission errors from the user: – “As long as the TCPs continue to function properly and the internet system does not become completely partitioned, no transmission errors will affect the users.” (From IEN 129, RFC 761)
• Some soft failures only affect high bandwidth long RTT flows.
• Hard failures are easy to detect & fix • Soft failures can lie hidden for years! • Soft failures can be present on the host, protocol, application, or network
• One network problem can often mask others – this is common
Where Are The Problems?
[Figure: end-to-end path diagram. Labels: Source Campus (S), Regional, Backbone/NREN, Regional, Destination Campus (D)]
• Congested or faulty links between domains
• Congested intra-campus links
• Latency dependent problems inside domains with small RTT
[Figure: end-to-end path diagram. Labels: Source Campus (S), Regional, R&E Backbone, Regional, Destination Campus (D)]
• Performance is good when RTT is < ~10 ms
• Performance is poor when RTT exceeds ~10 ms
• Switch with small buffers
Local Testing Will Not Find Everything
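A rough way to see why the same path can look fine at short RTT and poor at long RTT is the bandwidth-delay product (BDP): the amount of data a single flow must keep in flight to fill the path. The sketch below is illustrative arithmetic only. The 10 Gb/s rate and the RTT values are assumptions, not figures from the slides.

```python
def bdp_bytes(rate_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return rate_bps * rtt_s / 8

LINK_RATE_BPS = 10e9  # assumed 10 Gb/s path, for illustration only

for rtt_ms in (1, 10, 50, 100):  # assumed RTT values
    mib = bdp_bytes(LINK_RATE_BPS, rtt_ms / 1000) / 2**20
    print(f"RTT {rtt_ms:>3} ms -> {mib:7.1f} MiB in flight")
```

The data in flight grows linearly with RTT, so a burst that a small-buffer device absorbs on a local test may overflow it on a long path.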
Agenda • Motivation • What is perfSONAR? • Suggested Deployment for Campus/Regional
What is perfSONAR? • perfSONAR is a tool to: • Set network performance expectations for a variety of use cases • Find network problems (“soft failures”) & help fix these problems • Mitigate the risks that are associated with the R&E environment (e.g. get out in front of problems before it's too late) • All in multi-domain environments • These problems are all harder when multiple networks are involved – need a mechanism to stop ‘finger pointing’ and get real work done
• perfSONAR provides a standard way to publish active and passive monitoring data – This data is interesting to network researchers as well as network operators – This is the measurement middleware – a way to tie together local and end-to-end measurements – A way to separate a network problem from that of an application or host
• 10 – ESnet Science Engagement ([email protected]) - 5/6/14
What is perfSONAR (cont.)
• perfSONAR is an infrastructure for network performance monitoring.
• It is a services oriented architecture delivering performance measurements in a federated environment.
• It is an intermediate layer between the performance measurement tools and the diagnostic or visualization applications.
• A methodology for monitoring network connections that span multiple administrative domains.
• Partners include: GEANT2, ESnet, I2, RNP • http://www.perfsonar.net/
perfSONAR Present
Lookup Service Directory Search: http://stats.es.net/ServicesDirectory/
Lookup Service
• Services register their existence and capabilities with an LS.
• Clients discover services by querying the LS.
• LSs are found by multicast, well-known servers, local configuration, or other LSs.
• The LS is queried on attributes (service type, authentication) and more complex constructs (network location), not simply by name.
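The register-then-query pattern described above can be sketched with a toy in-memory registry. This is not the perfSONAR Lookup Service API. The class names, attribute keys, and service names are all hypothetical and chosen only to show attribute-based discovery.

```python
from dataclasses import dataclass, field

@dataclass
class ServiceRecord:
    """A registered service and the attributes it advertises (hypothetical)."""
    name: str
    attributes: dict = field(default_factory=dict)

class LookupService:
    """Toy in-memory stand-in for an LS, not the real perfSONAR interface."""

    def __init__(self):
        self._records = []

    def register(self, record: ServiceRecord) -> None:
        # Services announce their existence and capabilities.
        self._records.append(record)

    def query(self, **wanted):
        # Clients match on attributes rather than on the service name.
        return [r for r in self._records
                if all(r.attributes.get(k) == v for k, v in wanted.items())]

ls = LookupService()
ls.register(ServiceRecord("ma-1", {"service_type": "MA", "authentication": "none"}))
ls.register(ServiceRecord("mp-1", {"service_type": "MP", "authentication": "token"}))
print([r.name for r in ls.query(service_type="MA")])
```

A real LS would also handle registration expiry and location-based queries, which this sketch leaves out.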
Measurement Archive Service
• Measurement Archives store data in databases and publish data produced by MPSs (or TSs). • They also provide a historical record for analysis. • Reduces queries to the MPS by publishing to multiple clients. • As a server, it accepts and stores setup and publication requests. • As a client, it registers with an LS, subscribes to an MPS or other MAs, and publishes data to subscribers.
• The ToS is a specific example of a TS used to make topological information available to the framework.
• Understanding topology is necessary for the measurement system to optimize its operations (closest nodes).
• ToS may also be used by overview/map clients to present measurement data.
Topology Service
Measurement Point Service
• MPS creates and publishes data by initiating active measurements or querying passive devices. • A setup protocol allows users to request measurements and publish the results. • As a server, the MPS accepts requests and publishes the data (client subscriber handle must be known in advance). • As a client, the MPS registers with the LS and publishes to subscribers.
perfSONAR-PS Services
• Focus on development of major perfSONAR components – SNMP Based MP/MA – Lookup Service – Topology – Link Status
• New additions – OWAMP/BWCTL – Traceroute – Pinger (SLAC+Fermilab) – Visualization (perfSONAR-UI plugins + meter)
SNMP Based MP/MA
• Deployed – Internet2 Network – ESNet – Georgia Tech/SLAC/University of Delaware – All over
• Compatible with perfSONAR-UI • CPAN package in development
Pinger Based MP/MA
• Joint effort between Fermilab and SLAC • Present views of historic Pinger data • Expose interface to schedule live tests
• Development and integration into perfSONAR-PS based on LHC-OPN requirements
Visualization
• Utilizing the plugin architecture of perfSONAR-UI
• Data visualization beyond network utilization • Google Maps
• Utilization by physical location • 'Weather Map' of Internet2 Network
• Web-based speedometer to interact directly with MA code • MaDDash
Other services in development
• Topology/LS service • UNIS development (Indiana University)
• MaDDash/mesh • Ease full mesh deployment
• OWAMP MA • Coordinate regular scheduled tests with BWCTL
• BWCTL MA • Coordinate regular scheduled tests with OWAMP
Agenda • Motivation • What is perfSONAR? • Suggested Deployment for Campus/Regional
• The “perfSONAR Toolkit” is an open source implementation and packaging of the perfSONAR measurement infrastructure and protocols – everything you (or your scientists) need to get a baseline and start addressing true problems
• http://psps.perfsonar.net/toolkit • All components are available as RPMs, and bundled into a CentOS 6-based “netinstall” and a “Live CD” • perfSONAR tools are much more accurate if run on a dedicated perfSONAR host, not on the DTN.
• Very easy to install and configure • Usually takes less than 30 minutes
perfSONAR Toolkit
• We can’t wait for users to report problems and then fix them (soft failures can go unreported for years!)
• Things just break sometimes – Failing optics – Somebody messed around in a patch panel and kinked a fiber – Hardware goes bad
• Problems that get fixed have a way of coming back – System defaults come back after hardware/software upgrades – New employees may not know why the previous employee set things up a certain way and back out fixes
• Important to continually collect, archive, and alert on active throughput test results
Importance of Regular Testing
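The "collect, archive, and alert" loop can be sketched as a comparison of each new throughput result against the archived history. This is a minimal sketch, not how perfSONAR or its NAGIOS integration decides to alert. The function name, the 50% threshold, and the sample values are made up for illustration.

```python
from statistics import median

def degraded(history_mbps, latest_mbps, drop_fraction=0.5):
    """Flag a result that falls below a fraction of the historical median.

    drop_fraction is an assumed threshold, not a perfSONAR default.
    """
    if not history_mbps:
        return False  # no baseline yet, nothing to compare against
    return latest_mbps < drop_fraction * median(history_mbps)

archive = [9200, 9400, 9100, 9300]  # made-up archived results, in Mb/s

print(degraded(archive, 9000))  # within the normal range
print(degraded(archive, 3100))  # well below the baseline, worth an alert
```

Comparing against a median of archived results, rather than a fixed number, lets the same check work for paths with very different normal rates.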
Regular perfSONAR Tests • We run regular tests to check for two things
– TCP throughput – One way delay and packet loss
• perfSONAR has mechanisms for managing regular testing between perfSONAR hosts – Statistics collection and archiving – Graphs – Dashboard display – Integrate with NAGIOS
• This infrastructure is deployed now – perfSONAR hosts at facilities can take advantage of it
• At-a-glance health check for data infrastructure
• perfSONAR Dashboard: http://ps-dashboard.es.net
• What are you going to measure? – Achievable bandwidth
• 2-3 regional destinations • 4-8 important collaborators • 4-8 (more if you are willing, especially to start) times per day to each destination
• 20-30 second tests within a region, longer across oceans and continents
– Loss/Availability/Latency • OWAMP: ~10-20 collaborators over diverse paths
– Interface Utilization & Errors (via SNMP) • Guidance on servers to buy: • http://psps.perfsonar.net/toolkit/hardware.html • Virtualization is tricky; dedicated hardware is recommended.
Develop a Test Plan
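To get a feel for how much test traffic such a plan generates, the ranges above can be multiplied out. The sketch below uses only the numbers on the slide (2-3 regional destinations, 4-8 collaborators, 4-8 runs per day, 20-30 second tests). The function name is hypothetical, and the longer intercontinental tests are ignored for simplicity.

```python
def daily_test_seconds(destinations, runs_per_day, seconds_per_test):
    """Total seconds per day spent running throughput tests."""
    return destinations * runs_per_day * seconds_per_test

# Low end: 2 regional + 4 collaborators, 4 runs per day, 20 s each.
low = daily_test_seconds(2 + 4, 4, 20)
# High end: 3 regional + 8 collaborators, 8 runs per day, 30 s each.
high = daily_test_seconds(3 + 8, 8, 30)

print(low, high)
print(f"{high / 86400:.1%} of the day")
```

Even at the high end the throughput tests occupy only a few percent of the day, which leaves room to start with more frequent tests as the slide suggests.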
perfSONAR Deployment Locations • Critical to deploy such that you can test with useful semantics • perfSONAR hosts allow parts of the path to be tested separately
– Reduced visibility for devices between perfSONAR hosts – Must rely on counters or other means where perfSONAR can’t go
• Effective test methodology derived from protocol behavior – TCP suffers much more from packet loss as latency increases – TCP is more likely to cause loss as latency increases – Testing should leverage this in two ways
• Design tests so that they are likely to fail if there is a problem • Mimic the behavior of production traffic as much as possible
– Note: don’t design your tests to succeed • The point is not to “be green” even if there are problems • The point is to find problems when they come up so that the problems are fixed quickly
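The claim that TCP suffers more from loss as latency increases can be illustrated with the well-known Mathis et al. approximation for loss-limited TCP throughput, rate ≤ (MSS / RTT) · (C / √p). The slides do not cite this model; it is brought in here as an assumption, and the MSS, loss rate, and RTT values below are chosen only for illustration.

```python
from math import sqrt

def mathis_bps(mss_bytes, rtt_s, loss_rate, c=sqrt(3 / 2)):
    """Approximate upper bound on TCP throughput in bits per second.

    Uses the Mathis et al. model: (MSS / RTT) * (C / sqrt(p)).
    """
    return (mss_bytes * 8 / rtt_s) * (c / sqrt(loss_rate))

MSS_BYTES = 1460   # assumed MSS for a 1500-byte MTU
LOSS_RATE = 1e-4   # assumed 0.01% packet loss

for rtt_ms in (1, 10, 50, 100):  # assumed RTT values
    mbps = mathis_bps(MSS_BYTES, rtt_ms / 1000, LOSS_RATE) / 1e6
    print(f"RTT {rtt_ms:>3} ms -> at most ~{mbps:8.1f} Mb/s")
```

With loss held constant, the bound falls in proportion to RTT, which is why a test to a distant host is more likely to expose a problem than a test across the campus.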
Sample Site Deployment