perfSONAR for LHCOPN/LHCONE Update
Shawn McKee, University of Michigan
LHCONE/LHCOPN Meeting, Amsterdam, NL, October 28th, 2015


Overview of Talk
- perfSONAR changes and updates
- WLCG, LHCONE and LHCOPN infrastructure overview
- Status and changes in our meshes
- Some new tools: ElasticSearch, MadAlert and topology explorations for our data
- Summary and discussion

perfSONAR v3.5 Toolkit
- perfSONAR v3.5 released on the 28th of September
- Main themes for this release:
  - Support for central host management and node auto-configuration
  - Support for low cost nodes
  - Support for Debian, VMs, and other installation options
  - Modernize the GUIs
- In addition, v3.5 incorporates feedback and bugfixes from our WLCG/OSG deployments, improving robustness.
- WLCG/OSG deployment status as of today (great progress):
  - [...]: 6
  - [...]: 23
  - 3.5: 2
  - [...]: 195
  - Unknown: 18 (these nodes are either down or hung)

New perfSONAR Deployment Options
- Configuration-managed deployments via bundles (see http://docs.perfsonar.net/install_options.html):
  - perfSONAR Tools (just tools)
  - perfSONAR TestPoint (passive, no MA)
  - perfSONAR Core (+MA)
  - perfSONAR Complete (+Web and Toolkit Configuration)
  - perfSONAR Central Management (MaDDash, Auto-config, Centralized config service)
- Low-cost nodes to support large-scale deployment (http://docs.perfsonar.net/low_cost_nodes.html):
  - $[...] range should enable broad deployment
  - Small form factor enables more locations
  - Some limitations in capabilities due to hardware
- VMs: still not recommended, but possible
  - Target: whole-node VMs, VMs with dedicated physical NICs
  - Main use: end-to-end infrastructure testing (not network)
- What about Docker?
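The bundle list above reads as a ladder in which each bundle adds something to the previous one. As a rough illustration only, the sketch below picks the smallest bundle that covers a set of needs. The capability flags (`tools`, `test_endpoint`, `measurement_archive`, `web_ui`) are hypothetical labels invented for this example, not perfSONAR package or option names, and Central Management is left out because the slide lists it as a separate role rather than a larger node install.

```python
# Ordered from smallest to largest install, following the slide.
# The flag names are hypothetical labels, not perfSONAR identifiers.
BUNDLES = [
    ("perfSONAR Tools", {"tools"}),
    ("perfSONAR TestPoint", {"tools", "test_endpoint"}),
    ("perfSONAR Core", {"tools", "test_endpoint", "measurement_archive"}),
    ("perfSONAR Complete",
     {"tools", "test_endpoint", "measurement_archive", "web_ui"}),
]


def smallest_bundle(required):
    """Return the first (smallest) bundle whose flags cover `required`."""
    required = set(required)
    for name, flags in BUNDLES:
        if required <= flags:
            return name
    raise ValueError(f"no bundle provides: {sorted(required)}")


if __name__ == "__main__":
    # A site that only needs to answer tests, with no local archive.
    print(smallest_bundle({"test_endpoint"}))
    # A site that also wants to store its own results.
    print(smallest_bundle({"measurement_archive"}))
```

Keeping the list ordered means the first match is always the lightest install, which matters most for the low-cost nodes mentioned above.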
Current perfSONAR Deployment
- 278 perfSONAR instances registered in GOCDB/OIM
- 245 active perfSONAR instances
- 197 running the latest version (3.5+)
- Initial deployment coordinated by the WLCG perfSONAR TF
- Commissioning of the network followed by the WLCG Network and Transfer Metrics WG
- For stats: https://www.google.com/fusiontables/DataSource?docid=1QT4r17HEufkvnqhJu24nIptZ66XauYEIBWWh5Kpa#map:id=3

Overview of perfSONAR Pipeline
The diagram on the right provides a high-level view of how WLCG/OSG is managing our perfSONAR deployments, gathering metrics and making them available for use. We will cover some of the details in what follows.

Gathering & Storing Metrics
- OSG is providing network metric data for its members and WLCG via the Network Datastore
  - The data is gathered from all WLCG/OSG perfSONAR instances
  - Stored indefinitely on OSG hardware
  - Made available via API
  - In production since September 14th
- The primary use-cases:
  - Network problem identification and localization
  - Network-related decision support
  - Network baseline: set expectations and identify weak points for upgrading

OSG Network Datastore Diagram
- OSG is gathering relevant metrics from the complete set of OSG and WLCG perfSONAR instances
- Operating now
- Running VMs on dedicated hardware
- Data also published to the CERN Active MQ instance and available for user subscription
- Actively tuning and debugging
- Diagram notes: 8 VMs; storage must host 7 distinct areas

Changes for LHCOPN/LHCONE
- We have changed to uni-directional tests for OWAMP to reduce the load
  - The source host is responsible for initiating tests to each destination and recording the results
- We are using iperf3 as the baseline for bandwidth measurements (adds retry information)
- A recent fix for NDT ensures that the TCP congestion control algorithm is htcp rather than reno when NDT and NPAD are not in use. This should improve bandwidth results.
- Plan to gather all the LHCOPN and LHCONE data into ElasticSearch (ongoing)

Existing Test Coverage
- Current perfSONAR measurement coverage for WLCG/OSG:
  - Full latency (one-direction only, 10Hz, OWAMP, IPv4)
  - Full traceroute (bi-directional, hourly, BWCTL/OWAMP, IPv4, IPv6)
  - Full bandwidth (one-direction only, fortnightly, BWCTL-only!, IPv4, IPv6)
- Regional meshes are still disabled; we need to discuss how to evolve them
  - We can create any sub-mesh of the full latency mesh (for free, but only IPv4 and using the same parameters)
  - We could move from regional to bigger meshes (European, Asia/Pacific, US)
  - We can create new bandwidth meshes, as bwctl needs fewer resources (but only for BWCTL-only nodes, not on dual-nodes)
- We re-enabled project meshes
  - Belle II: both latency and bandwidth
  - Dual-stack: just bandwidth (both IPv4 and IPv6)
  - LHCONE/LHCOPN: these need to be separately tracked

OMD for LHCONE/LHCOPN perfSONARs
- https://maddash.aglt2.org/WLCGperfSONAR/check_mk/ (Prototype)
- https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk/ (Production)
- We monitor:
  - Expected test coverage
  - NDT/NPAD running?
  - Memory on hosts ([...]

[...] CERN Active MQ -> Flume -> ES -> PANDA
- Prototype working and analytics being performed in ElasticSearch to validate data (see following slide)
- Plan is to create a network source-destination cost-matrix PANDA can use to evaluate options
- Actual interface details being discussed with PANDA team
- Can also be used to analyze LHCONE/LHCOPN data!
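To make the source-destination cost-matrix idea concrete, here is a minimal sketch, assuming measurements arrive as `(src, dst, latency_ms, loss_fraction)` tuples. The cost formula (mean latency plus a weighted mean loss), the `loss_weight` default, and both function names are assumptions made for illustration; the actual interface was still being discussed with the PANDA team and is not described in these slides.

```python
from collections import defaultdict
from statistics import mean


def build_cost_matrix(measurements, loss_weight=1000.0):
    """Average per-pair latency and loss, then combine them into one cost.

    `measurements` is an iterable of (src, dst, latency_ms, loss_fraction).
    cost = mean latency (ms) + loss_weight * mean loss fraction.
    The formula and the default weight are illustrative assumptions.
    """
    samples = defaultdict(list)
    for src, dst, latency_ms, loss in measurements:
        samples[(src, dst)].append((latency_ms, loss))

    matrix = defaultdict(dict)
    for (src, dst), values in samples.items():
        avg_latency = mean(v[0] for v in values)
        avg_loss = mean(v[1] for v in values)
        matrix[src][dst] = avg_latency + loss_weight * avg_loss
    return {src: dict(row) for src, row in matrix.items()}


def cheapest_source(matrix, dst, candidates):
    """Pick the candidate source with the lowest cost to `dst`, if any."""
    reachable = [s for s in candidates if dst in matrix.get(s, {})]
    if not reachable:
        return None
    return min(reachable, key=lambda s: matrix[s][dst])


if __name__ == "__main__":
    # Site names are placeholders, not real WLCG sites.
    data = [
        ("SITE_A", "SITE_C", 10.0, 0.0),
        ("SITE_A", "SITE_C", 20.0, 0.0),
        ("SITE_B", "SITE_C", 5.0, 0.02),
    ]
    costs = build_cost_matrix(data)
    print(costs)
    print(cheapest_source(costs, "SITE_C", ["SITE_A", "SITE_B"]))
```

The matrix is deliberately not symmetric: since the latency tests are uni-directional, the cost from A to B is stored separately from the cost from B to A.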
perfSONAR Data into ElasticSearch
[Plots: Avg src loss, Avg src loss %, Avg dst loss, Avg dst loss %]
[...] for example plots using WLCG data

MadAlert: A new project to analyze meshes
- Gabriele Carcassi has been working with me on creating a new utility to analyze meshes: MadAlert
  - See details at [...]
  - You can see meshes and reports from the page
  - Reports find both infrastructure and network problems
- We are now working with Andy Lake/ESnet to incorporate this into the next major release of MaDDash (v2.0)
- Now testing a diff to allow us to compare meshes; e.g., IPv4 vs IPv6, testing vs production, mesh(t1) vs mesh(t2)
  - Could be really helpful for understanding new software versions or changes in time.
  - Time-based comparison will require some modifications to MaDDash to allow specifying time-based meshes.

Understanding Network Topology
Can we create tools to manipulate, visualize, compare and analyze network topologies from the OSG network datastore contents?

Can we build upon these tools to create a set of next-generation network diagnostic tools to make debugging network problems easier, quicker and more accurate?

Even without requiring the ability to perform complicated data analysis and correlation, basic tools developed in the area of network topology-based metric visualization would be very helpful in letting users and network engineers better understand what is happening in our networks.
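As one possible starting point for such basic topology tools, the sketch below treats each traceroute result as an ordered list of hop names. It counts how many measured paths share each link and reports which links changed between two traceroutes for the same pair. The input shape and all function names are assumptions for illustration; this is not the datastore's API or an existing tool.

```python
from collections import Counter


def path_links(hops):
    """Turn an ordered hop list into a list of directed (from, to) links."""
    return list(zip(hops, hops[1:]))


def link_usage(paths):
    """Count how many paths traverse each directed link.

    `paths` maps (src, dst) -> ordered list of hop names.
    """
    usage = Counter()
    for hops in paths.values():
        # set() so a looping path counts a link only once.
        usage.update(set(path_links(hops)))
    return usage


def diff_paths(old_hops, new_hops):
    """Return (links only in the old path, links only in the new path)."""
    old, new = set(path_links(old_hops)), set(path_links(new_hops))
    return sorted(old - new), sorted(new - old)


if __name__ == "__main__":
    # Hop names are placeholders for router or network identifiers.
    paths = {
        ("s1", "d"): ["s1", "r1", "r2", "d"],
        ("s2", "d"): ["s2", "r1", "r2", "d"],
    }
    print(link_usage(paths).most_common(2))
    print(diff_paths(["s", "r1", "d"], ["s", "r2", "d"]))
```

Links used by many paths are natural first suspects when several source-destination pairs degrade at the same time.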
Last time I showed some examples of potentially useful components to begin looking at network topology.

Exploring Path Analysis
[Figure: network path graph annotated with latency, packet-loss and throughput; labeled nodes include DFN, JANET, GEANT, RAL, Aachen, ITEP and QMUL]
- We can correlate paths with packet-loss/latency information
- We can simplify the graph by aggregating nodes that belong to the same NREN (visual debugging)

WLCG Support Unit
- Reminder: We have a GGUS support unit (WLCG Network Throughput; https://wiki.egi.eu/wiki/GGUS:WLCG_Network_Throughput) used to report incidents (mailing list: wlcg-network-throughput at cern.ch)
- Experiments can report potential network performance incidents. WLCG perfSONAR support investigates and confirms whether the incident is network related. Once confirmed, it notifies the relevant sites and tries to assist in narrowing down the problem to particular link(s). Tracking of ongoing incidents will be via the WG page.
- Sites observing a network performance problem should follow their standard procedure, i.e. report to their network team and, if necessary, escalate to their network provider. If the problem is confirmed to be WAN related, the WLCG perfSONAR support unit can assist in further debugging. For non-technical (policy) issues, sites should escalate to WLCG operations coordination.
- https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Performance_Incidents
- LHCOPN/LHCONE experts are very important in this coordinated activity.

Next Steps
- We are working on getting ALL WLCG/OSG perfSONAR instances fully operational and properly configured
  - We have hints that some perfSONAR services stop or hang under some circumstances. Working with developers to isolate/fix.
  - Some hosts are underpowered (