Managing and monitoring large scale data transfers - Networkshop44
![Page 1: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/1.jpg)
Managing and monitoring large scale data transfers (WLCG FTS service as an example)
Brian Davies
![Page 2: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/2.jpg)
Managing and monitoring large scale data transfers
(WLCG FTS service as an example)
Brian Davies
Networkshop44
22/03/16
![Page 3: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/3.jpg)
Outline
• Outline of the data transfer monitoring
• What is the File Transfer Service (FTS)
• Monitoring at different levels
  – Central FTS data transfer monitoring
  – Virtual Organisation (VO) specific
  – User Monitoring
• Federated Failover
• Use of “generic” monitoring tools
  – Site Monitoring in conjunction with VO monitoring
![Page 4: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/4.jpg)
WLCG has a lot of data transfers to monitor
• 167 sites in 43 countries on six continents
• Storage endpoints containing 250PB (disk) 300PB (tape)
  – Organised and chaotic access
  – Supporting single/multiple endpoints for single/multiple Virtual Organisations
  – Vary in size and scope
    • 10TB-10s of PB of total storage (disk and tape)
    • 1/10 GE NICs, 1/10/100 Gbps, R&E networks and private OPN
    • 10TB-1PB filesystems/object stores, 1-300 diskservers per site
    • Multiple filesystems (XFS, HDFS, CEPH, GPFS, Lustre)
• Central Production and User initiated
• Last two years WLCG has moved 0.5EB of data
  – Over 1 billion files.
• WN jobs produce a lot of data which also has to be stored/moved
  – One VO runs 200k concurrent jobs which last 10 mins to 72 hours.
  – 0-100s of input files, 2-3 output files
    • Individual file open times 1-10000s
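As a rough sense of scale, the figures above (0.5EB and over 1 billion files in two years) can be turned into an average rate and an average file size. This is only back-of-envelope arithmetic, assuming decimal units (1EB = 10^18 bytes), exactly 1 billion files, and two 365-day years; real traffic is bursty rather than constant.

```python
# Back-of-envelope averages from the slide's figures (assumptions noted inline).
EB = 10**18                      # assumption: decimal exabyte
total_bytes = 0.5 * EB           # "0.5EB of data" moved in the last two years
total_files = 1_000_000_000      # "over 1 billion files", taken as a lower bound
seconds = 2 * 365 * 24 * 3600    # assumption: two 365-day years

avg_rate_bytes = total_bytes / seconds          # bytes per second
avg_rate_gbit = avg_rate_bytes * 8 / 1e9        # gigabits per second
avg_file_mb = total_bytes / total_files / 1e6   # megabytes per file
files_per_second = total_files / seconds

print(f"average rate: {avg_rate_gbit:.1f} Gbit/s")
print(f"average file size: {avg_file_mb:.0f} MB")
print(f"average files per second: {files_per_second:.1f}")
```

Under these assumptions the sustained average is a little over 63 Gbit/s, with a mean file size of about 500 MB.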
![Page 5: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/5.jpg)
Transfers to a single site/1day/1VO
![Page 6: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/6.jpg)
Easily fill our networks*
*Not all the time
![Page 7: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/7.jpg)
Data movements vary greatly
• File size from ~10B to ~10GB
• Latency between hosts from 0.1ms to 350ms (just for the UK)
• Different workflows require different data movement
  – WAN SE<->SE, SE->WN, WN->SE
  – LAN WN<->SE, SE<->SE
• Different tools to monitor different workflows
• Different storage middleware
  – Native GridFTP, BeStMan, DPM, dCache, StoRM
• Different transfer protocols
  – gsiFTP, http/WebDAV, xrootd, NFSv4.1, S3
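The latency range matters because a single TCP stream can only keep one window of data in flight per round trip. A minimal sketch of the bandwidth-delay product, assuming the quoted latencies are round-trip times and taking a 10 Gbit/s path as an illustrative link speed (neither assumption is stated on the slide):

```python
def bdp_bytes(link_gbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    bits_per_second = link_gbps * 1e9
    return bits_per_second * (rtt_ms / 1000.0) / 8


# The two latency extremes quoted on the slide, on an assumed 10 Gbit/s path.
for rtt_ms in (0.1, 350.0):
    mb = bdp_bytes(10.0, rtt_ms) / 1e6
    print(f"RTT {rtt_ms} ms -> {mb:.3f} MB in flight to fill 10 Gbit/s")
```

The gap between the two results is one reason transfer parameters such as buffer sizes and the number of parallel streams cannot be set once for all site pairs.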
![Page 8: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/8.jpg)
File Transfer Service (FTS) moves data!
• EGI Middleware Stack
• Can handle many VOs
  – 22 (HEP and non-HEP)
• Checksum validation of files
• Retry of failed transfers
• Auto-optimisation of transfer parameters to maximise throughput
• Ability to set limits suitable for varied storage setups
• Web friendly GUI also available!!
  – Mainly use command line tools or higher level control systems.
• Handle many file transfers (~1.5M a day)
  – Single to thousands of files per single submission
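Two of the features listed for FTS, checksum validation and retry of failed transfers, can be illustrated with a local file copy. This is a toy sketch, not FTS code or its API: `adler32_of` and `copy_with_retry` are hypothetical names, the "transfer" is a plain `shutil.copyfile`, and ADLER32 is chosen only because it is available in Python's standard library.

```python
import shutil
import time
import zlib
from pathlib import Path


def adler32_of(path: Path, chunk_size: int = 1 << 20) -> int:
    """Compute the ADLER32 checksum of a file, reading it in chunks."""
    value = 1  # ADLER32 starting value
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return value


def copy_with_retry(source: Path, destination: Path,
                    max_attempts: int = 3, delay_s: float = 0.0) -> int:
    """Copy a file, verify its checksum, and retry on failure.

    Returns the number of attempts used; raises RuntimeError if all fail.
    """
    expected = adler32_of(source)
    for attempt in range(1, max_attempts + 1):
        try:
            shutil.copyfile(source, destination)
            if adler32_of(destination) == expected:
                return attempt
        except OSError:
            pass  # treat an I/O error like a failed transfer and retry
        time.sleep(delay_s)
    raise RuntimeError(f"transfer failed after {max_attempts} attempts")
```

A call such as `copy_with_retry(Path("in.dat"), Path("out.dat"))` returns 1 when the first copy verifies; a real service would also record each failed attempt so that it shows up in the monitoring.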
![Page 9: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/9.jpg)
Web GUI
![Page 10: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/10.jpg)
Monitoring at different levels
• Overview of all transfers to see problematic sites is needed
  – But we also need the ability to look at individual transfers
• Web GUIs, reading log files
  – Even have web GUIs which parse log files
• People using the monitoring vary:
  – Site admins, regional support, VO users, middleware developers.
• Management and technical
  – Different systems work well for different use cases.
• What is of interest?
  – Do transfers complete or fail?
  – How fast do they complete?
• How can I tell if my changes improve/worsen the system?
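The two questions of interest (do transfers complete, and how fast) reduce to a success rate and a throughput per site or link. A minimal sketch of that aggregation, assuming a hypothetical record layout (`site`, `ok`, `size_bytes`, `seconds`) rather than any real FTS log or dashboard schema:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Transfer:
    site: str        # hypothetical field: destination site name
    ok: bool         # did the transfer complete?
    size_bytes: int  # file size in bytes
    seconds: float   # wall-clock duration of the attempt


def summarise(transfers):
    """Return {site: (success_rate, MB/s over successful transfers)}."""
    grouped = defaultdict(list)
    for t in transfers:
        grouped[t.site].append(t)
    summary = {}
    for site, items in grouped.items():
        done = [t for t in items if t.ok]
        rate = len(done) / len(items)
        total_bytes = sum(t.size_bytes for t in done)
        total_seconds = sum(t.seconds for t in done)
        # Bytes moved divided by time spent, counting successful transfers only.
        mb_per_s = total_bytes / total_seconds / 1e6 if total_seconds else 0.0
        summary[site] = (rate, mb_per_s)
    return summary


# Invented sample records, for illustration only.
sample = [
    Transfer("SITE-A", True, 2_000_000_000, 20.0),
    Transfer("SITE-A", False, 2_000_000_000, 5.0),
    Transfer("SITE-B", True, 500_000_000, 10.0),
]
for site, (rate, speed) in summarise(sample).items():
    print(f"{site}: {rate:.0%} success, {speed:.0f} MB/s")
```

Running the same summary over the periods before and after a configuration change is one simple way to judge whether the change improved or worsened the system.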
![Page 11: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/11.jpg)
Central FTS Monitoring (dashboards and server GUIs)
Three Main VOs usage varies
![Page 12: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/12.jpg)
Overview to see if single site is having issues
![Page 13: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/13.jpg)
View smaller selections…
Able to make sub-selections to diagnose problems not at the world scale
![Page 14: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/14.jpg)
Comparison between inter-SE rates
Sites want to know if they are better than their collaborators/competitors
![Page 15: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/15.jpg)
Ability to delve into greater detail at the server level
Many embedded links to further monitoring
![Page 16: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/16.jpg)
Down to individual transfers
![Page 17: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/17.jpg)
To the log file
![Page 18: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/18.jpg)
Which VO can then re-interpret
![Page 19: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/19.jpg)
Transfer optimisation within FTS to increase individual transfers
![Page 20: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/20.jpg)
Listing Errors (Helps find most important errors to solve)
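Ranking failures by message is essentially a frequency count. A small sketch, assuming failed transfers are available as plain error strings (the messages below are made up for illustration and are not real FTS output):

```python
from collections import Counter

# Hypothetical error strings, invented for illustration only.
errors = [
    "connection timed out",
    "checksum mismatch",
    "connection timed out",
    "no space left on destination",
    "connection timed out",
    "checksum mismatch",
]

# most_common() sorts by descending count, so the top entry is the
# failure mode that would remove the most failed transfers if fixed.
ranked = Counter(errors).most_common()
for message, count in ranked:
    print(f"{count:3d}  {message}")
```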
![Page 21: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/21.jpg)
File list of failed transfers for a single failure mode
![Page 22: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/22.jpg)
History of a single file
![Page 23: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/23.jpg)
Dedicated transfers to monitor rates
![Page 24: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/24.jpg)
Users Gather their own information
• But systems change which breaks the monitoring.
![Page 25: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/25.jpg)
AAA et al for federated failover
• VOs each have their own system (AAA/FAX)
  – But do similar actions
  – Copies data from remote storage to the WN if a local copy does not exist
• Allows for storage-less sites to be used.
• Helps to reduce failures caused by local storage related issues.
• Hierarchical redirection
  – Local->regional->continental->global (or another convention)
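The hierarchical redirection above can be sketched as an ordered lookup that widens its scope until a copy is found. This is a toy model under stated assumptions: the tier names come from the slide, but `locate`, the catalogue dictionary, and the file names are hypothetical and do not reflect how AAA or FAX are implemented.

```python
from typing import Optional

# Tier order taken from the slide: narrowest scope first.
TIERS = ["local", "regional", "continental", "global"]


def locate(filename: str, catalogue: dict) -> Optional[str]:
    """Return the first tier (narrowest scope first) that holds the file."""
    for tier in TIERS:
        if filename in catalogue.get(tier, set()):
            return tier
    return None  # no replica found at any tier


# Hypothetical replica catalogue: which tier can serve which files.
catalogue = {
    "local": set(),                      # e.g. a storage-less site
    "regional": {"dataset/file1.root"},
    "global": {"dataset/file1.root", "dataset/file2.root"},
}

print(locate("dataset/file1.root", catalogue))    # served from the regional tier
print(locate("dataset/file2.root", catalogue))    # falls through to global
print(locate("dataset/missing.root", catalogue))  # not found anywhere
```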
![Page 26: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/26.jpg)
Example of global network
![Page 27: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/27.jpg)
FAX backup transfer mechanism also monitored
![Page 28: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/28.jpg)
• Outline of the scale of the data transport issue for WLCG
• What is the File Transfer Service (FTS)
• Monitoring at different levels
  – Central FTS data transfer monitoring
  – VO specific
  – User Monitoring
• Federated Failover
• Use of “generic” monitoring tools
  – Site Monitoring in conjunction with VO monitoring
![Page 29: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/29.jpg)
Generic network monitoring tools
• Sites have access to established programs
  – Ping, traceroute, tracepath, ganglia, iftop, cacti
• Organising host testing and port blocking can be troublesome
• Separate “off the shelf” hardware and monitoring
  – http://atlas.ripe.net
• perfSONAR toolkit
![Page 30: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/30.jpg)
Overview of perfSONAR in WLCG/OSG
• Goals:
  – Find and isolate “network” problems; alerting in time
  – Characterize network use (base-lining)
  – Provide a source of network metrics for higher level services
• Choice of a standard open source tool: perfSONAR
  – Benefiting from the R&E community consensus
• Tasks achieved:
  – Finalized core deployment and commissioned perfSONAR network
  – Monitoring in place to create a baseline of the current situation between sites
  – Developed test coverage and made it possible to run “on-demand” tests to quickly isolate problems and identify problematic links
Shawn McKee UoM
![Page 31: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/31.jpg)
Importance of Measuring Our Networks
• End-to-end network issues are difficult to spot and localize
  – Network problems are multi-domain, complicating the process
  – Standardizing on specific tools and methods allows groups to focus resources more effectively and better self-support
  – Performance issues involving the network are complicated by the number of components involved end-to-end.
• perfSONAR provides a number of standard metrics we can use
• Latency measurements provide one-way delays and packet loss metrics
  – Packet loss is almost always very bad for performance
• Bandwidth tests measure achievable throughput and track TCP retries (using iperf3)
  – Provides a baseline to watch for changes; identify bottlenecks
• Traceroute/Tracepath track network topology
  – All measurements are only useful when we know the exact path they are taking through the network.
  – Tracepath additionally measures MTU but is frequently blocked
Shawn McKee UoM
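Why packet loss is so damaging can be seen from the well-known Mathis et al. approximation for steady-state TCP throughput: rate ≈ (MSS / RTT) · C / √p, with C ≈ 1.22 and p the loss probability. This formula is not from the slides; it is a standard rule of thumb derived for Reno-style congestion control and does not model modern congestion control algorithms exactly, so the numbers are indicative only. The MSS of 1460 bytes and the RTT and loss values are illustrative assumptions.

```python
import math


def mathis_mbps(mss_bytes: int, rtt_ms: float, loss: float) -> float:
    """Approximate single-stream TCP throughput ceiling in Mbit/s (Mathis model)."""
    c = math.sqrt(3.0 / 2.0)  # ~1.22 for Reno-style congestion avoidance
    rate_bytes = (mss_bytes / (rtt_ms / 1000.0)) * c / math.sqrt(loss)
    return rate_bytes * 8 / 1e6


# Illustrative grid of round-trip times and loss rates (assumed values).
for rtt_ms in (1.0, 10.0, 100.0):
    for loss in (1e-6, 1e-4, 1e-2):
        print(f"RTT {rtt_ms:5.1f} ms, loss {loss:.0e}: "
              f"{mathis_mbps(1460, rtt_ms, loss):10.1f} Mbit/s")
```

Because the ceiling scales with 1/√p and 1/RTT, a loss rate that is harmless on a LAN can cap a long-distance stream at a small fraction of the link speed.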
![Page 32: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/32.jpg)
Current perfSONAR Deployment
246 Active perfSONAR instances
202 Running latest version (3.5+)
- 95 sonars in latency mesh
  - 8930 links measured at 10Hz
  - packet-loss, one-way latency, jitter, ttl, packet-reordering
- 115 sonars in traceroutes mesh
  - 13110 links
  - hourly traceroutes, path-mtu
- 102 sonars in bandwidth mesh
  - 10920 links (iperf3)
Shawn McKee UoM
https://www.google.com/fusiontables/DataSource?docid=1QT4r17HEufkvnqhJu24nIptZ66XauYEIBWWh5Kpa#map:id=3
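In a full mesh where every host tests towards every other host, the number of directed links is n(n-1). A quick check against the figures on this slide, assuming each "link" is a directed source-destination pair: 95 and 115 hosts give exactly the quoted 8930 and 13110 links, while 10920 corresponds to 105 hosts rather than 102, so the bandwidth figures may come from slightly different snapshots.

```python
def mesh_links(hosts: int) -> int:
    """Directed source->destination pairs in a full mesh of `hosts` nodes."""
    return hosts * (hosts - 1)


# Host counts as quoted on the slide for each mesh.
for name, hosts in (("latency", 95), ("traceroute", 115), ("bandwidth", 102)):
    print(f"{name}: {hosts} hosts -> {mesh_links(hosts)} directed links")
```

The quadratic growth is the practical point: each additional site adds tests against every existing member of the mesh.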
![Page 33: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/33.jpg)
Generic network monitoring tools
• Sites have access to established programs
  – Ping, traceroute, tracepath, ganglia, iftop, cacti
• Organising bi-directional host testing and port blocking can be troublesome
• Separate “off the shelf” hardware and monitoring
  – http://atlas.ripe.net
• perfSONAR toolkit
![Page 34: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/34.jpg)
Overview Dashboards
![Page 35: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/35.jpg)
Dedicated monitoring Tools for the TCP layer
![Page 36: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/36.jpg)
Central Service Monitoring
![Page 37: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/37.jpg)
Analysis of the results garners useful information
![Page 38: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/38.jpg)
Range of connections and rates on single host
![Page 39: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/39.jpg)
Comparison between hosts at a single site
![Page 40: Managing and monitoring large scale data transfers - Networkshop44](https://reader035.fdocuments.us/reader035/viewer/2022070515/587986111a28ab6c358b6635/html5/thumbnails/40.jpg)
Conclusions
• We have a lot of data to move (but successfully do so.)
  – In many workflows
• FTS is one method of doing it.
• Federated failover
  – Automatic retries at multiple levels help make problems transparent to the user
• Lots of monitoring to ensure both a high success rate of transfers and a high throughput both per file and overall.
  – Monitoring needs to be done at multiple levels
• Generic monitoring tools also useful.
• Thank You