PanDA Status Report Kaushik De Univ. of Texas at Arlington ANSE Meeting, Nashville May 13, 2014.
PanDA Status Report
Kaushik De, Univ. of Texas at Arlington
ANSE Meeting, Nashville, May 13, 2014
Overview
- We are nearing the end of the ANSE project (~6 months remain)
  - Review goals/scope of PanDA work in ANSE
  - Assess progress so far
    - PanDA work started ~1 year ago
  - Plans for completion of current work
  - Plans for new work
    - Discuss tomorrow
- Synergy with other projects
  - Artem is co-funded by the DOE-ASCR BigPanDA project
  - BigPanDA continues for ~9 months after ANSE ends
  - What happens after 2015?
May 13, 2014, Kaushik De
PanDA Goals
- Explicit integration of networking with PanDA
  - Never before attempted for any workload management system (WMS)
  - PanDA has many implicit assumptions about networking
  - Goal 1: Use network information directly in PanDA workflow
  - Goal 2: Attempt direct control (provisioning) through PanDA
- ANSE + DOE-ASCR
  - Picked a few well-defined topics
  - Set up infrastructure and interactions with other projects
  - Develop and deploy software
  - Evaluation metrics
- Deliver new capabilities for LHC experiments
  - This is not only R&D: the work is used in the production environment
PanDA Steps
- Collect network information
- Storage and access
- Using network information
- Using dynamic circuits
Sources of Network Information
- DDM Sonar measurements
  - Actual transfer rates for files between all sites (Tier 1 and Tier 2)
  - This information is normally used for site white/blacklisting
  - Measurements available for small, medium, and large files
- perfSonar (PS) measurements
  - perfSonar provides dedicated network monitoring data
  - All WLCG sites are being instrumented with PS boxes
  - US sites are already instrumented and monitored
- Federated XRootD (FAX) measurements
  - Read times of remote files are measured for pairs of sites
- This is not an exclusive list, just a starting point
http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteview#currentView=Sonar&highlight=false
DDM Sonar
perfSonar
FAX
Data Repositories
- Three levels of data storage and access
- Native data repositories
  - Historical data stored from collectors
  - SSB: site status board for sonar and perfSonar data
  - FAX data is kept independently and uploaded
- AGIS (ATLAS Grid Information System)
  - Most recent / processed data only, updated periodically
  - Mixture of push/pull, moving to JSON API (pushed only)
- schedConfigDB
  - Internal Oracle DB used by PanDA for fast access
  - Uses standard ATLAS collector
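The step from a periodically pushed JSON feed to a fast-access lookup can be sketched as follows. This is only an illustration under stated assumptions: the field names (`src`, `dst`, `metric`, `value`, `ts`), the site names, and the function `latest_by_link` are hypothetical and do not reflect the actual AGIS JSON schema or the schedConfigDB layout.

```python
import json

# Hypothetical payload: field names and site names are illustrative only,
# not the real AGIS JSON schema.
SAMPLE = """[
  {"src": "SITE_A", "dst": "SITE_B", "metric": "sonar_large", "value": 42.0, "ts": 100},
  {"src": "SITE_A", "dst": "SITE_B", "metric": "sonar_large", "value": 55.0, "ts": 200},
  {"src": "SITE_A", "dst": "SITE_C", "metric": "fax_read", "value": 12.5, "ts": 150}
]"""


def latest_by_link(payload):
    """Keep only the most recent value per (src, dst, metric) key."""
    latest = {}
    for rec in json.loads(payload):
        key = (rec["src"], rec["dst"], rec["metric"])
        # Replace the stored record only if this one is newer.
        if key not in latest or rec["ts"] > latest[key]["ts"]:
            latest[key] = rec
    return {key: rec["value"] for key, rec in latest.items()}


# In-memory stand-in for a fast-access table of recent measurements.
cache = latest_by_link(SAMPLE)
```

The historical records stay in the native repositories; only the newest value per site pair and metric is kept where brokerage needs quick access.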
Using Network Information
- Pick a few use cases
  - Important to PanDA users
  - Enhance workload management through use of network
  - Should provide clear metrics for success/failure
- Case 1: Improve User Analysis workflow
- Case 2: Improve Tier 1 to Tier 2 workflow
Improving User Analysis
- In PanDA, user jobs go to data
  - Typically, user jobs are IO intensive, hence jobs are constrained to data
  - Note: almost any user payload is allowed by PanDA
  - User analysis jobs are routed automatically to T1/T2 sites
- For popular data, bottlenecks develop
  - If data is only at a few sites, user jobs have long wait times
  - PD2P was implemented 3 years ago to solve this problem
    - Additional copies are made asynchronously by PanDA
    - Waiting jobs are automatically re-brokered to new sites
  - But bottlenecks still take time to clear up
- Can we do something else using network information?
  - Why not use FAX?
  - First we need to develop network metrics for efficient use of FAX
Faster User analysis through FAX
- First use case for network integration with PanDA
- PanDA brokerage will use concept of ‘nearby’ sites
  - Calculate weight based on usual brokerage criteria (availability of CPU, release, pilot rate…)
  - Add network transfer cost to brokerage weight
- Jobs will be sent to the site with best weight, not necessarily the site with local data
  - If nearby site has less wait time, access the data through FAX
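A minimal sketch of this brokerage idea is given here. The slides do not state how the transfer cost enters the weight, so the subtraction below, the `cost_factor` parameter, the `Site` fields, and the numbers are all assumptions made for illustration; this is not PanDA's brokerage code.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    base_weight: float     # stands in for the usual criteria (CPU, release, pilot rate)
    has_local_data: bool
    transfer_cost: float   # hypothetical normalized network cost to reach the data


def network_aware_weight(site, cost_factor=1.0):
    """Assumed form: base weight minus a network penalty for remote access."""
    cost = 0.0 if site.has_local_data else site.transfer_cost
    return site.base_weight - cost_factor * cost


def choose_site(sites, cost_factor=1.0):
    """Pick the site with the best combined weight."""
    return max(sites, key=lambda s: network_aware_weight(s, cost_factor))


sites = [
    Site("LOCAL", base_weight=0.2, has_local_data=True, transfer_cost=0.0),   # congested
    Site("NEARBY", base_weight=0.9, has_local_data=False, transfer_cost=0.3),
    Site("FAR", base_weight=0.95, has_local_data=False, transfer_cost=0.9),
]
best = choose_site(sites)
```

With these made-up numbers the congested site holding the data loses to the nearby site, whose job would then read its input remotely through FAX. Raising `cost_factor` pushes the choice back toward the site with local data.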
First Tests
- Tested in production for ~1 day in March 2014
  - Useful for debugging and tuning direct access infrastructure
  - We got first results on network-aware brokerage
- Job distribution
  - 4748 jobs from 20 user tasks, which required data from a congested U.S. Tier 1 site, were automatically brokered to U.S. Tier 1/2 sites
[Figure: Number of Jobs per Task]
Brokerage Results
[Figure: FAX/non-FAX Ratio. Number of local jobs and number of remote jobs (logarithmic axis) vs. task number]
[Figure: Job Wait Times. Local and remote job wait times vs. task number]
Conclusions for Case 1
- Network data collection working well
  - Additional algorithms to combine network data will be tried
  - HC tests working well, but perfSonar data not robust yet
- PanDA brokerage worked well
  - Achieved goal of reducing wait time
  - Well-balanced local vs. remote access
  - Will fine-tune after more data on performance
- Waiting for final implementation
  - But we have no data on actual performance of successful jobs
  - Need to test and validate sites for this mode of data access
  - First tests in March had 100% failure rate (FAX deployment related)
  - Second test 1 week ago also did not go well
  - Expect third test soon
Managing Data Rates
- Tests have shown direct access rates need to be managed
- Parameters for WAN throttling implemented in PanDA
  - Throttling at brokerage level is easy (e.g. ratio of FAX jobs to non-FAX jobs), but does not guarantee throttling during execution
  - Throttling during dispatch is not scalable when a million jobs are dispatched daily (scale may be higher in the future)
  - Throttling may also be done at pilot level
- PanDA has implemented a mixed approach to throttling, being tested now
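The brokerage-level option can be sketched as a simple ratio cap. The function names, the `max_ratio` parameter, and the rule that no remote job is brokered before any local job exists are assumptions for illustration, not the parameters PanDA actually implements.

```python
def allow_remote(n_remote, n_local, max_ratio):
    """True if one more remote (FAX) job keeps remote/local at or below max_ratio."""
    if n_local == 0:
        # Assumption: do not broker remote jobs before any local job exists.
        return False
    return (n_remote + 1) / n_local <= max_ratio


def broker(jobs_prefer_remote, max_ratio):
    """Decide 'remote' or 'local' per job while enforcing the ratio cap."""
    n_remote = 0
    n_local = 0
    decisions = []
    for prefers_remote in jobs_prefer_remote:
        if prefers_remote and allow_remote(n_remote, n_local, max_ratio):
            n_remote += 1
            decisions.append("remote")
        else:
            n_local += 1
            decisions.append("local")
    return decisions


# Six jobs that would all prefer remote access, capped at one remote per two local.
decisions = broker([True] * 6, max_ratio=0.5)
```

The cap only shapes how jobs are assigned. As the slide notes, it says nothing about how many of those jobs run, and read over the WAN, at the same moment.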
Cloud Selection
- Second use case for network integration with PanDA
  - Optimize choice of T1-T2 pairings (cloud selection)
- In ATLAS, production tasks are assigned to Tier 1s
  - Tier 2s are attached to a Tier 1 cloud for data processing
  - Any T2 may be attached to multiple T1s
  - Currently, operations team makes this assignment manually
  - This could/should be automated using network information
  - For example, each T2 could be assigned to a native cloud by operations team, and PanDA will assign to other clouds based on network performance metrics
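The example in the last bullet might look like the sketch below. The selection rule (a minimum rate threshold plus a cap on extra clouds), the parameter names, and the site names and rates are assumptions for illustration; the slides do not specify the actual algorithm or metrics.

```python
def assign_clouds(native, rates, min_rate, max_extra=2):
    """Attach each T2 to its native T1 plus the best-connected other T1s.

    native:   {t2_name: native_t1_name}, set by the operations team
    rates:    {(t1_name, t2_name): measured transfer rate}, hypothetical units
    min_rate: assumed threshold below which a pairing is not considered
    """
    result = {}
    for t2, home in native.items():
        candidates = [
            (rate, t1)
            for (t1, dst), rate in rates.items()
            if dst == t2 and t1 != home and rate >= min_rate
        ]
        candidates.sort(reverse=True)  # fastest links first
        result[t2] = [home] + [t1 for _, t1 in candidates[:max_extra]]
    return result


native = {"T2_X": "T1_A"}
rates = {
    ("T1_A", "T2_X"): 50.0,
    ("T1_B", "T2_X"): 80.0,
    ("T1_C", "T2_X"): 5.0,
    ("T1_D", "T2_X"): 30.0,
}
clouds = assign_clouds(native, rates, min_rate=20.0)
```

The native assignment always stays first in the list, so the manual choice made by operations is preserved and the network metrics only add further associations.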
DDM Sonar Data
http://aipanda021.cern.ch/networking/t1tot2d_matrix/
Tier 1 View
More T1 Information
Tier 2 View
Improving Site Association
More T2 Information
Conclusion for Case 2
- Working well in real time
- Currently implementing archival information
  - Keep data for last ‘n’ Tier 1 to Tier 2 associations
  - Necessary to check robustness of approach
  - Algorithm may use the historical information in the future
- Expect to deploy this summer
  - Hopefully ~1 month
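One way to keep the last ‘n’ associations and check their robustness is a bounded history, sketched below. The class `AssociationHistory`, its methods, and the stability measure are hypothetical and are not taken from the PanDA implementation.

```python
from collections import deque


class AssociationHistory:
    """Bounded archive of the last n Tier 1 to Tier 2 association snapshots."""

    def __init__(self, n):
        # deque with maxlen discards the oldest snapshot automatically.
        self._history = deque(maxlen=n)

    def record(self, associations):
        """Store one snapshot of the form {t2_name: [t1_name, ...]}."""
        self._history.append({t2: list(t1s) for t2, t1s in associations.items()})

    def stability(self, t2, t1):
        """Fraction of stored snapshots in which t2 was attached to t1."""
        if not self._history:
            return 0.0
        hits = sum(1 for snap in self._history if t1 in snap.get(t2, []))
        return hits / len(self._history)


history = AssociationHistory(n=3)
history.record({"T2_X": ["T1_A", "T1_B"]})   # dropped once a fourth snapshot arrives
history.record({"T2_X": ["T1_A"]})
history.record({"T2_X": ["T1_A", "T1_B"]})
history.record({"T2_X": ["T1_A", "T1_B"]})
```

A pairing that appears in most of the retained snapshots is stable, while one that flips in and out suggests noisy measurements or a threshold that needs tuning.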
Summary
- First 2 use cases for network integration with PanDA working well
- Work will be completed this summer
- Metrics showing usefulness of approach will be available in Fall
- On track for timely final report to ANSE