DDM-Panda Issues Kaushik De University of Texas At Arlington DDM Workshop, BNL September 29, 2006.
DDM-Panda Issues
Kaushik De
University of Texas at Arlington
DDM Workshop, BNL
September 29, 2006
First – a Reminder
Panda was designed to minimally depend on robustness of external middleware
This does not apply to DDM: Panda fully depends on and takes advantage of all DQ2 capabilities
Panda was the first ATLAS executor to use DQ2 for production, 6 months before the LCG (still not fully done)
As you saw from Torre’s talk, Panda subscribes to thousands of datasets weekly, and BNL catalog holds more than a million records: leader in DQ2 deployment and use
Panda-DDM is often used as an example of success in ATLAS. Keep this up!
Some Open Issues
Deployment: need to do better (we have 19 installations)
  You have already heard from many speakers, no more to say
Data and catalog consistency and cleanup: next few slides
Use output file callback in Panda
Performance and monitoring issues
Alexei reminded us: we also need a script to clean up obsolete datasets
Data Transfer Robustness
Need more robustness in DQ2 to recover from failures
  Examples in Alexei’s talk (but Panda usually had fewer problems)
  Important Panda issue: DQ2 should never give up on subscriptions
  But don’t kill site services because of retries: tricky balancing act!
  Force (email) human intervention if impossible to transfer file
  … this is normal hardening process, and it will continue
In the meantime, production must continue
  Need to increase production rate by factor of 10 by summer 2007
  In addition, there will always be some unavoidable error conditions
  We also need to do site cleanup of SE (cache turnover)
  Also, delete old temporary Panda datasets (safely): cron script
  So some post-DQ2 cleanup will always be necessary
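The balancing act described on this slide (never give up on a subscription, do not overload site services with retries, and email a human when a file cannot be transferred) can be sketched roughly as below. This is an illustration only, not DQ2 code: `backoff_delays`, `Subscription`, `record_failure`, the `notify` callback and the threshold of five failures are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


def backoff_delays(base: float, cap: float, attempts: int) -> List[float]:
    """Delay before each retry: doubles every time but never exceeds `cap`,
    so retries continue indefinitely at a rate the site services can absorb."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]


@dataclass
class Subscription:
    # Hypothetical stand-in for a dataset subscription record.
    dataset: str
    failures: int = 0
    escalated: bool = False


def record_failure(sub: Subscription,
                   notify: Callable[[str], None],
                   escalate_after: int = 5) -> None:
    """Count a failed transfer attempt.

    The subscription is never dropped. After `escalate_after` failures a
    human is notified exactly once (for example by email) via `notify`.
    """
    sub.failures += 1
    if sub.failures >= escalate_after and not sub.escalated:
        notify(f"Transfer of {sub.dataset} failed {sub.failures} times")
        sub.escalated = True
```

With `base=60` and `cap=3600` the delays grow from one minute up to a ceiling of one hour, after which retries continue at that fixed interval.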
Proposal for DDM cleanup
Check and repair consistency of site catalogs
  Script 1: re-register in local LRC all files found on local SE that are registered in DQ2 central catalog, but not at BNL T1
    Marco is working on this script, based on scripts written by Patrick and Wensheng; need to run as cron at every site when stable
Script 2: move old missing files to BNL periodically
  Cron run by Wensheng; need to define ‘old’
Script 3: safely clean up SE space when getting full
  Wensheng’s script works well; sites should take over running it
Keep log of all post-DQ2 repairs and feed back to developers so that DQ2 can improve based on real experience (feed back into monitoring?)
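The selection logic of Script 1, together with the repair log, could look roughly like the sketch below. It is not the script Marco is writing. It assumes each catalog can be dumped as a plain list of logical file names, it reads "not at BNL T1" as "no replica at BNL", and it skips files already present in the local LRC. The function names and the log record fields are hypothetical.

```python
from typing import Dict, Iterable, List


def files_to_reregister(on_se: Iterable[str],
                        in_lrc: Iterable[str],
                        in_dq2: Iterable[str],
                        at_bnl: Iterable[str]) -> List[str]:
    """Files physically on the local SE and known to the DQ2 central
    catalog, but with no copy at BNL T1.

    Files already registered in the local LRC are skipped (an assumption:
    they need no repair).
    """
    candidates = (set(on_se) & set(in_dq2)) - set(at_bnl)
    return sorted(candidates - set(in_lrc))


def repair_log(site: str, lfns: Iterable[str]) -> List[Dict[str, str]]:
    """One record per repair, suitable for feeding back to DQ2 developers."""
    return [{"site": site, "lfn": lfn, "action": "re-register"} for lfn in lfns]
```

Returning a sorted list keeps successive cron runs comparable, so the log shows whether the same files keep needing repair.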
Site Responsibilities
Sites, sites, sites!
  Important difference between OSG sites and LCG sites: site managers have always been proactive within U.S. DDM
  Probably reflected in T1/T2 test results!
  Need to keep this up: sites should check DDM monitor daily
    http://panda.atlascomp.org/?dash=prod&redirect=pandamon
  Site is responsible for maintaining local storage element and keeping various services up and running
  Sites should protect data in storage elements
  Some of our recent DDM problems have been site specific: need DDM operations to help (and often fix mistakes)
Output Callbacks
Converging on a solution: latest proposal by Torre
  Add new Panda job state - ‘transferring’
  Enable callback for output subscription blocks
  Panda will change ‘transferring’ -> ‘finished’ when callback received
Pros:
  Better tracking of output file transfers through Panda
  Production team can identify and report problems
Cons:
  Jobs may remain un-finished, even though file is available at T2 (physicist can get file through DQ2, but job not in finished state)
  Panda queue may grow very large
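The proposed state change can be pictured with a small state machine. This is a sketch under assumptions, not Panda's implementation: the class and method names are hypothetical, the initial ‘running’ state is assumed, and a job is taken to finish only once a callback has arrived for every output block.

```python
from typing import Iterable, List


class PandaJob:
    """Toy model of a job that waits for output-transfer callbacks."""

    def __init__(self, job_id: int, output_blocks: Iterable[str]) -> None:
        self.job_id = job_id
        self.state = "running"            # assumed initial state
        self.pending = set(output_blocks)  # blocks still awaiting a callback

    def payload_done(self) -> None:
        """Payload finished; wait for transfers if any output is pending."""
        self.state = "transferring" if self.pending else "finished"

    def on_callback(self, block: str) -> None:
        """Callback for one output block; finish when none remain."""
        self.pending.discard(block)
        if self.state == "transferring" and not self.pending:
            self.state = "finished"


def still_transferring(jobs: Iterable[PandaJob]) -> List[int]:
    """IDs of jobs stuck in 'transferring', for the production team to check."""
    return [job.job_id for job in jobs if job.state == "transferring"]
```

A helper like `still_transferring` also makes the first listed drawback visible: a job whose callback never arrives stays in the list even if its file is already at the T2.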
Live Examples
http://panda.atlascomp.org/?dash=prod&reload=yes