DDM-Panda Issues Kaushik De University of Texas At Arlington DDM Workshop, BNL September 29, 2006.

DDM-Panda Issues

Kaushik De

University of Texas At Arlington

DDM Workshop, BNL

September 29, 2006

First – a Reminder

Panda was designed to minimally depend on robustness of external middleware

This does not apply to DDM – Panda fully depends on and takes advantage of all DQ2 capabilities

Panda was the first ATLAS executor to use DQ2 for production – 6 months before the LCG (still not fully done)

As you saw from Torre’s talk – Panda subscribes to thousands of datasets weekly, and BNL catalog holds more than a million records – leader in DQ2 deployment and use

Panda-DDM is often used as an example of success in ATLAS – keep this up!

Some Open Issues

Deployment – need to do better (we have 19 installations)
    You have already heard from many speakers – no more to say

Data and catalog consistency and cleanup – next few slides

Use output file callback in Panda

Performance and monitoring issues

Alexei reminded us – also need script to clean up obsolete datasets

Data Transfer Robustness

Need more robustness in DQ2 to recover from failures
    Examples in Alexei’s talk (but Panda usually had fewer problems)
    Important Panda issue: DQ2 should never give up on subscriptions
    But don’t kill site services because of retries – tricky balancing act!
    Force (email) human intervention if impossible to transfer file
    … this is normal hardening process – will continue

In the meantime, production must continue
    Need to increase production rate by factor of 10 by summer 2007
    In addition, there will always be some unavoidable error conditions
    We also need to do site cleanup of SE (cache turnover)
    Also, delete old temporary Panda datasets (safely): cron script
    So – some post-DQ2 cleanup will always be necessary
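
The balancing act on this slide (never abandon a subscription, do not overload site services with retries, alert a person when a transfer looks stuck) can be sketched as a capped exponential backoff. This is an illustration only, not DQ2 code: RetryPolicy, retry_until_done, the delay values, and the transfer and notify callables are hypothetical names and numbers chosen for the example.

```python
import time
from dataclasses import dataclass


@dataclass
class RetryPolicy:
    base_delay_s: float = 60.0    # assumed: first retry after one minute
    max_delay_s: float = 3600.0   # assumed cap, so retries never flood a site
    alert_after: int = 10         # assumed: failed attempts before alerting a human

    def delay_for(self, attempt: int) -> float:
        """Wait before the next try: doubles per failed attempt, up to the cap."""
        exponent = min(attempt - 1, 30)  # bound the power to keep numbers small
        return min(self.base_delay_s * 2 ** exponent, self.max_delay_s)


def retry_until_done(transfer, notify, policy, sleep=time.sleep, max_attempts=None):
    """Call transfer() until it returns True.

    With max_attempts=None the loop never gives up; the limit exists only so
    that a demonstration can terminate. notify() stands in for the email alert.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        if transfer():
            return attempt
        if attempt == policy.alert_after:
            notify(f"transfer still failing after {attempt} attempts")
        sleep(policy.delay_for(attempt))
    return None
```

The alert fires once, at the threshold, while retries continue at the capped interval, so a human is pulled in without the subscription being dropped.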

Proposal for DDM cleanup

Check and repair consistency of site catalogs
    Script 1: re-register in local LRC all files found on local SE that are registered in DQ2 central catalog, but not at BNL T1
        Marco is working on this script, based on scripts written by Patrick and Wensheng – need to run as cron at every site when stable

Script 2: move old missing files to BNL periodically
    Cron run by Wensheng – need to define ‘old’

Script 3: safely clean up SE space when getting full
    Wensheng’s script works well – sites should take over running it

Keep log of all post-DQ2 repairs – feed back to developers so that DQ2 can improve based on real experience (feed back into monitoring?)
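
The selection rule behind Script 1 is essentially set arithmetic, which a minimal sketch can make concrete. This is not Marco's script: the function name and the four inputs (collections of file identifiers gathered beforehand from the SE listing, the DQ2 central catalog, the BNL T1 catalog, and the local LRC) are hypothetical. It also assumes that "re-register" means the file is currently missing from the local LRC, which the slide implies but does not state.

```python
def files_to_reregister(on_local_se, in_central_catalog, at_bnl_t1, in_local_lrc):
    """Return identifiers that should be re-registered in the local LRC.

    A file qualifies when it is physically on the local SE, is known to the
    DQ2 central catalog, is not at BNL T1, and (assumption) is not already
    present in the local LRC.
    """
    candidates = (set(on_local_se) & set(in_central_catalog)) - set(at_bnl_t1)
    return sorted(candidates - set(in_local_lrc))


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    todo = files_to_reregister(
        on_local_se=["a", "b", "c", "d"],
        in_central_catalog=["a", "b", "c", "x"],
        at_bnl_t1=["c"],
        in_local_lrc=["a"],
    )
    print(todo)
```

Writing each returned identifier to a file before acting on it would also provide the repair log requested on this slide.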

Site Responsibilities

Sites, sites, sites!
    Important difference between OSG sites and LCG sites – site managers have always been proactive within U.S. DDM
    Probably reflected in T1/T2 test results!
    Need to keep this up – sites should check DDM monitor daily

http://panda.atlascomp.org/?dash=prod&redirect=pandamon

    Site is responsible for maintaining local storage element and keeping various services up and running
    Sites should protect data in storage elements
    Some of our recent DDM problems have been site specific – need DDM operations to help (and often fix mistakes)

Output Callbacks

Converging on a solution – latest proposal by Torre
    Add new Panda job state – ‘transferring’
    Enable callback for output subscription blocks
    Panda will change ‘transferring’ -> ‘finished’ when callback received

Pros:
    Better tracking of output file transfers through Panda
    Production team can identify and report problems

Cons:
    Jobs may remain unfinished, even though file is available at T2 (physicist can get file through DQ2 – but job not in finished state)
    Panda queue may grow very large
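
The proposed state change can be sketched as a small state machine. This is a rough illustration under stated assumptions, not Panda code: the Job class, its method names, the 'running' state, and the idea that one job may wait on several output blocks are all hypothetical. Only the 'transferring' and 'finished' states come from the proposal.

```python
class Job:
    """Toy model of a job waiting for its output blocks to be transferred."""

    def __init__(self, job_id, output_blocks):
        self.job_id = job_id
        self.pending_blocks = set(output_blocks)  # blocks with no callback yet
        self.state = "running"                    # assumed pre-transfer state

    def payload_done(self):
        """Payload finished: wait for transfers unless nothing is pending."""
        self.state = "transferring" if self.pending_blocks else "finished"

    def on_callback(self, block):
        """Callback for one output subscription block."""
        self.pending_blocks.discard(block)
        if self.state == "transferring" and not self.pending_blocks:
            self.state = "finished"


def stuck_in_transferring(jobs):
    """IDs of jobs still waiting on callbacks, for the production team to check."""
    return [job.job_id for job in jobs if job.state == "transferring"]
```

The listed drawback is visible here: a job whose callback never arrives stays in 'transferring' indefinitely, which is why a report such as stuck_in_transferring would be needed alongside the new state.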

Live Examples

http://panda.atlascomp.org/?dash=prod&reload=yes