The STAR Unified Meta-Scheduler (SUMS)
A front end around evolving technologies for user analysis and data production.
Jérôme Lauret, Gabriele Carcassi, Levente Hajdu, Efstratios Efstathiadis, Lidia Didenko, Valeri Fine, Iwona Sakrejda, Doug Olson
Sept 27th - Oct 1st 2004, Jérôme LAURET, RHIC-STAR/BNL
Outline
- Project overview: the STAR experiment, the problem, the solution
- Design and architecture: basic principles, building blocks, add-on (usage tracking), usage, Grid experience
- Schedulers: key features, MonALISA policy
- Contributions: GUI, dispatchers
- Future work & conclusion
Project overview
The STAR Experiment
The Solenoidal Tracker At RHIC (http://www.star.bnl.gov/) is an experiment located at BNL (USA): a collaboration of 546 people spanning 12 countries.
A PByte-scale experiment overall (raw, reconstructed events, simulation), with a large number of files (several million).
- Run4 alone (2003-2004) has produced 200 TB of raw data
- Expecting 200 TB of reconstructed data
- 40 TB of MuDST (1 pass); files copied to Tier 1 using SRM tools (see Track 4, 344)
A rich set of data analysis and simulation problems.
The problem
Ongoing analysis: past and new sets of data are constantly analyzed, with data spread over many locations, sites, and storage types; some of it sits on distributed disk local to each machine and is not easily accessible.
Evolving technologies: distributed computing (re)shapes itself as we make progress (Condor-G, portals, meta-schedulers, Web Services, Grid Services, …), and batch technologies themselves evolve.
Users have to adapt within a productive environment and an ever-growing scientific program. That may be fine for a new experiment, but not for a running one.
Solution
Allow users to pursue their scientific endeavor without disruption: make use of current/available resources and ensure the same productivity (subjective without a metric).
Develop a front end shielding the user from technology details and changes: a job-concept abstraction. Attract users to migrate to the new framework and the Grid => data management, file relocation => Catalog.
Design a tool/framework that allows for evolution: changing the underlying technology should NOT mean a change in the user's daily routine. The framework should allow for testing ideas and plugging in new components (Dispatchers for Local Resource Managers = LRMS), moving users to distributed computing with no extraneous knowledge required.
And so SUMS was born …
The project started in 2002, with a light developer team (on average ~1.0 FTE). Surrounding activities have enriched the project and spawned new activities and collaborations (monitoring, U-JDL, resource-brokering studies, …).
Historically a STAR project: design and prototype responsibility was taken by WSU, and the project was enhanced and brought to the user community (Gabriele Carcassi). Current development & design: Levente Hajdu.
Entirely written in Java: a portable, modular, class-based design, with project management, auto-documentation, …
Design / Architecture - Open
Basic principles
Users do NOT write shell scripts and submit series of tag=value pairs. Instead, they write an XML document, the U-JDL, following a prescribed schema and describing their "intent" to work on files, a DataSet, collections, etc. …
- They do not have to know where those files are located (LFNs or collections may be converted to PFNs)
- They do not have to handle the gory details of resource management (bsub -R …)
- They do not need to think about where their job will best fit; their input to SUMS is rate or range indications

% star-submit MyJob.xml
% star-submit-template -template MyTemplateJob.xml -entities jobname=test,year=2004
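For illustration, a minimal job description of this kind might look like the following (the macro name, output paths, and catalog query are example values taken from elsewhere in this talk, not prescriptions):

```xml
<?xml version="1.0" encoding="utf-8" ?>
<job maxFilesPerProcess="500">
  <command>root4star -q -b rootMacros/numberOfEventsList.C\(\"$FILELIST\"\)</command>
  <stdout URL="file:/star/u/xxx/scheduler/out/$JOBID.out" />
  <input URL="catalog:star.bnl.gov?production=P02gd,filetype=daq_reco_mudst"
         preferStorage="local" nFiles="all" />
  <output fromScratch="*.root" toURL="file:/star/u/xxx/scheduler/out/" />
</job>
```

The user states what to run and on which data set; SUMS resolves where the files are and how to submit.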
What it does …

The user writes a job description (test.xml):

<?xml version="1.0" encoding="utf-8" ?>
<job maxFilesPerProcess="500">
  <command>root4star -q -b rootMacros/numberOfEventsList.C\(\"$FILELIST\"\)</command>
  <stdout URL="file:/star/u/xxx/scheduler/out/$JOBID.out" />
  <input URL="catalog:star.bnl.gov?production=P02gd,filetype=daq_reco_mudst" preferStorage="local" nFiles="all" />
  <output fromScratch="*.root" toURL="file:/star/u/xxx/scheduler/out/" />
</job>

SUMS resolves the catalog query and wildcards into concrete files (/star/data09/reco/productionCentral/FullFie...), splits them according to the policy, and generates one file list and submission script per process: sched1043250413862_0.list / .csh, sched1043250413862_1.list / .csh, sched1043250413862_2.list / .csh. In short: user input … query/wildcard resolution … policy … dispatcher.
Architecture / building blocks
• Main boxes are Java classes
• The framework chooses the blocks to use depending on user options (% … -policy XXX)
• Interfaces between blocks are identical
• Implementations of the Policy class are the heart of SUMS (decision making, planning, resource brokering, …): extendable, adaptable
Job Initializer
The XML is validated and request objects are created …
Queues
The queue concept is "open":
- A queue can be an LRMS queue (PBS, LSF, SGE, …)
- A queue can be a pool or a DRMS (Condor, Condor-G, …)
- A Web or Grid Service … anything for which a dispatcher can be written

The object container is defined by, or defines:
- A name (which may be logical)
- An associated dispatcher (it has a pointer to a dispatcher object); e.g. the LSFDispatcher uses the logical name as the queue name
- Resource requirements: CPU-time limits, memory limits, the type of storage it can access, storage limits

Base rule: these can be undefined (-1), which a Policy must expect.
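As an illustration only, the queue container described above might be sketched in Java (the language SUMS is written in); the class and method names here are hypothetical, not the actual SUMS API:

```java
// Illustrative sketch of the SUMS queue container; names are hypothetical.
class Queue {
    public static final int UNDEFINED = -1; // base rule: limits may be undefined

    private final String name;     // logical name (e.g. an LSF queue name)
    private final int cpuLimitMin; // CPU-time limit in minutes, or UNDEFINED
    private final int memLimitMB;  // memory limit in MB, or UNDEFINED

    public Queue(String name, int cpuLimitMin, int memLimitMB) {
        this.name = name;
        this.cpuLimitMin = cpuLimitMin;
        this.memLimitMB = memLimitMB;
    }

    public String getName() { return name; }

    // A Policy must treat UNDEFINED limits as "no constraint".
    public boolean accepts(int cpuMin, int memMB) {
        return (cpuLimitMin == UNDEFINED || cpuMin <= cpuLimitMin)
            && (memLimitMB == UNDEFINED || memMB <= memLimitMB);
    }
}
```

A Policy can then ask each queue whether a job's requirements fit, without caring whether the queue is backed by LSF, PBS, or a Grid service.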
Policies
Policies integrate pre-defined queues (serialized XML as local configuration); a policy can make use of as many queues as necessary.
- Queues may have a type (LSF, PBS, Condor, …) and a scope (local, distributed, …), which allows SUMS to decide which one to take depending on the resource-brokering decision
- Queues can be given an initial weight (used, for example, for ordering if weight = priority) and a weight increment
- Complex policies may order queues as necessary (your choice); the default is to order by weight (priority)
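A minimal sketch of the default weight ordering, assuming a higher weight means a higher priority (the class and method names are invented for illustration, not the real SUMS Policy classes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch: order queue names by descending weight (priority).
class WeightPolicy {
    public static List<String> order(Map<String, Integer> weights) {
        List<String> names = new ArrayList<>(weights.keySet());
        // Higher weight first; ties keep no particular order.
        names.sort((a, b) -> Integer.compare(weights.get(b), weights.get(a)));
        return names;
    }
}
```

A more complex policy would reorder this list using runtime information instead of static weights.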
Policy note – job splitting
The <input> element can take several forms:
- Transition formats: PFN, PFN (wildcard)
  <input URL="file:/star/data15/reco/productionCentral/FullField/P02ge/2001/322/st_physics_2322006_raw_0016.MuDst.root" />
  <input URL="file:/star/data15/reco/productionCentral/FullField/P02ge/2001/*/*.MuDst.root" />
- Locally distributed PFN support
  <input URL="file://rcas6078.rcf.bnl.gov/home/starreco/reco/productionCentral/FullField/P02gd/2001/279/st_physics_2279005_raw_0285.MuDst.root" />
- List support
  <input URL="filelist:/star/u/user/username/filelists/mylist.list" />
- DataSet, MetaData support
  <input URL="catalog:star.bnl.gov?production=P02gd,filetype=daq_reco_mudst,storage=local" nFiles="2000" />
- … LFN support on the way …

Preferred STAR usage: map MetaData/Collections or LFNs to PFNs, then dispatch jobs. BUT THERE ARE TWO WAYS:
- PFNs are converted (URL syntax does not end up in the final lists; applications work as usual)
- Lists are formatted and passed to applications as URLs, and the applications need to sort the URLs out. Example: rootd-style URLs are passed as-is.
Dispatchers
The high-level dispatcher redirects to: PBS, LSF, SGE, Condor, Condor-G, BOSS, …
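The plug-in idea can be sketched as one interface with one implementation per LRMS/DRMS. The interface and method names below are assumptions for illustration, not the actual SUMS classes, and real bsub/qsub invocations carry more site-specific options:

```java
// Illustrative sketch of the dispatcher plug-in layer; names are hypothetical.
interface Dispatcher {
    // Build the submission command for one generated job script.
    String submitCommand(String script);
}

// LSF: submit with bsub, using the queue's logical name as the LSF queue name.
class LSFDispatcher implements Dispatcher {
    private final String queue;
    LSFDispatcher(String queue) { this.queue = queue; }
    public String submitCommand(String script) {
        return "bsub -q " + queue + " " + script;
    }
}

// PBS: submit with qsub.
class PBSDispatcher implements Dispatcher {
    public String submitCommand(String script) {
        return "qsub " + script;
    }
}
```

Supporting a new batch system then means writing one more Dispatcher implementation, without touching the Policy code.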
Add-on – usage monitoring
We needed usage feedback: monitoring users' usage allows for a better-targeted tool.
- Focus can be put on the most used/preferred features (CS fantasy trimmed down)
- It serves the user community better: eliminates divergence and re-focuses effort; practicality first, SciFi later …
- It ensures equity of usage
- It helps re-focus tutorials & documentation

JSP based (Tomcat) with a MySQL back-end; all options and usage are recorded.
Example of useful information …
We implemented two ways of accessing locally distributed files: is it used?
We added the SGE dispatcher a few weeks ago …
Which storage type is most used … may very well be a $$ / accessibility question.
Example II-a
Submission rates at BNL and PDSF: 4500 jobs/day, with peaks at 20k.
Example II-b
The pessimistic graph is an integral count over time; it shows that after first usage, users keep using SUMS …
NB: the drop from the beginning of the summer reflects vacation time, conference time, and a lack of new data (not the best period for a SUMS commercial, but informative nonetheless).
See more statistics at http://www.star.bnl.gov/STAR/comp/Grid/scheduler/
Physicist usage
As far as we know, 85% of active users are using SUMS. A selection of publications confirmed as 100% SUMS-based analysis:
- J. Gonzales, "Pseudorapidity Asymmetry and Centrality Dependence of Charged Hadron Spectra in d+Au Collisions at sqrt(s_NN) = 200 GeV", nucl-ex/0408016 (submitted to PRC)
- L. S. Barnby, QM 2004 proceedings, J. Phys. G: Nucl. Part. Phys. 30 S1121-S1124
- T. Henry, "Full jet reconstruction in d+Au and p+p collisions at RHIC", J. Phys. G: Nucl. Part. Phys. 30 (8) S1287
- J. S. Lange, "Review of search for heavy flavor (c, b quarks) production in leptonic decay channels in Au+Au collisions at sqrt(s_NN) = 200 GeV at the STAR Experiment at RHIC", proceedings, 19th Winter Workshop on Nuclear Dynamics (2003), nucl-ex/0306005
- A. Tang, "Anisotropy at RHIC: the first and the fourth harmonic" …
http://www.star.bnl.gov/central/publications/ (7 papers / analyses submitted in the past 3 months)
Grid experience
Use of SUMS for Grid job submission is possible, modulo RSL extensions:
- <input> and <output> tags MUST specify paths as relative paths ("bla.root", "blop/test.dat", …)
- The <output> attributes fromScratch / toURL are designed to bring the files back (globus-url-copy)

The Grid experience has been a challenge: cryptic messages everywhere. We had a problem with globus error 74 and no clue what it was for months (no Grid help-desk, no knowledge-base index); it turned out to be a firewall issue causing bursts of massive job deaths.

Nonetheless, 1/4 of the Run4 simulation production was made on the Grid: 100,000 events generated, analysis ongoing. Success rate: 85% when all goes well, 60% when lots of jobs are submitted (the issue above).

We plan to run on a larger-scale platform, Grid3+ and/or OSG-0, with (hopefully) better ways to track errors/problems.
Schedulers
Schedulers
Can a user front end to other LRMS/DRMS be called a "scheduler"? Is using local resources within the same paradigm as globally distributed resources?

             Traditional (LRMS)                  Distributed (DRMS)
Job          Mostly serialized                   Possibly following a workflow
Data         File based                          Data sets, collections, …
Scheduling   One LRMS used                       Many; issues are consistency, QoS,
                                                 unified information (from/to)
AAA          Handled by the LRMS                 VO based; ownership is itself an issue
Resources    Dedicated or managed by local       Common; no global policies, but
             policy (priority, usage             agreements or statements of
             throttle, …)                        understanding
Schedulers
Key features for a scheduler:
- Keep global accounting
- Scheduling decisions may be based on: resource availability, respect of local policies, fairshare (cluster autonomy); advance reservation, best use of resources; network and data cache, data availability …
- Job migration: moving jobs to/from a trusted cluster
- Spanning and workflow
- Human-readable messages …

The scheduling algorithm can be complex:
- Attempts to predict ("weather services") have proven difficult
- Dedicated global accounting and standard messages are possible
- A mix of LRMS and DRMS capabilities (user autonomy) is not common
- A complex algorithm takes so many parameters into account …

Empirical approach: inspect queue behavior, send jobs, see how the queue reacts … re-adjust. A self-sustained system that adapts to network/resource/load changes??
Empirical approach (?)
- Information is fed by agents to MonALISA (ML)
- The information is recovered by a SUMS module (the monitoring policy; LSF in this setup)
- Scheduling decisions are made based on load and "queue" or "pool" response time
- A self-sustained system: no need for percentage-based submission branching
- Hopefully no need for a complex algorithm
- Responds as resources, priorities, and bandwidth adjust

Results / details in Efstratios Efstathiadis' presentation, Track 4 - 393
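The decision step above might be sketched as follows. This is a deliberately naive illustration with invented names; the real monitoring policy, and the MonALISA data it consumes, are described in the referenced presentation:

```java
import java.util.Map;

// Naive sketch: route the next job to the queue/pool with the shortest
// measured response time, as reported by monitoring agents.
class ResponseTimePolicy {
    // responseTimes: queue or pool name -> measured response time in seconds.
    public static String pick(Map<String, Double> responseTimes) {
        String best = null;
        for (Map.Entry<String, Double> e : responseTimes.entrySet()) {
            if (best == null || e.getValue() < responseTimes.get(best)) {
                best = e.getKey();
            }
        }
        return best; // no fixed percentage-based submission branching needed
    }
}
```

As a site becomes loaded, its response time grows and new jobs drift to the more responsive resource, which is what makes the system self-sustaining.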
Contributions
The RHIC/PHENIX collaboration has tested and is using SUMS. Contributions included the addition of dispatchers (PBS, BOSS) - Andrey Shevel. Development includes the creation of a GUI front end for end-users - Mike Reuter.

Job tracking and monitoring: SUMS allows for dispatching to ANY queue, so BOSS (from CMS) is a possible solution as "a" dispatcher, with BODE for tracking. Implemented / contributed by Andrey Shevel (PHENIX/SUNY-SB) - Track 5, 86.
Future work
High-level user JDL work started with a document on RDL (PPDG-39).
Motivation: the current U-JDL is simple enough but has its limitations:
- Extension to new resource requirements is possible but inelegant
- The U-JDL considers most (but not all) data sets
- It lacks the concepts of tasks and sandboxes
- Only AND (sequential) workflow diagrams are implemented (need OR, conditional branching, etc.)

SBIR with Tech-X (David Alexander). Deliverables:
- An enhanced and complete U-JDL (AJHDL)
- A WSDL for creating a Grid Service

We reviewed most available high-level JDLs:
- Job Submission Description Language (JSDL) (GGF)
- Analysis Job Description Language (AJDL) (ATLAS)
- User Request Description Language (URDL) (PPDG-39 / JLab/STAR)
- Job Description Language (JDL) (DataGrid)
- Job Description Language (JDL) (JLab)
- …
Future work
We promised our users that the U-JDL will not change: for what they know, it won't (XSLT, schema transformation), but the ones using AJHDL will have access to more features.
We are working on job tracking, and on the concept of a Meta-Log (application-level monitoring), which seems to be forgotten (Valeri Fine - Poster, 480).
Conclusions
SUMS is NOT:
- a batch system
- a toy (real needs, real use, real Physics)

SUMS is:
- A front end to local and distributed RMS, acting as a client to multiple, heterogeneous RMS
- A flexible, open, object-oriented framework with plug-and-play features
- A good environment for further developing standards (such as a high-level JDL) and the scalability of other components (ML work, immediate use)
- Used in STAR for real Physics (usage and publication list)
- Used for distributed / Grid simulation job submission
- Used successfully by other experiments
- A means to transition active users to distributed computing and recover under-used resources …