The STAR Unified Meta-Scheduler (SUMS)
A front end around evolving technologies for user analysis and data production.
Jérôme Lauret, Gabriele Carcassi, Levente Hajdu, Efstratios Efstathiadis, Lidia Didenko, Valeri Fine, Iwona Sakrejda, Doug Olson
Sept 27th - Oct 1st 2004, Jérôme LAURET, RHIC-STAR/BNL
Outline
- Project overview: the STAR experiment, the problem, the solution
- Design and architecture: basic principles, building blocks, add-on (usage tracking), usage, Grid experience
- Schedulers: key features, MonALISA policy
- Contributions: GUI, dispatchers
- Future work & conclusion
Project overview
The STAR Experiment
The Solenoidal Tracker At RHIC (http://www.star.bnl.gov/) is an experiment located at BNL (USA): a collaboration of 546 people spanning 12 countries.
A PByte-scale experiment overall (raw, reconstructed events, simulation), with a large number of files (several million).
- Run4 alone (2003-2004) has produced 200 TB of raw data
- Expecting 200 TB of reconstructed data
- 40 TB of MuDST (1 pass); files copied to Tier 1 using SRM tools (see Track 4, 344)
A rich set of data analysis and simulation problems.
The problem
Ongoing analysis: past and new sets of data are constantly analyzed, with data spread over many locations, sites, and storage types; some of it sits on distributed disk local to each machine and is not easily accessible.
Evolving technologies: distributed computing (re)shapes itself as we make progress (Condor-G, portals, meta-schedulers, Web Services, Grid Services, …), and batch technologies themselves evolve.
Users have to adapt within a productive environment and an ever-growing scientific program. That may be fine for a new experiment, but not for a running one.
Solution
Allow users to pursue their scientific endeavor without disruption: make use of current/available resources and ensure the same productivity (subjective without a metric).
Develop a front end shielding the user from technology details and changes: a job-concept abstraction. Attract users to migrate to the new framework and the Grid => data management, file relocation => Catalog.
Design a tool/framework that allows for evolution: changing the underlying technology should NOT mean a change in the user's daily routine. The framework should allow for testing ideas and plugging in new components (Dispatchers for Local Resource Managers = LRMS), moving users to distributed computing with no extraneous knowledge required.
And so SUMS was born …
The project started in 2002, with a light developer team (on average ~1.0 FTE). Surrounding activities have enriched the project and spawned new activities and collaborations (monitoring, U-JDL, resource-brokering studies, …).
Historically a STAR project: design and prototype responsibility was taken by WSU, and the project was enhanced and brought to the user community (Gabriele Carcassi). Current development & design: Levente Hajdu.
Entirely written in Java: a portable, modular, class-based design, with project management, auto-documentation, …
Design / Architecture - Open
Basic principles
Users do NOT write shell scripts and submit series of tag=value pairs. Instead, they write an XML document, the U-JDL, following a prescribed schema and describing their "intent" to work on files, a DataSet, collections, etc. …
- They do not have to know where those files are located (LFNs or collections may be converted to PFNs)
- They do not have to handle the gory details of resource management (bsub -R …)
- They do not need to think about where their job will best fit; their input to SUMS is rate or range indications

% star-submit MyJob.xml
% star-submit-template -template MyTemplateJob.xml -entities jobname=test,year=2004
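For illustration, a minimal job description of this kind might look like the following (the macro name, output paths, and catalog query are example values taken from elsewhere in this talk, not prescriptions):

```xml
<?xml version="1.0" encoding="utf-8" ?>
<job maxFilesPerProcess="500">
  <command>root4star -q -b rootMacros/numberOfEventsList.C\(\"$FILELIST\"\)</command>
  <stdout URL="file:/star/u/xxx/scheduler/out/$JOBID.out" />
  <input URL="catalog:star.bnl.gov?production=P02gd,filetype=daq_reco_mudst"
         preferStorage="local" nFiles="all" />
  <output fromScratch="*.root" toURL="file:/star/u/xxx/scheduler/out/" />
</job>
```

The user states what to run and on which data set; SUMS resolves where the files are and how to submit.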
What it does …

The user writes a job description (test.xml):

<?xml version="1.0" encoding="utf-8" ?>
<job maxFilesPerProcess="500">
  <command>root4star -q -b rootMacros/numberOfEventsList.C\(\"$FILELIST\"\)</command>
  <stdout URL="file:/star/u/xxx/scheduler/out/$JOBID.out" />
  <input URL="catalog:star.bnl.gov?production=P02gd,filetype=daq_reco_mudst" preferStorage="local" nFiles="all" />
  <output fromScratch="*.root" toURL="file:/star/u/xxx/scheduler/out/" />
</job>

SUMS resolves the catalog query and wildcards into concrete files (/star/data09/reco/productionCentral/FullFie...), splits them according to the policy, and generates one file list and submission script per process: sched1043250413862_0.list / .csh, sched1043250413862_1.list / .csh, sched1043250413862_2.list / .csh. In short: user input … query/wildcard resolution … policy … dispatcher.
Architecture / building blocks
• Main boxes are Java classes
• The framework chooses the blocks to use depending on user options (% … -policy XXX)
• Interfaces between blocks are identical
• Implementations of the Policy class are the heart of SUMS (decision making, planning, resource brokering, …): extendable, adaptable
Job Initializer
The XML is validated and request objects are created …
Queues
The queue concept is "open":
- A queue can be an LRMS queue (PBS, LSF, SGE, …)
- A queue can be a pool or a DRMS (Condor, Condor-G, …)
- A Web or Grid Service … anything for which a dispatcher can be written

The object container is defined by, or defines:
- A name (which may be logical)
- An associated dispatcher (it has a pointer to a dispatcher object); e.g. the LSFDispatcher uses the logical name as the queue name
- Resource requirements: CPU-time limits, memory limits, the type of storage it can access, storage limits

Base rule: these can be undefined (-1), which a Policy must expect.
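As an illustration only, the queue container described above might be sketched in Java (the language SUMS is written in); the class and method names here are hypothetical, not the actual SUMS API:

```java
// Illustrative sketch of the SUMS queue container; names are hypothetical.
class Queue {
    public static final int UNDEFINED = -1; // base rule: limits may be undefined

    private final String name;     // logical name (e.g. an LSF queue name)
    private final int cpuLimitMin; // CPU-time limit in minutes, or UNDEFINED
    private final int memLimitMB;  // memory limit in MB, or UNDEFINED

    public Queue(String name, int cpuLimitMin, int memLimitMB) {
        this.name = name;
        this.cpuLimitMin = cpuLimitMin;
        this.memLimitMB = memLimitMB;
    }

    public String getName() { return name; }

    // A Policy must treat UNDEFINED limits as "no constraint".
    public boolean accepts(int cpuMin, int memMB) {
        return (cpuLimitMin == UNDEFINED || cpuMin <= cpuLimitMin)
            && (memLimitMB == UNDEFINED || memMB <= memLimitMB);
    }
}
```

A Policy can then ask each queue whether a job's requirements fit, without caring whether the queue is backed by LSF, PBS, or a Grid service.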
Policies
Policies integrate pre-defined queues (serialized XML as local configuration); a policy can make use of as many queues as necessary.
- Queues may have a type (LSF, PBS, Condor, …) and a scope (local, distributed, …), which allows SUMS to decide which one to take depending on the resource-brokering decision
- Queues can be given an initial weight (used, for example, for ordering if weight = priority) and a weight increment
- Complex policies may order queues as necessary (your choice); the default is to order by weight (priority)
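A minimal sketch of the default weight ordering, assuming a higher weight means a higher priority (the class and method names are invented for illustration, not the real SUMS Policy classes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch: order queue names by descending weight (priority).
class WeightPolicy {
    public static List<String> order(Map<String, Integer> weights) {
        List<String> names = new ArrayList<>(weights.keySet());
        // Higher weight first; ties keep no particular order.
        names.sort((a, b) -> Integer.compare(weights.get(b), weights.get(a)));
        return names;
    }
}
```

A more complex policy would reorder this list using runtime information instead of static weights.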
Policy note – job splitting
The <input> element can take several forms:
- Transition formats: PFN, PFN (wildcard)
  <input URL="file:/star/data15/reco/productionCentral/FullField/P02ge/2001/322/st_physics_2322006_raw_0016.MuDst.root" />
  <input URL="file:/star/data15/reco/productionCentral/FullField/P02ge/2001/*/*.MuDst.root" />
- Locally distributed PFN support
  <input URL="file://rcas6078.rcf.bnl.gov/home/starreco/reco/productionCentral/FullField/P02gd/2001/279/st_physics_2279005_raw_0285.MuDst.root" />
- List support
  <input URL="filelist:/star/u/user/username/filelists/mylist.list" />
- DataSet, MetaData support
  <input URL="catalog:star.bnl.gov?production=P02gd,filetype=daq_reco_mudst,storage=local" nFiles="2000" />
- … LFN support on the way …

Preferred STAR usage: map MetaData/Collections or LFNs to PFNs, then dispatch jobs. BUT THERE ARE TWO WAYS:
- PFNs are converted (URL syntax does not end up in the final lists; applications work as usual)
- Lists are formatted and passed to applications as URLs, and the applications need to sort the URLs out. Example: rootd-style URLs are passed as-is.
Dispatchers
The high-level dispatcher redirects to: PBS, LSF, SGE, Condor, Condor-G, BOSS, …
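The plug-in idea can be sketched as one interface with one implementation per LRMS/DRMS. The interface and method names below are assumptions for illustration, not the actual SUMS classes, and real bsub/qsub invocations carry more site-specific options:

```java
// Illustrative sketch of the dispatcher plug-in layer; names are hypothetical.
interface Dispatcher {
    // Build the submission command for one generated job script.
    String submitCommand(String script);
}

// LSF: submit with bsub, using the queue's logical name as the LSF queue name.
class LSFDispatcher implements Dispatcher {
    private final String queue;
    LSFDispatcher(String queue) { this.queue = queue; }
    public String submitCommand(String script) {
        return "bsub -q " + queue + " " + script;
    }
}

// PBS: submit with qsub.
class PBSDispatcher implements Dispatcher {
    public String submitCommand(String script) {
        return "qsub " + script;
    }
}
```

Supporting a new batch system then means writing one more Dispatcher implementation, without touching the Policy code.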
Add-on – usage monitoring
We needed usage feedback: monitoring users' usage allows for a better-targeted tool.
- Focus can be put on the most used/preferred features (CS fantasy trimmed down)
- It serves the user community better: eliminates divergence and re-focuses effort; practicality first, SciFi later …
- It ensures equity of usage
- It helps re-focus tutorials & documentation

JSP based (Tomcat) with a MySQL back-end; all options and usage are recorded.
Example of useful information …
We implemented two ways of accessing locally distributed files: is it used?
We added the SGE dispatcher a few weeks ago …
Which storage type is most used … may very well be a $$ / accessibility question.
Example II-a
Submission rates at BNL and PDSF: 4500 jobs/day, with peaks at 20k.
Example II-b
The pessimistic graph is an integral count over time; it shows that after first usage, users keep using SUMS …
NB: the drop from the beginning of the summer reflects vacation time, conference time, and a lack of new data (not the best period for a SUMS commercial, but informative nonetheless).
See more statistics at http://www.star.bnl.gov/STAR/comp/Grid/scheduler/
Physicist usage
As far as we know, 85% of active users are using SUMS. A selection of publications confirmed as 100% SUMS-based analysis:
- J. Gonzales, "Pseudorapidity Asymmetry and Centrality Dependence of Charged Hadron Spectra in d+Au Collisions at sqrt(s_NN) = 200 GeV", nucl-ex/0408016 (submitted to PRC)
- L. S. Barnby, QM 2004 proceedings, J. Phys. G: Nucl. Part. Phys. 30 S1121-S1124
- T. Henry, "Full jet reconstruction in d+Au and p+p collisions at RHIC", J. Phys. G: Nucl. Part. Phys. 30 (8) S1287
- J. S. Lange, "Review of search for heavy flavor (c, b quarks) production in leptonic decay channels in Au+Au collisions at sqrt(s_NN) = 200 GeV at the STAR Experiment at RHIC", proceedings, 19th Winter Workshop on Nuclear Dynamics (2003), nucl-ex/0306005
- A. Tang, "Anisotropy at RHIC: the first and the fourth harmonic" …
http://www.star.bnl.gov/central/publications/ (7 papers / analyses submitted in the past 3 months)
Grid experience
Use of SUMS for Grid job submission is possible, modulo RSL extensions:
- <input> and <output> tags MUST specify paths as relative paths ("bla.root", "blop/test.dat", …)
- The <output> attributes fromScratch / toURL are designed to bring the files back (globus-url-copy)

The Grid experience has been a challenge: cryptic messages everywhere. We had a problem with globus error 74 and no clue what it was for months (no Grid help-desk, no knowledge-base index); it turned out to be a firewall issue causing bursts of massive job deaths.

Nonetheless, 1/4 of the Run4 simulation production was made on the Grid: 100,000 events generated, analysis ongoing. Success rate: 85% when all goes well, 60% when lots of jobs are submitted (the issue above).

We plan to run on a larger-scale platform, Grid3+ and/or OSG-0, with (hopefully) better ways to track errors/problems.
Schedulers
Schedulers
Can a user front end to other LRMS/DRMS be called a "scheduler"? Is using local resources within the same paradigm as globally distributed resources?

             Traditional (LRMS)                  Distributed (DRMS)
Job          Mostly serialized                   Possibly following a workflow
Data         File based                          Data sets, collections, …
Scheduling   One LRMS used                       Many; issues are consistency, QoS,
                                                 unified information (from/to)
AAA          Handled by the LRMS                 VO based; ownership is itself an issue
Resources    Dedicated or managed by local       Common; no global policies, but
             policy (priority, usage             agreements or statements of
             throttle, …)                        understanding
Schedulers
Key features for a scheduler:
- Keep global accounting
- Scheduling decisions may be based on: resource availability, respect of local policies, fairshare (cluster autonomy); advance reservation, best use of resources; network and data cache, data availability …
- Job migration: moving jobs to/from a trusted cluster
- Spanning and workflow
- Human-readable messages …

The scheduling algorithm can be complex:
- Attempts to predict ("weather services") have proven difficult
- Dedicated global accounting and standard messages are possible
- A mix of LRMS and DRMS capabilities (user autonomy) is not common
- A complex algorithm takes so many parameters into account …

Empirical approach: inspect queue behavior, send jobs, see how the queue reacts … re-adjust. A self-sustained system that adapts to network/resource/load changes??
Empirical approach (?)
- Information is fed by agents to MonALISA (ML)
- The information is recovered by a SUMS module (the monitoring policy; LSF in this setup)
- Scheduling decisions are made based on load and "queue" or "pool" response time
- A self-sustained system: no need for percentage-based submission branching
- Hopefully no need for a complex algorithm
- Responds as resources, priorities, and bandwidth adjust

Results / details in Efstratios Efstathiadis' presentation, Track 4 - 393
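The decision step above might be sketched as follows. This is a deliberately naive illustration with invented names; the real monitoring policy, and the MonALISA data it consumes, are described in the referenced presentation:

```java
import java.util.Map;

// Naive sketch: route the next job to the queue/pool with the shortest
// measured response time, as reported by monitoring agents.
class ResponseTimePolicy {
    // responseTimes: queue or pool name -> measured response time in seconds.
    public static String pick(Map<String, Double> responseTimes) {
        String best = null;
        for (Map.Entry<String, Double> e : responseTimes.entrySet()) {
            if (best == null || e.getValue() < responseTimes.get(best)) {
                best = e.getKey();
            }
        }
        return best; // no fixed percentage-based submission branching needed
    }
}
```

As a site becomes loaded, its response time grows and new jobs drift to the more responsive resource, which is what makes the system self-sustaining.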
Contributions
The RHIC/PHENIX collaboration has tested and is using SUMS. Contributions included the addition of dispatchers (PBS, BOSS) - Andrey Shevel. Development includes the creation of a GUI front end for end-users - Mike Reuter.

Job tracking and monitoring: SUMS allows for dispatching to ANY queue, so BOSS (from CMS) is a possible solution as "a" dispatcher, with BODE for tracking. Implemented / contributed by Andrey Shevel (PHENIX/SUNY-SB) - Track 5, 86.
Future work
High-level user JDL work started with a document on RDL (PPDG-39).
Motivation: the current U-JDL is simple enough but has its limitations:
- Extension to new resource requirements is possible but inelegant
- The U-JDL considers most (but not all) data sets
- It lacks the concepts of tasks and sandboxes
- Only AND (sequential) workflow diagrams are implemented (need OR, conditional branching, etc.)

SBIR with Tech-X (David Alexander). Deliverables:
- An enhanced and complete U-JDL (AJHDL)
- A WSDL for creating a Grid Service

We reviewed most available high-level JDLs:
- Job Submission Description Language (JSDL) (GGF)
- Analysis Job Description Language (AJDL) (ATLAS)
- User Request Description Language (URDL) (PPDG-39 / JLab/STAR)
- Job Description Language (JDL) (DataGrid)
- Job Description Language (JDL) (JLab)
- …
Future work
We promised our users that the U-JDL will not change: for what they know, it won't (XSLT, schema transformation), but the ones using AJHDL will have access to more features.
We are working on job tracking, and on the concept of a Meta-Log (application-level monitoring), which seems to be forgotten (Valeri Fine - Poster, 480).
Conclusions
SUMS is NOT:
- a batch system
- a toy (real needs, real use, real Physics)

SUMS is:
- A front end to local and distributed RMS, acting as a client to multiple, heterogeneous RMS
- A flexible, open, object-oriented framework with plug-and-play features
- A good environment for further developing standards (such as a high-level JDL) and the scalability of other components (ML work, immediate use)
- Used in STAR for real Physics (usage and publication list)
- Used for distributed / Grid simulation job submission
- Used successfully by other experiments
- A means to transition active users to distributed computing and recover under-used resources …