Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience
![Page 1: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/1.jpg)
Rome, Sep 2011 Adapting with few simple rules in glideinWMS 1
Adaptive 2011
Adapting to the Unknown With a few Simple Rules:
The glideinWMS Experience
by Igor Sfiligoi (1), Benjamin Hass (1), Frank Würthwein (1), and Burt Holzman (2)
(1) UCSD, (2) FNAL
![Page 2: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/2.jpg)
The Grid landscape
● Many highly autonomous Grid sites
● Many diverse user communities
● How can users efficiently schedule their jobs?
Within Scientific Grid environments (e.g. OSG, EGI)
Nothing shared
![Page 3: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/3.jpg)
Scheduling problem
● Grid sites expose only partial information
  ● Access to finer details restricted to site admins
● Each user community wants independence
  ● No centralized, Grid-wide job scheduling
● As a result
  ● Cannot accurately predict even the near future
  ● Partitioning across sites is mostly guesswork
● Adapting to the ever-changing state is a must
![Page 4: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/4.jpg)
Traditional approaches
● Force sites to expose as much info as possible
  ● Sites end up publishing lots of garbage
● Implement retries
  ● Long tail before ALL jobs in a workflow finish
● Start at many sites concurrently, then kill some
  ● Wasteful and with semantic problems
● Mediocre results and complex code
![Page 5: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/5.jpg)
The glideinWMS
● The glideinWMS approach to the problem
  ● Use the pilot paradigm
  ● Pressure based scheduling
  ● Avoid using external information
  ● Range reduction
The glideinWMS is a Grid job scheduler initially developed at FNAL by the CMS experiment
● Based on the CDF glideCAF concept
● With contributions from several other institutes
● Widely used in OSG, with a large instance at UCSD
![Page 6: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/6.jpg)
The pilot paradigm
● Send pilots to Grid sites (never user jobs)
● Create a dynamic overlay pool of compute resources
  ● Jobs scheduled within this overlay pool
● Scheduling in the overlay pool easy
  ● Complete info
  ● Full control
● Problem moved to the pilot submitter
[Diagram: the pilot submitter sends pilots to Site 1 ... Site N; the running pilots form the overlay pool. Callouts: pilots not user specific; one pool per user community.]
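As a rough illustration of why scheduling inside the overlay pool is easy, the toy sketch below keeps the complete state of pilots and jobs in one place and matches them first-come, first-served. The `OverlayPool` class, its method names, and the FIFO policy are assumptions made for this example; they are not taken from glideinWMS.

```python
from collections import deque


class OverlayPool:
    """Toy overlay pool: pilots contribute slots, user jobs consume them."""

    def __init__(self):
        self.idle_slots = deque()  # pilots that started and are waiting for work
        self.idle_jobs = deque()   # user jobs waiting for a slot
        self.running = {}          # job -> (site, pilot_id)

    def pilot_started(self, site, pilot_id):
        """A pilot landed on a Grid site and joined the pool."""
        self.idle_slots.append((site, pilot_id))
        self._match()

    def submit_job(self, job):
        """A user job enters the pool; it is never sent to a site directly."""
        self.idle_jobs.append(job)
        self._match()

    def _match(self):
        # The pool sees every slot and every job, so matching is a local decision.
        while self.idle_slots and self.idle_jobs:
            job = self.idle_jobs.popleft()
            self.running[job] = self.idle_slots.popleft()
```

A job submitted before any pilot has started simply waits in `idle_jobs`; it starts as soon as `pilot_started` reports a free slot, regardless of which site that pilot landed on.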
![Page 7: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/7.jpg)
Is pilot scheduling easier?
● User jobs
  ● Every job is important => users wait for last to finish
  ● A failed job is a problem for the user
  ● Many users => priority handling
  ● Must handle each and every one
● Pilot jobs
  ● All the same
  ● A failed pilot job is just wasted CPU time
  ● Single credential => no need to prioritize between them
  ● Only number of pilot jobs counts
![Page 8: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/8.jpg)
Pressure based scheduling
● The glideinWMS pilot scheduling is based on the concept of pilot pressure
  ● Keep a fixed no. of pending pilots in remote queues
  ● Site by site
● Furthermore, split pilot scheduling from pilot submission
  ● Scheduling in VO frontend
[Diagram: VO frontend, Glidein factories, and pilots at Site 1 ... Site N]
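The pressure rule on this slide can be sketched as a top-up calculation: for each site, compare the desired number of pending pilots with the number currently pending and request the difference. The function name, the dictionary inputs, and the choice to leave surplus pending pilots untouched are assumptions for illustration, not the glideinWMS implementation.

```python
def pilots_to_submit(target_pressure, pending_pilots):
    """Return, per site, how many new pilots restore the target pressure.

    target_pressure: site name -> desired number of pending pilots
    pending_pilots:  site name -> pilots currently waiting in that site's queue
    """
    return {
        site: max(0, target - pending_pilots.get(site, 0))
        for site, target in target_pressure.items()
    }
```

For example, with a target of 10 pending pilots at a site that currently has 4 waiting, the sketch asks for 6 more; a site that already has more pending pilots than its target gets no new submissions.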
![Page 9: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/9.jpg)
Determining the pressure
● Calculating the proper pressure is important
  ● Too low => small overlay pool => long job wait
  ● Too high => no jobs when pilot starts => wasted CPU
● Must be recalculated often
● Each site has its own pressure
● Input to pressure calculation
  ● Only no. of matching pending (i.e. idle) user jobs
  ● Grid status incomplete and unreliable
● Some jobs can run on multiple Grid sites
  ● Count them as the appropriate fraction against each
$P_s(t) = f(I_s(t))$
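One way to read the fractional counting of idle jobs is sketched below: a job that matches n sites contributes 1/n to the idle count of each of them, so every job still counts exactly once overall. The equal 1/n split, the function name, and the input layout are assumptions; the slide only says that such jobs are counted "as the appropriate fraction against each".

```python
from collections import defaultdict
from fractions import Fraction


def idle_jobs_per_site(idle_jobs):
    """Compute the per-site idle job count from the sites each job matches.

    idle_jobs: iterable with one collection of site names per idle user job.
    """
    counts = defaultdict(Fraction)
    for sites in idle_jobs:
        sites = set(sites)
        if not sites:
            continue  # a job that matches no site creates no pressure anywhere
        share = Fraction(1, len(sites))  # assumed equal split across matching sites
        for site in sites:
            counts[site] += share
    return dict(counts)
```

With this split, the per-site counts always add up to the number of idle jobs that match at least one site, so the total pressure is not inflated by jobs that can run in several places.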
![Page 10: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/10.jpg)
Simple pressure function
● Experience tells us Grid jobs have relatively flat start and terminate rates
  ● Typical O(10/few mins), max O(100/few mins)
  ● So pressure can be capped in the O(10) range
● Small range => tuning only when few jobs
  ● Using simple heuristic of dividing by 3
  ● Just to have a reasonable edge-case policy
$f(I_s(t)) = \min(I_s(t)/3, C_s)$
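The capped pressure function is small enough to write out directly. In the minimal sketch below, the function and parameter names are hypothetical, and rounding up to a whole number of pilots is an assumption (the slide does not say how fractional values are handled).

```python
import math


def pressure(idle_jobs, cap, divisor=3):
    """Pending pilots to keep at one site: min(idle_jobs / divisor, cap).

    idle_jobs: matching idle user jobs for the site (may be fractional)
    cap:       per-site cap, expected to be in the O(10) range
    """
    raw = min(idle_jobs / divisor, cap)
    return math.ceil(raw)  # assumed rounding: any idle job yields at least one pilot
```

With a cap of 10, this gives a pressure of 1 for 2 idle jobs, 4 for 12 idle jobs, and 10 for 300 idle jobs, so the divide-by-3 heuristic only matters while few jobs are waiting.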
![Page 11: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/11.jpg)
Operational experience (1)
● CMS@UCSD has 2 years of experience
  ● Serving O(4k) users
  ● Using about O(100) Grid sites located in the Americas, Europe and Asia
[Plot: Grid sites concurrently used]
[Plot: Status of the overlay pool]
![Page 12: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/12.jpg)
Operational experience (2)
● CMS@UCSD has 2 years of experience
● The glideinWMS logic works very efficiently
  ● Quick job startup times
[Plot: Status of the overlay pool]
![Page 13: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/13.jpg)
Operational experience (3)
● CMS@UCSD has 2 years of experience
● The glideinWMS logic works very efficiently
  ● Quick job startup times
  ● Little over-provisioning (~5%)
[Plot: Status of the overlay pool]
![Page 14: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/14.jpg)
Related work
● Non-pilot WMS (i.e. direct submission)
  ● gLite WMS and OSG MM
  ● More complex and brittle since they require accurate and complete info from Grid sites
● Pilot WMS
  ● PANDA
    – Pressure based, with basically constant pressure over time => high load on sites
  ● DIRAC and MyCluster
    – Require services at Grid sites to gather site state => many Grid sites do not allow this
![Page 15: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/15.jpg)
Summary
● Direct Grid-wide job scheduling is hard
  ● Pilot paradigm simplifies it by making it uniform
● The glideinWMS uses pressure logic
  ● Based on number of pending user jobs only
  ● Pressure function capped => simple rules
● CMS experience at UCSD shows it works
  ● and it works well
![Page 16: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/16.jpg)
For more information
● The glideinWMS home page: http://tinyurl.com/glideinWMS
● Relevant papers:
  ● I. Sfiligoi et al., "The pilot way to grid resources using glideinWMS," CSIE, WRI World Cong. on, vol. 2, pp. 428-432, 2009, doi:10.1109/CSIE.2009.950
  ● The CMS Collaboration et al., "The CMS experiment at the CERN LHC," J. Inst, vol. 3, S08004, pp. 1-334, 2008, doi:10.1088/1748-0221/3/08/S08004
![Page 17: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/17.jpg)
Acknowledgment
● This work is partially sponsored by
  ● the US Department of Energy under Grant No. DE-FC02-06ER41436 subcontract No. 647F290 (OSG), and
  ● the US National Science Foundation under Grant No. PHY-0612805 (CMS Maintenance & Operations).
![Page 18: Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience](https://reader034.fdocuments.us/reader034/viewer/2022052321/5554a3ebb4c905fd608b4d06/html5/thumbnails/18.jpg)
Copyright notice
● This presentation contains graphics copyrighted by Toon-a-day that were licensed to Igor Sfiligoi for use in this presentation
  ● Any other use strictly prohibited