Workload Management WP Status and next steps Massimo Sgaravatto INFN Padova.
Workload Management WP
Status and next steps
Massimo Sgaravatto, INFN Padova
Where we are
  CMS-HLT use case (Monte Carlo production and reconstruction) analyzed in terms of GRID requirements and GRID tools availability
  Discussions with Globus team and Condor team
  Definition of a prototype architecture of workload management system
    Use of Globus and Condor mechanisms
    But major developments needed
Prototype workload management system architecture
  [Figure: architecture diagram. Jobs enter through condor_submit (Globus Universe) into Condor-G. A Master uses the Grid Information Service (GIS) for resource discovery. Three sites (Site1, Site2, Site3) each have a Globus GRAM in front of a local resource management system (Condor, LSF, PBS). Labels in the diagram: Submit jobs, Resource Discovery, Info, Farms, Local Resource Management Systems.]
  Globus GRAM as uniform interface to different local resource management systems
  Condor-G able to provide a reliable/crashproof job submission service
  Master chooses in which Globus resources the jobs must be submitted
Where we are
  Evaluating the existing components (D1.1) and “putting together” the various building blocks
  Evaluation of Globus
    Collaboration with WP 1 of INFN-GRID project (Evaluation of the Globus toolkit): http://www.infn.it/globus
    Evaluation of Globus GRAM
      GRAM as uniform interface to different underlying resource management systems
      Evaluation of RSL
      “Cooperation” between GRAM and GIS
  Evaluation of Condor-G
    The current implementation is a prototype
    It works, but some problems must be solved
    Globus + Condor-G tested with a real CMS MC production
      Many memory leaks found in the Globus jobmanager!
      Fixes (provided by Francesco Prelz) submitted to Globus team
      Feedback received only on the bugs in the GAA and GSS modules (new fixes “merged” with the original ones)
First deliverables
  Month 3: Report on current technology (report) D1.1
  Month 6: Definition of architecture for scheduling, resource management, security and job description (report) D1.2
  Month 9: Components and documentation for the 1st release: initial workload management system (prototype) D1.3
Proposed work plan
  Let’s continue the implementation of the proposed prototype
  Evaluation of current technologies (Globus, Condor) (D1.1)
  Functionalities for the 1st release
First release
  We can propose the functionalities that could be implemented
  “Negotiation” in the ATF
    To understand if these functionalities “address” the proposed use cases
    To understand if our module can be “plugged” together with the other “pieces”
    To understand if the other WPs can provide the functionalities required by WP 1
Proposed functionalities for the 1st release
First version of job description language (JDL)
First version of broker (master), that decides where to submit the jobs
Job submission service
First version of logging and bookkeeping services
First user interface
Job Description Language (JDL)
  Used when the job is submitted, to specify
    The application
    The input data set
      File? Collection of files? “Logical” or “physical” names?
      Needs to be discussed with WP 2, WP 8, ATF
    Where the output data must be saved
    (Required and preferable) resources
    Info for bookkeeping
    … ???
  Prototype: Condor ClassAds
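As a rough illustration of the items a job description has to carry, the sketch below models them as a small Python structure. All field names and sample values are hypothetical; the slides only say that the prototype JDL is based on Condor ClassAds, and the open questions (files vs. collections, logical vs. physical names) are left open here too.

```python
from dataclasses import dataclass, field


@dataclass
class JobDescription:
    """Hypothetical container for what the JDL must specify."""
    application: str                  # the application to run
    input_data: list                  # input data set (naming scheme still undecided)
    output_location: str              # where the output data must be saved
    required: dict = field(default_factory=dict)     # resources the job cannot run without
    preferred: dict = field(default_factory=dict)    # resources that are merely preferable
    bookkeeping: dict = field(default_factory=dict)  # free-form info for bookkeeping


# Sample values are invented for illustration only.
job = JobDescription(
    application="cms_mc_production",
    input_data=["dataset_A"],
    output_location="/scratch/output",
    required={"arch": "INTEL", "scratch_mb": 500},
    preferred={"max_queue_length": 10},
    bookkeeping={"run_label": "test-001"},
)
```

Keeping required and preferable resources in separate fields mirrors the distinction made on the slide: the first can rule a farm out, the second could only rank the farms that remain.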
Broker/Master
  Choice of resource (farm) where to submit job
    Input: JDL expression
    Output: computing resource choice
  Published resource access lists (gridmap-files in the Globus-based prototype) are checked as a first step in the resource match-making
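The access-list check can be sketched as follows. This assumes each farm publishes the text of its gridmap-file, where each line maps a quoted certificate subject to a local account name; how the lists are actually published is not specified in the slides, so here they are simply passed in as strings. The function names, farm names and subjects are hypothetical.

```python
import shlex


def parse_gridmap(text):
    """Return {certificate subject: local account} from gridmap-file text."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = shlex.split(line)  # handles the quoted subject, which may contain spaces
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping


def accessible_resources(user_subject, published_lists):
    """Names of the farms whose access list contains the user's subject."""
    return [
        farm
        for farm, text in published_lists.items()
        if user_subject in parse_gridmap(text)
    ]


# Invented example data.
published = {
    "farm_a": '"/O=Grid/O=Example/CN=Alice Example" alice\n',
    "farm_b": '# no entry for Alice\n"/O=Grid/O=Example/CN=Bob Example" bob\n',
}
allowed = accessible_resources("/O=Grid/O=Example/CN=Alice Example", published)
```

Only the farms returned by this first step would go on to the actual match against the job request.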
Broker/Master
  The “accessible” computing resources are matched with the job request according to:
    Availability of the requested input data set
      In the 1st release the broker will have to choose a resource where this input data set is already available (we are not going to “trigger” the replica of the input data set)
    Availability of the appropriate application “sandbox”
      If the sandbox is not already available in the executing farm, it could be necessary to “copy” and install it (“code migration”) (in the 1st release ???)
    Queue characteristics and status (architecture, etc…) vs. job requests
      Let’s start with a few, simple parameters
    Availability of the requested amount of scratch space
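A minimal sketch of these matching criteria, under stated assumptions: resources and jobs are plain dictionaries with hypothetical keys, the only queue characteristic compared is the architecture, and a missing sandbox excludes the farm (code migration is treated as out of scope, which the slides leave as an open question for the 1st release).

```python
def matches(job, resource):
    """True if the resource satisfies all four criteria for this job."""
    return (
        # input data set must already be at the farm (no replication triggered)
        set(job["input_data"]) <= set(resource["datasets"])
        # application sandbox must already be installed (assumption: no code migration)
        and job["sandbox"] in resource["sandboxes"]
        # a single, simple queue characteristic: the architecture
        and job["arch"] == resource["arch"]
        # enough scratch space for the job
        and resource["free_scratch_mb"] >= job["scratch_mb"]
    )


def candidate_resources(job, resources):
    """Names of all resources that match the job request."""
    return [r["name"] for r in resources if matches(job, r)]


# Invented example data.
job = {"input_data": ["dataset_A"], "sandbox": "cms_sw", "arch": "INTEL", "scratch_mb": 500}
resources = [
    {"name": "farm_a", "datasets": {"dataset_A"}, "sandboxes": {"cms_sw"},
     "arch": "INTEL", "free_scratch_mb": 2000},
    {"name": "farm_b", "datasets": {"dataset_B"}, "sandboxes": {"cms_sw"},
     "arch": "INTEL", "free_scratch_mb": 2000},
    {"name": "farm_c", "datasets": {"dataset_A"}, "sandboxes": {"cms_sw"},
     "arch": "INTEL", "free_scratch_mb": 100},
]
candidates = candidate_resources(job, resources)
```

In this example only `farm_a` survives: `farm_b` lacks the input data set and `farm_c` lacks scratch space.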
Broker/Master
  We assume that all the information needed by the broker is “published” in one “Grid Information Space” (GIS in the Globus-based prototype) by the other WPs
  Prototype: Condor matchmaking library
    Match between the info published in the GIS and the ClassAds defined in the JDL
    A “translator” from GIS attributes to ClassAds is necessary
    Some work already done by Globus team ???
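The idea of such a translator can be sketched very simply. The example below assumes the GIS attributes for one resource have already been fetched into a Python dictionary and renders them in the classic `Name = value` ClassAd notation, with strings in double quotes. The attribute names are hypothetical, and the real GIS schema and any existing Globus translator may look quite different.

```python
def to_classad(attrs):
    """Render a dict of attributes as 'Name = value' ClassAd lines.

    Assumes string values contain no double quotes (no escaping is done).
    """
    lines = []
    for name, value in sorted(attrs.items()):
        if isinstance(value, bool):
            rendered = "TRUE" if value else "FALSE"
        elif isinstance(value, (int, float)):
            rendered = str(value)
        else:
            rendered = f'"{value}"'
        lines.append(f"{name} = {rendered}")
    return "\n".join(lines)


# Invented attributes for one farm.
gis_entry = {"Arch": "INTEL", "FreeScratchMB": 2000, "HasCmsSandbox": True}
ad_text = to_classad(gis_entry)
```

Booleans are tested before numbers because Python treats `bool` as a subclass of `int`; without that ordering `True` would be rendered as `True` rather than as a ClassAd boolean.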
Job submission service
  Input: job to submit + computing resource choice (provided by broker)
  Reliable, fault tolerant, crash proof service
    Reliability in the executing machines up to WP 4
  Prototype: Condor-G
    Submission of jobs to Globus resources (farms)
    New implementation of Condor-G (+ new Globus job manager) available soon
“Code” migration
  Not easy at all !!!
    Necessary to “install” in the target farm a complex run time environment
  A STRONG collaboration with WP 8 (and WP 4) is necessary to define an “application sandbox” that can easily be installed in one farm and doesn’t “conflict” with other sandboxes
  Use of “application repositories” ???
    When an application must be installed on one farm, the sandbox is downloaded from such a repository
Bookkeeping
  Necessary to “record” for each job
    Submitting user identity
    Input data
    Output data
    Status of processing
    Where and when the processing has been done
    Other bookkeeping info specified in the JDL
    …???
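One possible shape for such a per-job record is sketched below. The class, field and status names are hypothetical, and the record is kept in memory purely for illustration; the slides do not say where bookkeeping data would be stored.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobRecord:
    """Hypothetical bookkeeping entry for one job."""
    user_identity: str                   # submitting user identity
    input_data: list                     # input data
    output_data: list = field(default_factory=list)
    status: str = "SUBMITTED"            # status of processing
    resource: Optional[str] = None       # where the processing has been done
    processed_at: Optional[str] = None   # when the processing has been done
    extra: dict = field(default_factory=dict)  # other info specified in the JDL


class Bookkeeping:
    """In-memory store of job records, keyed by job id."""

    def __init__(self):
        self._records = {}

    def register(self, job_id, record):
        self._records[job_id] = record

    def update(self, job_id, **changes):
        record = self._records[job_id]
        for name, value in changes.items():
            if not hasattr(record, name):
                raise AttributeError(f"unknown bookkeeping field: {name}")
            setattr(record, name, value)
        return record

    def get(self, job_id):
        return self._records[job_id]


# Invented example data.
books = Bookkeeping()
books.register("job-1", JobRecord(user_identity="/O=Grid/CN=Alice Example",
                                  input_data=["dataset_A"]))
books.update("job-1", status="DONE", resource="farm_a",
             processed_at="2001-01-15T10:00:00", output_data=["out_A"])
```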
Logging
  Necessary to keep track of the significant events that occurred in the system
    Requests by users
    Computing resource choice (by broker)
    Submission to resource
    …???
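A simple way to picture the logging service is an append-only list of timestamped events, one per significant step. The sketch below covers only the three event kinds named on the slide; the event names, class name and dictionary layout are hypothetical.

```python
import time

# One constant per event kind listed on the slide.
EVENT_TYPES = ("USER_REQUEST", "RESOURCE_CHOICE", "SUBMISSION")


class EventLog:
    """Append-only, in-memory log of significant events."""

    def __init__(self, clock=time.time):
        self._clock = clock  # injectable so the timestamps can be controlled
        self._events = []

    def record(self, event_type, job_id, detail=""):
        if event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {event_type}")
        event = {
            "time": self._clock(),
            "type": event_type,
            "job_id": job_id,
            "detail": detail,
        }
        self._events.append(event)
        return event

    def for_job(self, job_id):
        """All events recorded for one job, in the order they occurred."""
        return [e for e in self._events if e["job_id"] == job_id]


# Invented example: the life of one job up to submission.
log = EventLog()
log.record("USER_REQUEST", "job-1", "submit request received")
log.record("RESOURCE_CHOICE", "job-1", "broker chose farm_a")
log.record("SUBMISSION", "job-1", "submitted to farm_a")
```

Unlike the bookkeeping record, which holds the current state of a job, the log keeps every step, so the two services answer different questions about the same job.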
User Interface
  Job management
    Job submission
    Job removal
    Job status monitoring
  Access to bookkeeping info
  Access to logging info
  …???