Job Submission
description
Transcript of Job Submission
![Page 1: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/1.jpg)
EGEE-II INFSO-RI-031688
Enabling Grids for E-sciencE
www.eu-egee.org
EGEE and gLite are registered trademarks
Job Submission
Fokke Dijkstra RuG/SARA
Grid tutorial Groningen September 2006
![Page 2: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/2.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Contents
• The LCG Workload Management System (WMS) in gLite
• Job Submission to EGEE / NL-Grid– Job Preparation– A simple example & Job Lifecycle– Job Description Language (JDL)– Job Submission & Monitoring– Some more advanced topics
![Page 3: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/3.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
WMS
?
![Page 4: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/4.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
The LCG WMS
• The user submits jobs via the Workload Management System
• The Goal of WMS is the distributed scheduling and resource management in a Grid environment.
• What does it allow Grid users to do?To submit their jobs
To execute them
To get information about their status
To retrieve their output
• The WMS tries to – Optimize the usage of resources– Execute user jobs as fast as possible
![Page 5: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/5.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Logging &Bookkeeping(LB)
ResourceBroker (RB)
Job SubmissionService (JSS)
StorageElement(SE)
ComputingElement (CE)
Information System (BDII)
LCGFileCatalog(LFC)
JDL
User Interface (UI)
WMS components
![Page 6: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/6.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Job Preparation
• You need to provide– A complete (enough) job description
What program? What data? Any requirements on OS, installed software, ??
– Possibly a program You’re submitting in unknown territory! Program portably! Don’t rely on hard-coded paths or special locations The program you send may not even be in $HOME!
– Perhaps some input data– Perhaps instructions on what to do with the output
![Page 7: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/7.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
How to Write a Job Description
• Here is a minimal job description (call it hello.jdl)
• We specified– The program to run and its arguments– Directed the standard error and output streams to files– Told it what to do with the output
Executable = “/bin/echo”;Arguments = “Goedemiddag”;StdError = “stderr.log”;StdOutput = “stdout.log”;OutputSandbox = {“stderr.log”, “stdout.log”};
![Page 8: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/8.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Job Submission Example
• User issues a voms-proxy-init– enters his certificate’s password– Receives a valid Globus proxy
• User issues a: edg-job-submit mytest.jdl and gets back from the system a unique Job Identifier (JobId)
• User issues a: edg-job-status JobId to get logging information about the current status of his Job
• When the “OutputReady” status is reached, the user can issue a edg-job-get-output JobId
and the system returns the name of the temporary directory where the job output can be found on the UI machine.
![Page 9: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/9.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Submitting it
$ voms-proxy-init --voms tutorYour identity: /O=edgtutorial/O=users/O=rug/OU=rc/CN=Fokke DijkstraEnter GRID pass phrase:Creating temporary
proxy ................................................................. DoneContacting mu4.matrix.sara.nl:30007
[/O=dutchgrid/O=hosts/OU=sara.nl/CN=mu4.matrix.sara.nl] "tutor" DoneCreating proxy .............................................. DoneYour proxy is valid until Mon Sep 11 23:22:12 2006
$ edg-job-submit hello.jdlSelected Virtual Organisation name (from UI conf file): tutorConnecting to host mu3.matrix.sara.nl, port 7772Logging to host mu3.matrix.sara.nl, port 9002
******************************************************************************* JOB SUBMIT OUTCOME The job has been successfully submitted to the Network Server. Use edg-job-status command to check job current status. Your job identifier
(edg_jobId) is:
- https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
*******************************************************************************
JobId
![Page 10: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/10.jpg)
A Job Submission Example
UIJDL
Logging &Bookkeeping(LB)
ResourceBroker (RB)
Job SubmissionService (JSS)
StorageElement(SE)
ComputingElement (CE)
Information System (IS)
Job SubmitEvent
Input Sandbox
Job Status
submittedLCGFileCatalog(LFC)
User Interface (UI)
Job Status
waiting
![Page 11: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/11.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Checking the status
$ edg-job-status https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
*************************************************************BOOKKEEPING INFORMATION:
Status info for the Job : https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
Current Status: Done (Success)Exit code: 0Status Reason: Job terminated successfullyDestination: mu6.matrix.sara.nl:2119/jobmanager-pbs-longreached on: Tue Jun 1 08:14:25 2004*************************************************************
![Page 12: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/12.jpg)
A Job Submission Example
UIJDL
Logging &Bookkeeping(LB)
ResourceBroker (RB)
Job SubmissionService (JSS)
StorageElement(SE)
ComputingElement (CE)
Information System (IS)
LCGFileCatalog(LFC)
Job Status
submitted
waiting
User Interface (UI)
Job Status
ready
BrokerInfo
scheduled
Job Status
Job Status
running
Input Sandbox
Output Sandbox
done
outputready
![Page 13: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/13.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Getting the Output
$ edg-job-get-output https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
Retrieving files from host: mu3.matrix.sara.nl ( for https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q )
******************************************************************************* JOB GET OUTPUT OUTCOME
Output sandbox files for the job: - https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q have been successfully retrieved and stored in the directory: /tmp/jobOutput/fokke_Nz6PWWJCjtT7YY3PJWDu5Q
*******************************************************************************
$ cat /tmp/jobOutput/fokke_Nz6PWWJCjtT7YY3PJWDu5Q/std.outGoedemiddag
![Page 14: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/14.jpg)
A Job Submission Example
UIJDL
Logging &Bookkeeping(LB)
ResourceBroker (RB)
Job SubmissionService (JSS)
StorageElement(SE)
ComputingElement (CE)
Information System (IS)
LCGFileCatalog(LFC)
Output Sandbox
cleared
submitted
waiting
ready
scheduled
running
done
Job Status
outputready
Job Status
![Page 15: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/15.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Job Description Language (JDL)
• Based upon Condor’s CLASSified ADvertisement language (ClassAd)
• ClassAd is an extensible language• Sequence of attributes (key,value pairs) separated by
semi-colons.
Executable = “/bin/echo”;Arguments = “Goedemiddag”;StdError = “stderr.log”;StdOutput = “stdout.log”;OutputSandbox = {“stderr.log”, “stdout.log”};
![Page 16: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/16.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Types of Attributes
• The supported attributes are grouped in two categories:
– JobDefine the job itself
– Resources Taken into account by the RB for carrying out the
matchmaking algorithm Computing Resource (Attributes)
Used to build expressions of Requirements and/or Rank attributes by the user
Have to be prefixed with “other.” Data and Storage resources (Attributes)
Input data to process, SE where to store output data, protocols spoken by application when accessing SEs
![Page 17: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/17.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Job Definition Attributes
• Executable (mandatory)– The command name
• Arguments (optional)– Job command line arguments
• StdInput, StdOutput, StdErr (optional)– Standard input/output/error of the job
• Environment (optional)– List of environment settings
• InputSandbox (optional)– List of files on the UI local disk needed by the job for running– The listed files are staged from the UI to the remote CE
• OutputSandbox (optional)– List of files, generated by the job, which have to be retrieved
![Page 18: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/18.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Resource Attributes
• Requirements– Job requirements on computing resources – Specified using attributes of resources published in the Information
System– If not specified, default value defined in UI configuration file is
considered Default: other.GlueCEStateStatus == "Production" (the resource has to be in the
Production grid)
• Rank– Expresses preference (how to rank resources that have already met the
Requirements expression)– Specified using attributes of resources published in the Information
Service– If not specified, default value defined in the UI configuration file is
considered Default: - other.GlueCEStateFreeCPUs (the highest number of free CPUs)
![Page 19: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/19.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
“Data” Attributes
• InputData (optional)– Refers to data used as input by the job: these data are published
in the Replica Catalog and stored in the SEs)– PFNs and/or LFNs
• DataAccessProtocol (mandatory if InputData specified)– The protocol or the list of protocols which the application is able
to speak with for accessing InputData on a given SE
• OutputSE (optional)– The hostname of the output SE– RB uses it to choose a CE that is compatible with the job and is
close to SE
• OutputData (optional)– Output Data that will be registered at the end of the job
![Page 20: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/20.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Example JDL File
Executable = “gridTest”;
StdError = “stderr.log”;
StdOutput = “stdout.log”;
InputSandbox = {“/home/joda/test/gridTest”};
OutputSandbox = {“stderr.log”, “stdout.log”};
InputData = “lfn:/grid/tutor/testbed0-00019”;
DataAccessProtocol = “gridftp”;
Requirements = other.Architecture==“INTEL” && \ other.OpSys==“LINUX” && other.FreeCpus >=4;
Rank = “other.GlueHostBenchmarkSF00”;
![Page 21: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/21.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Job Submission
• edg-job-submit [–r <res_id>] [–n <user e-mail address>] [-c <config file>] [-o <output file>] <job.jdl>-r the job is submitted by the RB directly to the computing element
identified by <res_id>
-c the configuration file <config file> is used by the UI instead of the standard configuration file
-o the generated edg_jobId is written in the <output file>
Useful for other commands, e.g.:
edg-job-status –i <input file> (or edg_jobId)
-i the status information about edg_jobId contained in the <input file> are displayed
--vo the VO under which the job will be run
![Page 22: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/22.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Other WMS UI Commands
• edg-job-list-matchLists resources matching a job description
Performs the matchmaking without submitting the job
• edg-job-cancelCancels a given job
• edg-job-statusDisplays the status of the job
• edg-job-get-outputReturns the job-output (the OutputSandbox files) to the user
• edg-job-get-logging-infoDisplays logging information about submitted jobs (all the events
“pushed” by the various components of the WMS)
Very useful for debug purposes
![Page 23: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/23.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
WMS Match Making
• The RB is the core component of WMS. • It has to find the best suitable computing resource (CE) where the
job will be executed• It interacts with Data Management service and Information System
They supply RB with all the information required for the resolution of the matches
• The CE chosen by RB has to match the job requirements (e.g. runtime environment, data access requirements, and so on)
• If 2 or more CEs satisfy all the requirements, the one with the best Rank is chosen
![Page 24: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/24.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Direct Job submission
• The RB has to deal with three possible scenarios.
Scenario 1: Direct Job Submission Job is scheduled on a given CE (specified in the edg-job-
submit command via –r option) RB doesn’t perform any matchmaking algorithm Take care if InputData is specified!
![Page 25: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/25.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Brokered Job Submission, No InputData
Scenario 2: Job Submission without data-access Requirements Neither CE nor input data are specified. RB starts the matchmaking algorithm, which consists of two
phases: • Requirements check (RB contacts the IS to check which
CEs satisfy all the requirements)• If more than one CE satisfies the job requirements, the CE
with the best rank is chosen by the RB
![Page 26: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/26.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Brokered Job Submission, Grid Data
Scenario 3: CE is not specified in the JDL RB contacts Data Management service to find out which SE’s
have copies of the requested input data sets RB makes best effort match between
• Computing resources for which user is authorized• SE’s “nearby” which can provide the requested data sets via the
requested transfer protocol• Any optional output SE specified in the job description
RB strategy consists of submitting jobs close to data! The main two phases of the match making algorithm remain
unchanged:• Requirements check• Rank computation
The matchmaking is only performed for CEs satisfying the data-access requirements (i.e. which are close to data)
![Page 27: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/27.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Proxy Renewal
• Why?– To avoid job failure because it outlived the validity of the initial proxy
• WMS support automatic proxy renewal mechanism as long as the user credentials are handled by a proxy server.1. Create a proxy using
voms-proxy-init
2. Register this proxy with the MyProxy server using myproxy-init –s <server> [-t <cred> -c <proxy>] –d -n
server is the server address (e.g. px.matrix.sara.nl)
cred is the number of hours the proxy should be valid on the server
proxy is the number of hours renewed proxies should be valid
3. Short term proxies can then be used to start jobs usinggrid-proxy-init –hours <hours> command
4. The Proxy is automatic renewed by WMS without user intervention for all the job life
![Page 28: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/28.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
MPI jobs
• MPI– Message passing– Link with parallel library– Run on multiple processors
• gLite– Limited support– Some sites can run MPI jobs
• JobType– JobType=”MPICH”;– NodeNumber = 8;– Adds MPICH support as requirement– Executable run in paralllel on 8 CPU’s
![Page 29: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/29.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Other JobTypes
• Interactive– StdOutput, StdInput and StdError forwarded to user– default X window– Other tools
• Checkpointable– Job must save checkpoints– Checkpoints can be retrieved– Not fully supported yet
![Page 30: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/30.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Further Information
• The gLite User Guide!
http://glite.web.cern.ch/glite/documentation/default.asp
• ClassAdhttps://www.cs.wisc.edu/condor/classad/
• Sara Grid pageshttp://www.sara.nl/userinfo/grid/
![Page 31: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/31.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
UI configuration file
• Can be set if (expert) user is not happy with default one • Most relevant attributes:
– RB(s) When submitting a job, the first specified RB is tried, if the operation fails
the second one is considered, etc.– LBserver(s)
The LB to be used for a job is chosen by the RB So when a edg-job-status <edg-jobid> is issued, the LB to contact is
specified in the edg-jobid This list specifies the LB(s) that must be contacted when issuing a edg-job-
status –all / edg-job-get-logging-info –all (to have information for all the jobs belonging to that user)
– Default JDL Requirements other.GlueCEStateStatus == "Production"
– Default JDL Rank other.GlueCEStateFreeCPUs;
– Default Virtual Organisation Which VO the job should use to run
![Page 32: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/32.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
UI Command Error Messages
• The UI commands accept some arguments in input. If the user makes a mistake via command line, the following messages can appear:
Argument * is not allowed (the argument is not known)
Argument * must be specified at the end of the command (both the jobId and JDL file name must be put at the end of the command line)
Argument * is missing for the “—output” option (the user forgot to add the parameter, required by the argument)
Argument “-all” cannot be specified with argument “—input” (some arguments are OR-exclusive)
CEId format is: <full hostname>;<port number>/jobmanager-<service>. The provided CEID: “http://lx01.absolute.com:10854/jobmanager” has a wrong format. (the user has mis-spelled the CE identifier after –resource)
![Page 33: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/33.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
Resource Broker errors
• During the calling of the RB API, the following can happen:
Resource Broker “grid013g.cnaf.infn.it:7771” not available (can’t open a connection with the RB specified in the UI configuration file)
Unable to get LB address from RB “grid013g.cnaf.infn.it” (the function get_lb_contact returned an error)
![Page 34: Job Submission](https://reader034.fdocuments.us/reader034/viewer/2022051418/56814f51550346895dbcf6b3/html5/thumbnails/34.jpg)
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688
JDL & Proxy Error Messages
• While the UI commands are checking the JDL file, the following errors may occur:
Mandatory Attribute default error in the configuration file “/opt/edg/etc/UI_ConfigENV.cfg” (there aren’t any default values)
Mandatory Attribute missing in JDL file “Executable” (Executable is one of the mandatory attributes)
Multiple “InputSandbox” attribute found in JDL file (InputSandbox attribute is repeated twice)
Wrong function call for list attribute *. Function usage is: “Member/IsMember(List, Value)” (e.g. in the requirements attribute the function Member/IsMember is used with a wrong syntax)
• Proxy (this refers to the security grid proxy and not to a proxy machine)– If the user specifies a duration for the proxy that he wants to provide,
using the option –h of edg-job-submit, a possible message isProxy certificate will expire in less then X hours. Creating a new X-hours-
duration certificate (this to make sure that at least the required proxy validity is granted )