Condor DAGMan: Introduction & Update
Transcript of Condor DAGMan: Introduction & Update
Peter Couvares
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor
DAGMan
› Directed Acyclic Graph Manager
› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
› (e.g., “Don’t run job ‘B’ until job ‘A’ has completed successfully.”)
Why is This Important?
› Most real science involves complex sequences of tasks – on many resources at many sites. E.g., move data, compute, check, move back, etc.
› … and many types of jobs working together: Condor, Grid (Condor-G), MPI, shell scripts, etc.
› Failures are a certainty, so recoverability of the sequence – not just the jobs – is crucial.
What is a DAG?
› A DAG is the data structure used by DAGMan to represent these dependencies.
› Each job is a “node” in the DAG.
› Each node can have any number of “parent” or “children” nodes – as long as there are no loops!
[Diagram: a diamond-shaped DAG. Job A is the parent of Job B and Job C, which are both parents of Job D.]
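The "no loops" rule can be sketched in a few lines of Python. This is only an illustration, not DAGMan's internal representation: the DIAMOND dictionary and the topological_order function are hypothetical names, and the check uses a standard parent-counting approach (Kahn's algorithm).

```python
from collections import deque

# Hypothetical in-memory form of the diamond DAG: node -> list of children.
DIAMOND = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}


def topological_order(children):
    """Return nodes so that every parent precedes its children.

    Raises ValueError if the graph contains a loop (so it is not a DAG).
    """
    # Count how many parents each node is still waiting on.
    pending_parents = {node: 0 for node in children}
    for kids in children.values():
        for kid in kids:
            pending_parents[kid] += 1

    # Start with the nodes that have no parents at all.
    ready = deque(sorted(n for n, count in pending_parents.items() if count == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for kid in children[node]:
            pending_parents[kid] -= 1
            if pending_parents[kid] == 0:
                ready.append(kid)

    # Any node never reached sits on (or behind) a loop.
    if len(order) != len(children):
        raise ValueError("loop detected: not a DAG")
    return order
```

Calling topological_order(DIAMOND) yields one valid run order for the diamond; a graph in which two nodes name each other as children raises ValueError.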
Defining a DAG
› A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
› Each node will run the Condor or Grid job specified by its accompanying Condor submit file
[Diagram: the diamond DAG. Job A, then Job B and Job C, then Job D.]
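To make the file format concrete, here is a minimal Python sketch that reads the Job and Parent/Child lines of the example above. It is not DAGMan's parser: parse_dag and DIAMOND_DAG are hypothetical names, and every other keyword is simply ignored.

```python
def parse_dag(text):
    """Parse Job and Parent/Child lines into (submit_files, children)."""
    submit_files = {}  # node name -> submit file
    children = {}      # node name -> list of child node names
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        words = line.split()
        keyword = words[0].lower()
        if keyword == "job":
            name, submit_file = words[1], words[2]
            submit_files[name] = submit_file
            children.setdefault(name, [])
        elif keyword == "parent":
            # Everything before "Child" is a parent, everything after is a child.
            split_at = [w.lower() for w in words].index("child")
            parents, kids = words[1:split_at], words[split_at + 1:]
            for parent in parents:
                children.setdefault(parent, []).extend(kids)
        # Other keywords are ignored in this sketch.
    return submit_files, children


DIAMOND_DAG = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""
```

parse_dag(DIAMOND_DAG) returns the submit file for each node plus a node-to-children map, which is enough to drive a simple scheduler.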
Submitting a DAG
› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:
% condor_submit_dag diamond.dag
› condor_submit_dag submits a Scheduler Universe job to run DAGMan under Condor… so DAGMan itself will be robust in case of failure, machine reboots, etc.
Running a DAG
› DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.
[Diagram: DAGMan reads the .dag file (nodes A, B, C, D) and places job A in the Condor job queue.]
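The core decision a meta-scheduler makes is "which nodes may be submitted now?". A hedged Python sketch of that rule follows; ready_nodes and PARENTS are hypothetical names, and the real DAGMan works from the Condor job log rather than from in-memory sets.

```python
# Hypothetical parent map for the diamond DAG: node -> list of parents.
PARENTS = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}


def ready_nodes(parents, done, submitted):
    """Return nodes whose parents have all completed and that are not yet queued."""
    return sorted(
        node
        for node, node_parents in parents.items()
        if node not in done
        and node not in submitted
        and all(p in done for p in node_parents)
    )
```

With nothing done, only A is ready; once A is done, B and C become ready together; D waits until both B and C are done.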
Running a DAG (cont’d)
› DAGMan holds & submits jobs to the Condor queue at the appropriate times.
[Diagram: DAGMan, holding nodes A, B, C, and D, with jobs B and C in the Condor job queue.]
Running a DAG (cont’d)
› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.
[Diagram: one node has failed (marked X); DAGMan writes a rescue file.]
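The "continue until no progress, then write a rescue file" behaviour can be sketched as follows. This is an illustration under assumptions: run_until_stuck, rescue_text, and the succeeds callback are hypothetical names, jobs run one at a time, and the rescue output here lists only Job lines with finished nodes marked DONE (the exact layout of a real rescue file may differ).

```python
def run_until_stuck(parents, succeeds):
    """Run nodes in dependency order; stop when no further node can start.

    `succeeds` is a hypothetical callback: node name -> True if the job worked.
    Returns (done, failed) as two sets of node names.
    """
    done, failed = set(), set()
    progress = True
    while progress:
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            # A node may start only when all of its parents succeeded.
            if all(p in done for p in parents[node]):
                (done if succeeds(node) else failed).add(node)
                progress = True
    return done, failed


def rescue_text(submit_files, done):
    """Re-emit Job lines, marking finished nodes DONE so a re-run can skip them."""
    lines = []
    for node in sorted(submit_files):
        suffix = " DONE" if node in done else ""
        lines.append(f"Job {node} {submit_files[node]}{suffix}")
    return "\n".join(lines)
```

If C fails in the diamond DAG, A and B still complete, D never starts, and the rescue text marks only A and B as DONE.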
Recovering a DAG
› Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Diagram: DAGMan reads the rescue file and places job C in the Condor job queue.]
Finishing a DAG
› Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Diagram: all four nodes (A, B, C, D) complete; the Condor job queue is shown with no remaining jobs.]
Additional DAGMan Features
› DAGMan provides other handy knobs for job management…
  • nodes can have PRE & POST scripts
  • job submission can be “throttled”
  • NEW: failed nodes can be automatically re-tried a configurable number of times
PRE & POST Scripts
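› PRE and POST scripts execute locally on the submit host, before the node’s job is submitted or after it finishes…
› Example:
    # diamond.dag
    Job A a.sub
    Script PRE A prepare-A.sh
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Script POST D double-check.sh
    Parent A Child B C
    Parent B C Child D
› PRE/POST scripts are part of the node
[Diagram: the diamond DAG, with a PRE script attached to Job A and a POST script attached to Job D.]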
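One way to picture "scripts are part of the node" is to treat a node as up to three steps run in sequence. The Python sketch below is a simplification: run_node is a hypothetical name, each step is a zero-argument callable returning an exit code, and the sketch stops at the first non-zero exit code, which may not match DAGMan's exact rules for when POST runs.

```python
def run_node(pre=None, job=None, post=None):
    """Run PRE, then the job, then POST; succeed only if every step returns 0.

    Each step is a hypothetical zero-argument callable returning an exit code.
    Steps left as None are skipped.
    """
    for step in (pre, job, post):
        if step is not None and step() != 0:
            return False  # simplification: stop at the first failing step
    return True
```

Under this model a failing PRE script means the job is never submitted, and the whole node counts as failed.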
DAG “Throttling”
› You can tell DAGMan to limit the maximum number of jobs it submits at any one time
  • condor_submit_dag -maxjobs N
  • useful for managing resource limitations (e.g., licenses)
› You can also limit the number of simultaneous PRE or POST scripts.
  • Added after Vladimir Litvin’s 7000-node DAG started 7000 PRE scripts on his machine!
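The effect of a -maxjobs N limit can be sketched as a simple slot count. This is an illustration only: submit_throttled is a hypothetical name, and the lists stand in for DAGMan's own bookkeeping.

```python
def submit_throttled(ready, running, max_jobs):
    """Pick jobs to submit now without exceeding max_jobs in the queue.

    Returns (to_submit, still_waiting), preserving the order of `ready`.
    """
    free_slots = max(0, max_jobs - len(running))
    return ready[:free_slots], ready[free_slots:]
```

With one job already running and a limit of two, only the first ready job is submitted; the rest wait until a slot frees up.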
Node RETRY
› Tells DAGMan to re-run a node multiple times if necessary…
› Example:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    RETRY B 5
    Job C c.sub
    RETRY C 5
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
[Diagram: the diamond DAG. Job A, then Job B and Job C, then Job D.]
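The retry behaviour amounts to a bounded loop around the node. In the Python sketch below, run_with_retry is a hypothetical name and the job is a callable returning an exit code; it assumes that "RETRY B 5" means one initial attempt plus up to five retries.

```python
def run_with_retry(job, retries):
    """Run `job` once, then up to `retries` more times while it keeps failing.

    `job` is a hypothetical zero-argument callable returning an exit code.
    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, retries + 2):  # 1 initial attempt + `retries` retries
        if job() == 0:
            return attempt
    return None
```

A job that fails twice and then succeeds finishes on attempt 3; a job that always fails is given up on after the last retry.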
DAGMan Progress
› Testing… lots of testing.
  • 10,000+ node DAGs run smoothly
  • Developed automated DAG testing tools to generate random DAGs and test for correct execution (Ning Lin & Will McDonald)
  • Lots of bugs fixed
DAGMan Progress (cont’d)
› New features
  • Improved logging (timestamps, etc.)
  • More efficient recovery
  • Node RETRY capability
  • DAG info in condor_q (with -dag flag)
  • Robust in more failure cases
  • Recursive DAGs for conditional execution
› DAGMan for Windows (Ray Pingree)
DAGMan Success
› DAGMan is becoming part of the common framework for running on the grid.
  • Particle Physics Data Grid (PPDG)
  • Grid Physics Network (GriPhyN)
  • Many Supercomputing 2001 demos
  • more…
DAGMan in the GriPhyN Architecture
[Diagram: GriPhyN architecture. Components: Application, Planner, Executor, Catalog Services, Info Services, Policy/Security, Monitoring, Repl. Mgmt., Reliable Transfer Service, Compute Resource, Storage Resource. Two arrows are labeled DAG. Technology labels: DAGMAN, Kangaroo; GRAM; GridFTP, GRAM, SRM; GSI, CAS; MDS; MCAT, GriPhyN catalogs; GDMP; MDS; Globus.]
diagram by Ian Foster (Argonne)
DAGMan in PPDG Tools
diagram by Jim Amundson (Fermilab)
What’s Next?
› More flexible control of node execution
  • Currently implicit: “all my parents returned 0”.
  • Why not “all parents returned 0 AND ran for more than two hours”, or “parent A returned 0 and parent B returned 42”?
› 1st step: represent DAG nodes internally as ClassAds
  • Allows DAGMan to decide when to run nodes based on arbitrary requirements
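The idea of "arbitrary requirements" can be pictured as swapping the fixed "all parents returned 0" test for a per-node predicate. The Python sketch below is purely illustrative and is not ClassAd syntax: the function names and the exit_code and runtime_hours fields are hypothetical.

```python
def all_parents_ok(results):
    """Today's implicit rule: every parent returned 0."""
    return all(r["exit_code"] == 0 for r in results.values())


def ok_and_long_running(results):
    """All parents returned 0 AND ran for more than two hours."""
    return all_parents_ok(results) and all(
        r["runtime_hours"] > 2 for r in results.values()
    )


def a_zero_b_42(results):
    """Parent A returned 0 and parent B returned 42."""
    return results["A"]["exit_code"] == 0 and results["B"]["exit_code"] == 42


def may_run(requirement, results):
    """Decide whether a node may run, given its parents' recorded results."""
    return requirement(results)
```

Each node would carry its own requirement, so the same parent results can release one child and hold back another.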
What’s Next? (cont’d)
› Extend DAGMan to utilize DaP Scheduler (DaP?) to intelligently schedule data transfers along with Condor and Condor-G jobs.
[Diagram: DAGMan connected to Condor-G, Condor, and the DaP Scheduler.]
Thank You!
› Interested in seeing more?
  Come to the DAGMan BoF
    • Wednesday 9am - noon
    • Room 3393, Computer Sciences (1210 W. Dayton St.)
  Email us:
    • [email protected]
  Try it!
    • http://www.cs.wisc.edu/condor