OGF 19 Condor Software Forum Condor-G

Post on 18-Jan-2016


Jaime Frey, Todd Tannenbaum
Computer Sciences Department
University of Wisconsin-Madison
{jfrey|tannenba}@cs.wisc.edu
http://www.cs.wisc.edu/condor

OGF 19 Condor Software Forum

Condor-G

www.cs.wisc.edu/condor

What Is It?

› Condor-G is a specialization of Condor. It is also known as the “grid universe”.

› Condor-G speaks many different job management protocols.

› Condor-G benefits from all the wonderful Condor features, like a real job queue.


Grid Fault-Tolerance

› Condor-G does whatever it takes to run your jobs, even if …
    › Your local machine crashes
    › The grid service is temporarily unavailable
    › The network goes down

Remote Resource Access: Globus

[Diagram: a user in Organization A runs “globusrun myjob …”, which speaks the Globus GRAM protocol to a Globus JobManager in Organization B; the JobManager starts the job with fork().]


Globus + Condor

[Diagram: a user in Organization A runs “globusrun myjob …”, which speaks the Globus GRAM protocol to a Globus JobManager in Organization B; the JobManager submits the job to a Condor pool.]


Condor-G + Globus + Condor

[Diagram: Condor-G in Organization A holds a queue of jobs (myjob1, myjob2, myjob3, myjob4, myjob5, …) and sends them over the Globus GRAM protocol to a Globus JobManager in Organization B; the JobManager submits them to a Condor pool.]

Condor-G Fault-Tolerance: Lost Contact with Remote Jobmanager

› Can we contact gatekeeper?
    › Yes – jobmanager crashed
    › No – retry until we can talk to gatekeeper again…

› Can we reconnect to jobmanager?
    › Yes – network was down
    › No – machine crashed or job completed

› Has job completed?
    › Yes – update queue
    › No – is job still running?

› Restart jobmanager
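The questions on this slide come from a flowchart whose arrows did not survive in the transcript. The sketch below is one plausible reading, not the gridmanager's actual code: it assumes Condor-G first tries to reconnect to the jobmanager, then probes the gatekeeper, then restarts the jobmanager and asks about the job. The `Outcome` names and the `recover` function are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    NETWORK_WAS_DOWN = "reconnected to jobmanager; network was down"
    RETRY_GATEKEEPER = "gatekeeper unreachable; retry until it answers"
    UPDATE_QUEUE = "jobmanager restarted; job completed, update queue"
    STILL_RUNNING = "jobmanager restarted; job still running"


def recover(can_reconnect_jobmanager: bool,
            can_contact_gatekeeper: bool,
            job_completed: bool) -> Outcome:
    """Walk the lost-contact questions in one ASSUMED order."""
    if can_reconnect_jobmanager:
        # Yes: the network was down and the jobmanager is still there.
        return Outcome.NETWORK_WAS_DOWN
    # No: the machine crashed or the job completed.
    if not can_contact_gatekeeper:
        # No: keep retrying until the gatekeeper answers again.
        return Outcome.RETRY_GATEKEEPER
    # Yes: the jobmanager crashed, so restart it and ask about the job.
    if job_completed:
        return Outcome.UPDATE_QUEUE
    return Outcome.STILL_RUNNING
```

Each branch returns a label rather than performing I/O, so the decision order can be read and checked on its own.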

Just to be fair…

› The gatekeeper doesn’t have to submit to a Condor pool. It could be PBS, LSF, Sun Grid Engine…

› Condor-G will work fine whatever the remote batch system is.

Other Condor-G Features

› Other Grid Protocols
    › Works with WS-GRAM, NorduGrid, Unicore

› Credential Management
    › Pull refreshed credentials from MyProxy
    › Push refreshed credentials to remote systems

› Job Scheduling
    › Use Matchmaking to select resources for jobs

› GlideIn
    › Allows late binding of resources and job checkpoint/migration

Condor-G

[Diagram: Condor-G takes a job description (job ClassAd) and talks to several back ends. Labels: GT2 [.1|2|4] (HTTPS), Condor, PBS/LSF, NorduGrid, GT4 (WSRF), Unicore.]

Pre-WS GRAM

› Submit file
    grid_resource = gt2 foo.edu/jobmanager-pbs
    globus_rsl = (queue=long)(condor_submit=(universe java))

OGSA GRAM

› Submit file
    grid_resource = gt3 http://foo.edu/ogsa/services/base/gram/PBSManagedJobFactoryService
    globus_rsl = (queue=long)(condor_submit=(universe java))

› Museum mode

WS GRAM

› Submit file
    grid_resource = gt4 foo.edu PBS
    globus_xml = <queue>long</queue>

NorduGrid

› Submit file
    grid_resource = nordugrid foo.edu
    nordugrid_rsl = (queue=long)

Unicore

› Submit file
    grid_resource = unicore usite.org vsite
    keystore_file = keystore
    keystore_passphrase_file = keystore.pw
    keystore_alias = my cert

Condor

› Submit file
    grid_resource = condor schedd.foo.edu cm.foo.edu
    remote_universe = java

PBS

› Submit file
    grid_resource = pbs

LSF

› Submit file
    grid_resource = lsf
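The per-protocol submit files differ mainly in the grid_resource line and one or two protocol-specific attributes. As a minimal sketch, assuming you generate submit descriptions from a script, the helper below renders one from those pieces. `render_submit` is a hypothetical helper, not part of Condor, and the executable name `foo` is borrowed from the matchmaking slides.

```python
from typing import Dict, Optional


def render_submit(executable: str,
                  grid_resource: str,
                  extra: Optional[Dict[str, str]] = None) -> str:
    """Render a grid universe submit description as text."""
    lines = [
        f"executable = {executable}",
        "universe = grid",
        f"grid_resource = {grid_resource}",
    ]
    # Protocol-specific attributes such as globus_rsl or nordugrid_rsl.
    for key, value in (extra or {}).items():
        lines.append(f"{key} = {value}")
    lines.append("queue")
    return "\n".join(lines) + "\n"


# Same job, two different grid protocols (values taken from the slides).
pbs_via_gt2 = render_submit("foo", "gt2 foo.edu/jobmanager-pbs",
                            {"globus_rsl": "(queue=long)"})
via_nordugrid = render_submit("foo", "nordugrid foo.edu",
                              {"nordugrid_rsl": "(queue=long)"})
```

The resulting text can be written to a file and passed to condor_submit; only the second and third arguments change between protocols.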

Grid Universe Fault-Tolerance: Credential Management

› Authentication in many grid protocols is done with limited-lifetime X509 proxies

› Proxy may expire before jobs finish executing

› Condor can put jobs on hold and email user to refresh proxy

› Condor can automatically retrieve new proxies from MyProxy

› When the proxy is refreshed, Condor forwards it to the jobs

MyProxy

› Submit file
    MyProxyHost = foo.edu:12345
    MyProxyServerDN = /DC=org/DC=doegrids…
    MyProxyCredentialName = proxy_file
    MyProxyRefreshThreshold = 240 #mins
    MyProxyNewProxyLifetime = 12 #hrs
    MyProxyPassword = password

› Or give password on command line
    condor_submit -p password submit.desc
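The refresh threshold and new-proxy lifetime describe a simple rule: fetch a new proxy once the remaining lifetime drops below the threshold. The sketch below illustrates that rule only. It takes the units from the slide's comments (minutes and hours), which is an assumption to check against the condor_submit manual, and both function names are hypothetical.

```python
from datetime import datetime, timedelta


def needs_refresh(expires_at: datetime, now: datetime,
                  threshold_minutes: int = 240) -> bool:
    """True once the proxy's remaining lifetime is within the threshold."""
    return expires_at - now <= timedelta(minutes=threshold_minutes)


def refreshed_expiry(now: datetime, new_lifetime_hours: int = 12) -> datetime:
    """Expiry time of a proxy fetched from MyProxy at `now`."""
    return now + timedelta(hours=new_lifetime_hours)


# Example: a proxy with 3 hours left is inside the 240-minute threshold.
now = datetime(2007, 1, 1, 12, 0)
proxy_expiry = now + timedelta(hours=3)
if needs_refresh(proxy_expiry, now):
    proxy_expiry = refreshed_expiry(now)
```

With these defaults a proxy is replaced about four hours before it would expire, which leaves room for a MyProxy outage before running jobs are affected.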

Condor-G Matchmaking

› Use Condor-G matchmaking with grid universe jobs

› Allows Condor-G to dynamically assign computing jobs to grid sites

› An example of lazy planning


Condor-G Matchmaking, cont.

› Normally a grid universe job must specify the site in the submit description file via the “grid_resource” attribute like so:

    Executable = foo
    Universe = grid
    Grid_Resource = gt2 beak.cs.wisc.edu/jobmanager-pbs
    queue

Condor-G Matchmaking, cont.

› With matchmaking, grid universe jobs can use requirements and rank:

    Executable = foo
    Universe = grid
    Grid_Resource = $$(ResourceName)
    Requirements = arch == LINUX
    Rank = NumberOfNodes * random()
    Queue

› The $$(x) syntax inserts information from the target ClassAd when a match is made.
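To make the requirements, rank and $$(x) steps concrete, here is a toy matchmaker in Python. It is a sketch of the idea, not the Condor negotiator: ads are plain dicts, requirements and rank are Python callables, only the job's side of the match is checked (a real match also evaluates the resource's Requirements), and rank drops the random() factor so the result is repeatable. The second resource (`bar.edu`) and the `Arch` values are made up for illustration.

```python
import re


def match(job, resources, requirements, rank):
    """Pick the best acceptable resource and fill in $$(Attr) references."""
    candidates = [ad for ad in resources if requirements(ad)]
    if not candidates:
        return None  # no match yet; the job stays idle in the queue
    best = max(candidates, key=rank)

    def substitute(value):
        # Replace each $$(Attr) with that attribute from the matched ad.
        return re.sub(r"\$\$\((\w+)\)",
                      lambda m: str(best[m.group(1)]), value)

    return {key: substitute(value) for key, value in job.items()}


job = {"Executable": "foo", "Universe": "grid",
       "Grid_Resource": "$$(ResourceName)"}
resources = [
    {"ResourceName": "gt2 bar.edu/jobmanager-lsf", "Arch": "LINUX",
     "NumberOfNodes": 50},
    {"ResourceName": "gt4 foo.edu PBS", "Arch": "LINUX",
     "NumberOfNodes": 300},
]
matched = match(job, resources,
                requirements=lambda ad: ad.get("Arch") == "LINUX",
                rank=lambda ad: ad["NumberOfNodes"])
```

Here `matched["Grid_Resource"]` becomes the ResourceName of the larger site, which is the late substitution that $$(ResourceName) performs when a match is made.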

Condor-G Matchmaking, cont.

› Where do these target ClassAds representing Globus gatekeepers come from? Several options:
    › Simple script on gatekeeper publishes an ad via condor_advertise command-line utility (method used by D0 JIM, USCMS)
    › Program to query Globus MDS and convert information into ClassAd (method used by EDG)
    › Run HawkEye with appropriate plugins on the gatekeeper

› For explanation of Condor-G matchmaking setup for USCMS, see http://www.cs.wisc.edu/condor/USCMS_matchmaking.html

Condor-G Matchmaking: Creating the Resource Ad

› Machine Ad
    MyType = "Machine"
    TargetType = "Job"
    Name = "foo.edu"
    Machine = "foo.edu"
    ResourceName = "gt4 foo.edu PBS"
    UpdateSequenceNumber = 4
    Requirements = TARGET.JobUniverse == 9 && CurMatches < 10
    CurMatches = 0
    NumberOfNodes = 300
    Rank = 0.0
    CurrentRank = 0.0
    WantAdRevaluate = True

Condor-G Matchmaking: Creating the Resource Ad

› Advertising a resource
    condor_advertise UPDATE_STARTD_AD ad-file

› Call periodically

› Use unix time for UpdateSequenceNumber
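A script that republishes the ad has two jobs: stamp UpdateSequenceNumber with the current unix time and hand the file to condor_advertise. The sketch below covers the first and only builds the command line for the second. `Expr`, `render_ad` and `advertise_command` are hypothetical helpers, and the quoting rules are a simplification of ClassAd syntax.

```python
import time


class Expr(str):
    """Marks a value as a raw ClassAd expression, not a string literal."""


def render_value(value):
    if isinstance(value, Expr):
        return str(value)                 # emit expressions unquoted
    if isinstance(value, bool):
        return "True" if value else "False"
    if isinstance(value, str):
        return f'"{value}"'               # string literals are quoted
    return repr(value)                    # ints and floats


def render_ad(attrs, now=None):
    """Render the ad text with UpdateSequenceNumber set to unix time."""
    stamped = dict(attrs)
    stamped["UpdateSequenceNumber"] = int(time.time() if now is None else now)
    return "".join(f"{key} = {render_value(value)}\n"
                   for key, value in stamped.items())


def advertise_command(ad_path):
    """Command from the slide, as an argument list for subprocess.run."""
    return ["condor_advertise", "UPDATE_STARTD_AD", ad_path]


ad_text = render_ad({
    "MyType": "Machine",
    "Name": "foo.edu",
    "ResourceName": "gt4 foo.edu PBS",
    "Requirements": Expr("TARGET.JobUniverse == 9 && CurMatches < 10"),
    "NumberOfNodes": 300,
    "WantAdRevaluate": True,
})
```

A cron job or a loop on the gatekeeper could write `ad_text` to a file and run `advertise_command(path)` every few minutes; because the sequence number is the clock, each update is newer than the last.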

But Wait, There’s More…

› What if you want to run standard universe jobs on grid resources?
    › For matchmaking and dynamic scheduling of jobs
    › For job checkpointing and migration
    › For remote system calls

› What if you don’t want to send a job to a site until the moment the job will start running (late binding)?

One Solution: Condor-G GlideIn

› You can use the Grid Universe to run Condor daemons on grid resources

› When the resources run these GlideIn jobs, they will temporarily join your Condor Pool

› You can then submit Standard, Vanilla, PVM, or MPI Universe jobs and they will be matched and run on the grid resources


[Diagram: your workstation runs a personal Condor holding 600 Condor jobs. Glide-in jobs go out through the Globus grid to PBS, LSF and Condor sites, and the resources they claim join your Condor pool. A friendly Condor pool is also shown.]

GlideIn Concerns

› What if a grid resource kills my GlideIn job?
    › That resource will disappear from your pool and your jobs will be rescheduled on other machines
    › Standard universe jobs will resume from their last checkpoint like usual

› What if all my jobs are completed before a GlideIn job runs?
    › If a GlideIn Condor daemon is not matched with a job in 10 minutes, it terminates, freeing the resource
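The second answer is an idle timeout. A minimal sketch of that rule follows. It assumes the daemon tracks the time of its last match in seconds, and `should_terminate` is a hypothetical name rather than a Condor function.

```python
IDLE_LIMIT_SECONDS = 10 * 60  # the 10 minutes quoted on the slide


def should_terminate(last_matched_at: float, now: float,
                     idle_limit: float = IDLE_LIMIT_SECONDS) -> bool:
    """True once the glide-in has gone unmatched for the whole idle limit."""
    return now - last_matched_at >= idle_limit
```

A glide-in that starts at time 0 and never receives a job would exit at the 600-second mark, which returns the slot to the site's own batch system.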

Condor

[Diagram: condor_submit hands the job to the schedd (job caretaker); the matchmaker pairs it with a startd, which runs the job.]

Condor-G

[Diagram: condor_submit → schedd (job caretaker) → gridmanager → gahp → Globus gatekeeper → PBS or LSF.]

Condor-C

[Diagram: condor_submit → schedd (job caretaker) → gridmanager → condor-gahp → remote schedd, which works with its matchmaker and a startd to run the job.]

Condor-C to non-Condor

[Diagram: condor_submit → schedd (job caretaker) → gridmanager → condor-gahp → remote schedd → gridmanager → pbs/lsf-gahp → PBS or LSF.]

Gliding in Condor-C

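[Diagram in two steps. 1. Glide-in: the local schedd (job caretaker), fed by condor_submit, uses a gridmanager and gahp to start a remote schedd through the Globus gatekeeper. 2. Submit jobs: a gridmanager and condor-gahp send jobs to that remote schedd, whose own gridmanager and pbs/lsf-gahp pass them to PBS or LSF.]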

Matchmaking with Condor-C

› In all of these examples, Condor-C went to a specific remote schedd

› This is not required: you can do matchmaking


Matchmaking with Condor-C

[Diagram: condor_submit → schedd (job caretaker) → gridmanager → condor-gahp. A matchmaker chooses among several remote schedds, and the job is submitted to the one selected.]