
Jaime Frey
Computer Sciences Department
University of Wisconsin-Madison
jfrey@cs.wisc.edu
http://www.cs.wisc.edu/condor

OGF 19
Condor Software Forum

Routing Jobs to the Grid

www.cs.wisc.edu/condor

Job Router
a.k.a. Schedd On The Side

What’s a Job Router?

Specialized scheduler operating on schedd’s jobs.

[Diagram: Schedd and its job queue, holding Job 1, Job 2, Job 3, Job 4, Job 5, …, Job 4*]


Adapted Quill Technology

› Using Quill library to mirror job queue in memory
  o Efficient - just “tails” the log
  o Independent - mirror without clogging schedd command queue

› Modifying the job queue is another matter - must interact with schedd
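
The "tail the log" idea can be sketched roughly as follows. This is an illustration only, not the Quill library's API: the record format (`SET <job> <attr> <value>` and `REMOVE <job>`, one per line), the `QueueMirror` class, and the `append` helper are all assumptions invented for the example. The point is that the mirror only reads the log, so it never sends a command to the schedd.

```python
import io


class QueueMirror:
    """In-memory mirror of a job queue, rebuilt by tailing an append-only log.

    Hypothetical record format (NOT the real schedd log format):
        SET <job_id> <attr> <value>
        REMOVE <job_id>
    """

    def __init__(self, log):
        self.log = log      # any seekable text stream
        self.offset = 0     # position just past the last complete record
        self.jobs = {}      # job_id -> {attr: value}

    def poll(self):
        """Apply records appended since the last poll; return how many."""
        self.log.seek(self.offset)
        applied = 0
        while True:
            line = self.log.readline()
            if not line.endswith("\n"):
                break       # end of log, or a half-written record: retry later
            self.offset = self.log.tell()
            parts = line.strip().split(maxsplit=3)
            if len(parts) == 4 and parts[0] == "SET":
                self.jobs.setdefault(parts[1], {})[parts[2]] = parts[3]
                applied += 1
            elif len(parts) == 2 and parts[0] == "REMOVE":
                self.jobs.pop(parts[1], None)
                applied += 1
        return applied


def append(log, text):
    """Simulate the schedd appending to its log (stand-in for the real writer)."""
    log.seek(0, io.SEEK_END)
    log.write(text)


log = io.StringIO()
mirror = QueueMirror(log)

append(log, "SET 1.0 Owner alice\nSET 2.0 Owner bob\n")
first = mirror.poll()

append(log, "REMOVE 1.0\nSET 2.0 Jo")    # second record is only half written
second = mirror.poll()

append(log, "bStatus 1\n")               # the writer finishes the record
third = mirror.poll()
```

Remembering the offset of the last complete record is what keeps each poll cheap: only new bytes are read, and a partially written record is simply picked up on the next pass.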


Usage Case

Routing: Vanilla -> Grid


Condor Farm Story

[Diagram: Application, condor_submit, Schedd with job queue, Startd resources running Random Seed jobs]

• Now that this is working, how can I use my collaborator’s resources too?


Option #1: Merge Farms

› Combine machines with collaborator into one Condor resource pool.
  o Everything works just like it did before.
  o Excellent option for small to medium clusters.
  o Requires bidirectional connectivity to all startds, or equivalent via GCB.
  o Requires some administrative coordination (e.g. upgrades, negotiator policy, security, etc.)


Option #1b: submit to multiple pools

› condor_submit -remote …

› Works

› Ok for small scale

› Have to manually partition jobs


Option #2: Flocking Together

[Diagram: Schedd, local startds, remote startds, Random Seed jobs]

• full featured (std universe etc.)
• automatic matchmaking
• easy to configure

• requires bidirectional connectivity
• both sites must run Condor


Option #3: Grid Universe

[Diagram: Schedd, Startds (vanilla), Gatekeeper X (site X), Random Seed jobs]

• easier to live with private networks
• may use non-Condor resources

• restricted Condor feature set (e.g. no std universe over grid)
• must pre-allocate jobs between vanilla and grid universe


Option #4: Routing Jobs

[Diagram: Schedd, local startds (vanilla), Schedd On The Side, gatekeepers X, Y, Z, site X, site Y, site Z, Random Seed jobs]

• dynamic allocation of jobs between vanilla and grid universes.
• not every job is appropriate for transformation into a grid job.


Example Routing Table

[ GridResource = "gt2 gatekeeper.site1/jobmanager-pbs";
  MaxJobs = 500;
  MaxIdle = 50;
  set_GlobusRSL = "(…)"
]
[ GridResource = "condor schedd.site2 collector.site2";
  MaxJobs = 700;
  MaxIdle = 100;
  Requirements = other.ImageSize < 500
]
…
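
One way to picture how such a table might be applied is the sketch below. It is not the Job Router's actual algorithm: the first-match policy, the `pick_route` and `route_job` helpers, and the `usage` bookkeeping are assumptions made for illustration, and a Python callable stands in for ClassAd evaluation of `Requirements`. `MaxJobs` and `MaxIdle` are read here as per-route caps on routed jobs and on routed jobs that are still idle.

```python
import copy

# The two routes from the slide, as Python dicts. "(…)" is elided in the
# original and is kept as-is.
ROUTES = [
    {"GridResource": "gt2 gatekeeper.site1/jobmanager-pbs",
     "MaxJobs": 500, "MaxIdle": 50,
     "set": {"GlobusRSL": "(…)"}},
    {"GridResource": "condor schedd.site2 collector.site2",
     "MaxJobs": 700, "MaxIdle": 100,
     "Requirements": lambda job: job.get("ImageSize", 0) < 500},
]


def pick_route(job, routes, usage):
    """Return the first route that accepts the job and has spare capacity.

    usage maps GridResource -> {"jobs": n, "idle": n} for jobs already routed.
    First-match is an assumed policy, chosen only to keep the sketch short.
    """
    for route in routes:
        accepts = route.get("Requirements", lambda job: True)
        used = usage.get(route["GridResource"], {"jobs": 0, "idle": 0})
        if (accepts(job)
                and used["jobs"] < route["MaxJobs"]
                and used["idle"] < route["MaxIdle"]):
            return route
    return None


def route_job(job, route):
    """Build the grid-universe copy of a job; the original is left untouched."""
    routed = copy.deepcopy(job)
    routed["Universe"] = "grid"
    routed["GridResource"] = route["GridResource"]
    routed.update(route.get("set", {}))    # apply the set_* attributes
    return routed


big_job = {"Universe": "vanilla", "ImageSize": 800}
small_job = {"Universe": "vanilla", "ImageSize": 100}
site1_full = {"gt2 gatekeeper.site1/jobmanager-pbs": {"jobs": 120, "idle": 50}}

routed = route_job(big_job, pick_route(big_job, ROUTES, {}))
```

With `site1_full` as the usage, the first route is at its idle limit, so `small_job` falls through to the second route while `big_job` fails that route's `Requirements` and stays in the vanilla queue.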


What About I/O?

› Jobs must be sandboxable (i.e. specifying input/output via transfer-files mechanism).

› Routing of standard universe is not supported.

› Must have enough storage space at site for input/output files!
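
The three conditions above amount to a pre-routing check, sketched here under stated assumptions: the dictionary keys (`universe`, `uses_file_transfer`, `input_mb`, `output_mb`) and the `can_route` helper are hypothetical stand-ins, not real job-ad attribute names.

```python
def can_route(job, site_free_mb):
    """Return (ok, reason) for routing a job to a site.

    All keys are hypothetical stand-ins for job-ad attributes.
    """
    if job.get("universe") == "standard":
        return False, "standard universe is not routed"
    if not job.get("uses_file_transfer", False):
        return False, "job is not sandboxable"
    needed_mb = job.get("input_mb", 0) + job.get("output_mb", 0)
    if needed_mb > site_free_mb:
        return False, "not enough storage at site"
    return True, "ok"


job = {"universe": "vanilla", "uses_file_transfer": True,
       "input_mb": 300, "output_mb": 200}
```

For this job, a site with 1000 MB free passes the check and a site with 400 MB free is rejected for lack of storage.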


What Types of Grids?

› Routing table may contain any combination of grid types supported by Condor’s grid universe.

› Example: Condor-C

[Diagram: Schedd, Schedd On The Side, Schedd X at site X, Random Seed jobs]

• for two Condor sites, schedd-to-schedd submission requires no additional software
• however, still not as trivial to use as flocking


Source Routing

› Routing the old-fashioned way:

universe = Grid
GridResource = condor site1 …
remote_universe = Grid
remote_GridResource = condor site2 …
remote_remote_universe = Grid
remote_remote_GridResource = pbs
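
The pattern in the submit description above is mechanical: each additional hop adds one more `remote_` prefix. A small sketch makes that explicit; the `source_route` helper is hypothetical and simply reproduces the slide's lines from a list of hops, with the elided `…` kept as written.

```python
def source_route(hops):
    """Return submit-description lines that send a job through each hop in order.

    hops is a list of GridResource strings; hop number i (counting from 0)
    gets i 'remote_' prefixes.
    """
    lines = []
    for depth, resource in enumerate(hops):
        prefix = "remote_" * depth
        lines.append(f"{prefix}universe = Grid")
        lines.append(f"{prefix}GridResource = {resource}")
    return lines


lines = source_route(["condor site1 …", "condor site2 …", "pbs"])
print("\n".join(lines))
```

Seeing it this way shows why source routing is "old-fashioned": the full path is fixed in the submit description before the job ever enters the queue.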


Routing At the Site

[Diagram: Gatekeeper X, Schedd, Schedd On The Side, Schedd, with labels X, X2, X3]

• navigate internal firewalls
• provide custom routes for special users
• improve scalability
• However, keep in mind I/O requirements etc.


Multicast in Future?

› Currently: route one job to one site

› Multicast: route one job to many sites

› Thin out all but first to germinate

› … or all but first to yield fruit.
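
Since multicast is only a future idea here, the following is a speculative sketch of the two thinning policies rather than anything implemented. The `multicast` function, the state names (`running`, `done`), and the event list are all assumptions: "germinate" is read as "first copy to start running" and "fruit" as "first copy to finish".

```python
def multicast(sites, events, policy="germinate"):
    """Pick the surviving copy of a job that was routed to several sites.

    sites   -- sites the job was copied to
    events  -- (site, state) pairs in the order they were observed
    policy  -- "germinate": keep the first copy to start running
               "fruit":     keep the first copy to finish
    Returns (winner, sites_to_remove); (None, []) if nothing has triggered yet.
    """
    trigger = {"germinate": "running", "fruit": "done"}[policy]
    for site, state in events:
        if site in sites and state == trigger:
            return site, [s for s in sites if s != site]
    return None, []


sites = ["X", "Y", "Z"]
events = [("Y", "running"), ("X", "running"), ("X", "done"), ("Y", "done")]

early = multicast(sites, events, "germinate")
late = multicast(sites, events, "fruit")
```

In this event order, site Y starts first but site X finishes first, so the two policies keep different copies.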


Future Glidein Factory

[Diagram: Schedd and Schedd On The Side at home; glidein jobs sent through Gatekeeper X; Startds at site X running Random Seed jobs]

• true late binding of jobs to resources
• may run on top of non-Condor sites
• supports full feature-set of Condor (e.g. standard universe)

• requires GCB for private networks


Gliding in the Factory

[Diagram: Schedd, Schedd On The Side, glidein factory at site X, schedd-to-schedd and schedd-to-gatekeeper links, Random Seed jobs]

• hierarchical strategy for scalability and reliability
• better match for private networks

• may require some additional horsepower from gatekeeper machine, perhaps a dedicated element for “edge services”.


Pluggable Router

› Beyond simple ClassAd transforms

› Plugins would fire when job matches entry in routing table

› Don’t yet understand semantics

› There is work to do!


Thanks

Interested? Let us know.

Jaime Frey
jfrey@cs.wisc.edu

We are currently using job routing for specific users at UW.

Future development will focus on more use-cases.