Condor-G: A Case in Distributed Job Delegation

Post on 10-Feb-2016

Transcript of Condor-G: A Case in Distributed Job Delegation

Jaime Frey
Computer Sciences Department
University of Wisconsin-Madison

jfrey@cs.wisc.edu
http://www.cs.wisc.edu/condor

Condor-G: A Case in Distributed Job Delegation

www.cs.wisc.edu/condor

Job Delegation

› Transfer of responsibility to schedule and execute a job
› Multiple delegations can form a chain

Job Delegation in Condor-G Today

[Diagram: Condor-G, Globus GRAM, Batch System Front-end, Execute Machine]

Expanding the Model

› What can we do with new forms of job delegation?
› Some ideas
    Mirroring
    Load-balancing
    Glide-in schedd
    Multi-hop grid scheduling

Mirroring

› What it does
    Jobs mirrored on two Condor-Gs
    If primary Condor-G crashes, secondary one starts running jobs
    On recovery, primary Condor-G gets job status from secondary one
› Removes Condor-G submit point as single point of failure
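
The failover behavior described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CondorG` and `MirroredPair` classes, their method names, and the status strings are hypothetical and are not part of Condor-G. The slide also does not say which node runs jobs after the primary recovers; the sketch assumes the primary takes over again.

```python
from dataclasses import dataclass, field


@dataclass
class CondorG:
    """Hypothetical stand-in for one Condor-G submit point."""
    name: str
    alive: bool = True
    active: bool = False  # True while this node is the one running jobs
    job_status: dict = field(default_factory=dict)


class MirroredPair:
    """Hypothetical pairing of a primary and a secondary submit point."""

    def __init__(self, primary, secondary):
        self.primary, self.secondary = primary, secondary
        primary.active = True

    def submit(self, job_id):
        # Every job is mirrored on both submit points.
        self.primary.job_status[job_id] = "idle"
        self.secondary.job_status[job_id] = "idle"

    def update_status(self, job_id, status):
        # The active node records the update and mirrors it if its peer is up.
        active = self.primary if self.primary.active else self.secondary
        standby = self.secondary if active is self.primary else self.primary
        active.job_status[job_id] = status
        if standby.alive:
            standby.job_status[job_id] = status

    def primary_crashed(self):
        # The secondary starts running the jobs.
        self.primary.alive = False
        self.primary.active = False
        self.secondary.active = True

    def primary_recovered(self):
        # The primary first fetches job status from the secondary.
        self.primary.job_status = dict(self.secondary.job_status)
        self.primary.alive = True
        # Assumption: the primary then resumes as the active node.
        self.primary.active = True
        self.secondary.active = False
```

While the primary is down, status changes reach only the secondary, which is why the recovery step has to copy the secondary's view before the primary resumes.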

Mirroring Example

[Diagram: Condor-G 1, Matchmaker, Execute Machine, Condor-G 2]

Load-Balancing

› What it does
    Front-end Condor-G distributes all jobs among several back-end Condor-Gs
    Front-end Condor-G keeps updated job status
› Improves scalability
› Maintains single submit point for users
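
A minimal sketch of the front-end role, assuming a round-robin distribution policy. The slide does not say how jobs are spread across back-ends, so round-robin is an assumption, and the `FrontEnd` class and its fields are hypothetical names used only for illustration.

```python
from itertools import cycle


class FrontEnd:
    """Hypothetical front-end that spreads jobs over back-end submit points."""

    def __init__(self, backends):
        # Each back-end is modeled as a plain {job_id: status} table.
        self.backends = {name: {} for name in backends}
        self._next = cycle(backends)  # assumption: round-robin placement
        self.placement = {}           # job_id -> back-end name
        self.status = {}              # the front-end's own copy of job status

    def submit(self, job_id):
        # Users only ever talk to the front-end (single submit point).
        backend = next(self._next)
        self.backends[backend][job_id] = "idle"
        self.placement[job_id] = backend
        self.status[job_id] = "idle"
        return backend

    def refresh(self):
        # Pull the latest status of every job from the back-end that holds it.
        for job_id, backend in self.placement.items():
            self.status[job_id] = self.backends[backend][job_id]
```

With three back-ends, four submissions land on the first, second, third, and then the first back-end again, while `status` stays the one place a user needs to look.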

Load-Balancing Example

[Diagram: Condor-G Front-end, Condor-G Back-end 1, Condor-G Back-end 2, Condor-G Back-end 3]

Glide-In Schedd

› What it does
    Drop a Condor-G onto the front-end machine of a cluster
    Delegate jobs to the cluster through the glide-in schedd
› Apply cluster-specific policies to jobs

Glide-In Schedd Example

[Diagram: Condor-G, Glide-In Schedd, Batch System]

Multi-Hop Grid Scheduling

› Match a job to a Virtual Organization (VO), then to a resource within that VO

› Easier to schedule jobs across multiple VOs and grids
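
The two-stage match can be illustrated with a small Python sketch. It is a simplification under stated assumptions: real matchmaking in Condor evaluates ClassAd expressions, whereas this sketch uses plain attribute equality, and the function names and record layout (`vo_requirements`, `resource_requirements`, `attrs`, `resources`) are hypothetical.

```python
def first_match(requirements, candidates):
    """Return the first candidate name whose attributes equal every requirement."""
    for name, attrs in candidates.items():
        if all(attrs.get(key) == value for key, value in requirements.items()):
            return name
    return None


def multi_hop_match(job, vos):
    """Hop 1: pick a VO. Hop 2: pick a resource inside that VO."""
    vo_attrs = {name: vo["attrs"] for name, vo in vos.items()}
    vo = first_match(job["vo_requirements"], vo_attrs)
    if vo is None:
        return None
    resource = first_match(job["resource_requirements"], vos[vo]["resources"])
    if resource is None:
        return None
    return vo, resource


# Hypothetical example data.
vos = {
    "vo-a": {"attrs": {"experiment": "x"},
             "resources": {"cluster1": {"arch": "x86_64"}}},
    "vo-b": {"attrs": {"experiment": "y"},
             "resources": {"cluster2": {"arch": "ppc"},
                           "cluster3": {"arch": "x86_64"}}},
}
job = {"vo_requirements": {"experiment": "y"},
       "resource_requirements": {"arch": "x86_64"}}
```

Here `multi_hop_match(job, vos)` selects `vo-b` first and only then looks at that VO's resources, so the first hop never needs to know about individual clusters.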

Multi-Hop Grid Scheduling Example

[Diagram: Experiment Condor-G, Experiment Resource Broker, VO Condor-G, VO Resource Broker, Globus GRAM, Batch Scheduler]

Endless Possibilities

› These new models can be combined with each other or with other new models
› Resulting system can be arbitrarily sophisticated

Job Delegation Challenges

› New complexity introduces new issues and exacerbates existing ones

› A few…
    Transparency
    Representation
    Scheduling Control
    Active Job Control
    Revocation
    Error Handling and Debugging

Transparency

› Full information about job should be available to user
    Information from full delegation path
    No manual tracing across multiple machines
› Users need to know what’s happening with their jobs

Representation

› Job state is a vector
› How best to show this to user
    Summary
        • Current delegation endpoint
        • Job state at endpoint
    Full information available if desired
        • Series of nested ClassAds?
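
One way to picture the "vector" and the summary view is with nested records, one per hop. The sketch below uses nested Python dicts as a stand-in for nested ClassAds; the key names (`system`, `state`, `delegated_to`) and the state strings are hypothetical and chosen only for illustration.

```python
def state_vector(job_ad):
    """Full view: one (system, state) pair per hop, from submit point to endpoint."""
    vector = []
    hop = job_ad
    while hop is not None:
        vector.append((hop["system"], hop["state"]))
        hop = hop.get("delegated_to")
    return vector


def summarize(job_ad):
    """Summary view: the current delegation endpoint and the job state there."""
    return state_vector(job_ad)[-1]


# Hypothetical job record for a three-hop delegation path.
job_ad = {
    "system": "Condor-G",
    "state": "delegated",
    "delegated_to": {
        "system": "Globus GRAM",
        "state": "delegated",
        "delegated_to": {
            "system": "Batch System",
            "state": "running",
            "delegated_to": None,
        },
    },
}
```

The summary is just the last element of the vector, so a tool could show `summarize(job_ad)` by default and fall back to `state_vector(job_ad)` when the user asks for the full path.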

Scheduling Control

› Avoid loops in delegation path
› Give user control of scheduling
    Allow limiting of delegation path length?
    Allow user to specify part or all of delegation path
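
Loop avoidance and a path-length limit could both be enforced at the moment a job is handed to the next hop. The following is a sketch under assumptions: the job is assumed to carry the list of systems that have already held it, and `delegate`, `DelegationError`, and `max_hops` are hypothetical names.

```python
class DelegationError(Exception):
    """Raised when a delegation would violate a scheduling constraint."""


def delegate(path, next_hop, max_hops=None):
    """Return the path extended by next_hop.

    path lists every system that has held the job, starting with the
    submit point, so the number of hops taken so far is len(path) - 1.
    """
    if next_hop in path:
        raise DelegationError(f"loop: {next_hop} already holds this job")
    # After this delegation the hop count would be len(path).
    if max_hops is not None and len(path) > max_hops:
        raise DelegationError(f"path would exceed {max_hops} hops")
    return path + [next_hop]
```

Returning a new list rather than mutating `path` keeps each hop's view of the path independent of later delegations.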

Active Job Control

› User may request certain actions
    hold, suspend, vacate, checkpoint
› Actions cannot be completed synchronously for user
    Must forward along delegation path
    User checks completion later

Active Job Control (cont)

› Endpoint systems may not support actions
    If possible, execute them at furthest point that does support them
› Allow user to apply action in middle of delegation path
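
Picking the furthest point that supports an action amounts to scanning the delegation path from the endpoint backwards. The sketch below assumes each hop advertises the set of actions it supports; the function name, the hop names, and the capability sets are hypothetical and do not describe what any real system supports.

```python
def furthest_supporting_hop(path, action):
    """Return the name of the hop closest to the endpoint that supports action.

    path is a list of (system_name, supported_actions) pairs ordered from
    the submit point to the endpoint. Returns None if no hop supports it.
    """
    for name, supported in reversed(path):
        if action in supported:
            return name
    return None


# Hypothetical delegation path with made-up capabilities.
path = [
    ("submit-point", {"hold", "suspend", "vacate", "checkpoint"}),
    ("middle-hop", {"hold", "vacate"}),
    ("endpoint", {"vacate"}),
]
```

With this path, a `vacate` request runs at the endpoint, a `hold` stops at the middle hop, and a `checkpoint` can only be handled back at the submit point.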

Revocation

› Leases
    Lease must be renewed periodically for delegation to remain valid
    Allows revocation during long-term failures
› What are good values for lease lifetime and update interval?
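
The lease idea can be made concrete with a short sketch. The `Lease` class below is hypothetical, and timestamps are passed in explicitly (as plain seconds) to keep the example deterministic. The 600-second lifetime in the usage note is an arbitrary illustration, not a recommended value.

```python
class Lease:
    """Hypothetical lease that keeps a delegation valid while it is renewed."""

    def __init__(self, lifetime, now):
        self.lifetime = lifetime          # seconds a renewal stays good for
        self.expires_at = now + lifetime

    def is_valid(self, now):
        return now < self.expires_at

    def renew(self, now):
        # An expired lease cannot be revived; the delegation is revoked.
        if not self.is_valid(now):
            return False
        self.expires_at = now + self.lifetime
        return True
```

For example, `Lease(lifetime=600, now=0)` renewed at `now=300` stays valid until 900. If the delegating side is unreachable for longer than the lifetime, the renewals stop, the lease lapses, and the remote side can treat the delegation as revoked without any explicit message. The trade-off behind the slide's open question is visible here: a short lifetime revokes quickly but requires frequent renewals.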

Error Handling and Debugging

› Many more places for things to go horribly wrong
› Need clear, simple error semantics
› Logs, logs, logs
    Have them everywhere

Current Status

› Done
    Mirroring
› In Progress
    Condor-G -> Condor-G delegation
        • User must specify hops
    Glide-in schedd
        • Set up by hand

Thank You!

› Questions?