Jaime Frey Computer Sciences Department University of Wisconsin-Madison What’s New in Condor-G.

Post on 18-Jan-2018


Jaime Frey
Computer Sciences Department
University of Wisconsin-Madison

jfrey@cs.wisc.edu
http://www.cs.wisc.edu/condor

What’s New in Condor-G

www.cs.wisc.edu/condor

Outline
› What is Condor-G
› Released New Features
› In Development

What Is Condor-G
› Use Condor to run jobs on the Grid
› Uses Globus Toolkit
    GRAM (submit a remote job)
    GASS (transfer job’s files)
› Two components
    Globus Universe
    GlideIn

Globus Universe
› Run a job on a Grid resource
› Features
    Job management
    Fault tolerance
    Credential management
› Roughly equivalent to the vanilla universe

How It Works
[Diagram built up over five slides. Labels, in order of appearance: Schedd, LSF, Condor-G, Grid Resource; 600 Globus jobs; GridManager; JobManager; User Job.]

GlideIn
› Run the Condor daemons on Grid resources as user jobs
› Create your own personal Condor pool from temporarily-acquired Grid resources
› Brings the full power of Condor to the Grid

[Diagram built up over seven slides. Labels, in order of appearance: Globus Grid, PBS, LSF, Condor, Condor-G; 600 Condor jobs; glide-ins.]

Released New Features
› Stuff we’ve added in the past year
› Released and ready for use in Condor 6.6

Globus ASCII Helper Protocol (GAHP)
› Encapsulates Globus libraries in separate process
› Simple ASCII protocol
› Easy for legacy applications to use Globus when they can’t link directly with the libraries
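The idea of hiding a library behind a line-oriented ASCII protocol can be sketched roughly as follows. This is an illustrative toy and not the actual GAHP command set: the command names `PING` and `SUBMIT`, the `S`/`E` reply prefixes, the function `handle_line`, and the hostname are all hypothetical, and the helper is called in-process here instead of running as a separate process.

```python
# Toy line-oriented ASCII helper. All command names and reply formats are
# hypothetical; this is NOT the real GAHP command set.

def handle_line(line, jobs):
    """Parse one ASCII request line and return one ASCII reply line."""
    parts = line.split()
    if not parts:
        return "E empty-request"
    command, args = parts[0].upper(), parts[1:]
    if command == "PING":
        return "S"
    if command == "SUBMIT" and len(args) == 2:
        resource, executable = args
        job_id = len(jobs) + 1          # next sequential id
        jobs[job_id] = (resource, executable)
        return "S {}".format(job_id)
    return "E unknown-or-malformed-command"


if __name__ == "__main__":
    jobs = {}
    requests = ["PING", "SUBMIT gatekeeper.example.org /bin/hostname", "BOGUS"]
    for request in requests:
        print(request, "->", handle_line(request, jobs))
```

Because every request and reply is a single line of text, a client only needs to write to and read from a pipe, which is the property that makes this style of protocol easy to adopt for programs that cannot link against the underlying libraries.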

How It Works - GAHP
[Diagram. Labels: Schedd, Condor-G, Grid Resources, GridManager, GAHP Client, GAHP Server, JobManager (three instances).]

File Staging
› Arbitrary input and output files can be staged to and from execution site
› Same syntax as other universes
› Limitation
    Output files must be explicitly named

File Staging (cont)
› Input, Output, and Error can be URLs
    Files will be transferred directly to and from execution site
› Output and Error can be staged or streamed

Credential Refresh
› Renewed credentials are used by Condor-G and forwarded to the execution site automatically
› No processes need to be restarted

Better Credential Management

› One GridManager process can handle multiple credential files with same subject

› More efficient when you want to have different credential lifetimes for different jobs


Grid Match-Making
› Globus jobs matched with Globus resources by the Condor match-maker using ClassAds
› Current limitation
    User/admin must create resource ads
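As a rough illustration of ClassAd-style match-making, the sketch below filters a list of resource ads with a job's requirements and picks the highest-ranked survivor. It is a simplification that assumes one-sided matching with ads as plain Python dicts; the function `match`, the hostnames, and the Mflops values are invented for the example, whereas the real Condor match-maker evaluates ClassAd expressions from both the job and the resource.

```python
# Simplified, one-sided match-making sketch. Ads are plain dicts; the
# hostnames and Mflops values below are invented for illustration.

def match(job, resources):
    """Return the resource ad that satisfies the job's requirements and
    has the highest rank, or None if no resource qualifies."""
    candidates = [r for r in resources if job["requirements"](r)]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])


resources = [
    {"gatekeeper_url": "grid1.example.org/jobmanager-lsf", "OpSys": "LINUX", "Mflops": 350},
    {"gatekeeper_url": "grid2.example.org/jobmanager-pbs", "OpSys": "LINUX", "Mflops": 500},
    {"gatekeeper_url": "grid3.example.org/jobmanager-pbs", "OpSys": "OTHER", "Mflops": 900},
]

job = {
    "requirements": lambda r: r["OpSys"] == "LINUX",  # filter
    "rank": lambda r: r["Mflops"],                    # prefer faster machines
}

if __name__ == "__main__":
    best = match(job, resources)
    print(best["gatekeeper_url"] if best else "no match")
```

The third resource has the highest Mflops value but fails the requirements, so it is never considered for ranking. The current limitation noted above corresponds to someone having to write the entries of `resources` by hand.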

Fault Tolerance
› Condor-G does its best to automatically recover from failures
› User can guide decisions with job policy expressions
    PeriodicRelease
    GlobusResubmit
    Rematch

PeriodicRelease Expression

› Condor-G puts problematic jobs on hold

› This expression tells Condor-G when to release and retry such jobs


GlobusResubmit Expression
› Tells Condor-G when a problematic job submission should be abandoned
› When this expression becomes true
    Best effort is made to clean up current job submission
    New job submission is attempted

Rematch Expression
› Tells Condor-G when a problematic resource should be abandoned
› Evaluated when GlobusResubmit evaluates to true
› When this expression becomes true
    Best effort is made to clean up current job submission
    Job is rematched
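One way to picture how the three policy expressions interact is the decision function below. It is only a sketch of the behaviour described on the slides: the function `next_action`, the `held` flag, the action names, and the choice to check the hold state first are assumptions made for illustration, not Condor-G's actual control flow.

```python
# Sketch of how the three policy expressions could drive a decision.
# The ordering and the action names are assumptions for illustration.

def next_action(job, periodic_release, globus_resubmit, rematch):
    """Pick the next step for a job given three policy predicates.

    Each predicate takes the job dict and returns True or False.
    """
    if job["held"]:
        # A held job waits until PeriodicRelease says to retry it.
        return "release" if periodic_release(job) else "stay_held"
    if globus_resubmit(job):
        # Rematch is only consulted once GlobusResubmit is true.
        if rematch(job):
            return "cleanup_and_rematch"
        return "cleanup_and_resubmit"
    return "continue"


if __name__ == "__main__":
    job = {"held": False, "failures": 3}
    action = next_action(
        job,
        periodic_release=lambda j: True,
        globus_resubmit=lambda j: j["failures"] >= 3,
        rematch=lambda j: False,
    )
    print(action)
```

In both resubmit branches the current submission is cleaned up first; the difference is whether the new submission goes to the same resource or to a freshly matched one.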

Job Ad Example
GlobusContactString = TARGET.gatekeeper_url
Requirements = TARGET.Arch == "LINUX" && TARGET.OpSys == "LINUX"
Rank = TARGET.Mflops
PeriodicRelease = ((NumMatches < 10) && ((CurrentTime-EnteredCurrentStatus) > 600))
GlobusResubmit = NumSystemHolds >= NumMatches
Rematch = True
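To make the two policy expressions in this job ad concrete, the sketch below restates them as plain Python predicates and evaluates them on sample values. The attribute names come from the job ad; the sample numbers and the function names are invented, and plain Python booleans do not model ClassAd handling of undefined attributes.

```python
# The PeriodicRelease and GlobusResubmit expressions from the job ad,
# restated as Python predicates. Sample attribute values are invented.

def periodic_release(ad, current_time):
    """True when the job has matched fewer than 10 times and has been in
    its current status for more than 600 seconds."""
    return (ad["NumMatches"] < 10
            and (current_time - ad["EnteredCurrentStatus"]) > 600)


def globus_resubmit(ad):
    """True when system holds have caught up with the number of matches."""
    return ad["NumSystemHolds"] >= ad["NumMatches"]


if __name__ == "__main__":
    ad = {"NumMatches": 2, "NumSystemHolds": 1, "EnteredCurrentStatus": 1000}
    print(periodic_release(ad, current_time=1500))  # 500 s in current status
    print(periodic_release(ad, current_time=1601))  # 601 s in current status
    print(globus_resubmit(ad))
```

Read this way, the ad releases a held job after it has spent more than ten minutes on hold, stops doing so after ten matches, and gives up on the current submission once the job has been put on hold by the system at least as many times as it has been matched.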

Hardening
› Regular testing on the CMS testbed with real applications
› Many bugs and integration issues found and fixed
    Hostile Environment

Hostile Environment
› Full disks
› Machine crashes
› File server lock-ups
› Network outages
› Power outages

One CMS Dataset Run
› 300 jobs
› Last fall
    ~50 (16%) of the jobs stalled and required human recovery
    Multiple service restarts (20 daemon crashes over 6 hours)
› Now
    0 jobs stalled
    0 service restarts

Integration Work
› Dozens of Condor-G improvements and bug fixes
› Over 40 Globus “bugzilla” incidents, many with patches
    Globus 2.2.4 has 21 “Advisories” as of 4/11/04
› Use latest version of both

Scalability
› Submitting several hundred jobs produced high load on server
    Machine became unresponsive
    We saw a load average of 1000 at one point
› Caused by Globus JobManager processes

Grid Manager Monitor Agent
› New tool Condor-G can use to reduce this load
› Efficient job status polling program
› Allows Condor-G to shut down JobManager processes when they’re not needed

Load Reduced
› 400 jobs (/bin/sleep 900)
› Without Grid Monitor
    42 hours to complete
    Peak load average of 610
› With Grid Monitor
    40 minutes
    Peak load average of 104
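The size of the improvement is easier to see as ratios. The short calculation below uses only the four figures from this slide; the helper name `ratio` is just for illustration.

```python
# Ratios computed from the slide's figures: 42 hours vs 40 minutes,
# and peak load averages of 610 vs 104.

def ratio(before, after):
    """How many times larger 'before' is than 'after'."""
    return before / after


minutes_without = 42 * 60   # 42 hours expressed in minutes
minutes_with = 40

time_ratio = ratio(minutes_without, minutes_with)
load_ratio = ratio(610, 104)

if __name__ == "__main__":
    print("completion time ratio: {:.0f}x".format(time_ratio))
    print("peak load ratio: {:.1f}x".format(load_ratio))
```

By these figures the 400-job run finished about 63 times faster with the Grid Monitor, and the peak load average fell by a factor of roughly 5.9.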

Miscellaneous Stuff
› Email notification on job completion
› Port range restrictions
› Problem jobs put on hold

In Development
› Stuff we’re currently working on
› Will be released sometime in the next year

Job Policy Expressions
› PeriodicHold
› PeriodicRemove
› OnExitHold
› OnExitRemove

Improved GlideIn
› MDS use optional
    User specifies necessary information
› Automatic setup
    GlideIn job transfers and installs binaries if needed
    Binaries can come from submit machine

New Job Types
› Submit jobs directly to other schedulers (not through Globus)
› Why?
    Richer interface semantics
    Not supported by Globus

NorduGrid
› Grid batch system designed by Nordic countries
› Globus GRAM didn’t offer necessary semantics
    Client control of file staging
    Automatic cleanup of abandoned jobs

Oracle
› Oracle DBMS supports a job queue
    Run this query in 5 hours
    Run this query every Monday
› Condor can add more management features

Generic Job Interface
› Re-arrange GridManager to allow easy addition of new job types
› Define appropriate interface
› Plug-ins for new job types?

Globus Toolkit 3.0
› OGSA (Open Grid Services Architecture)
› Submit jobs to GT3 sites
› Grid Service client interface to Condor-G

Miscellaneous
› Condor-G for Windows
› MyProxy credential management
› URLs for executable, staged files

Thank You!
› Questions?
› Also…
    Condor-G & Globus Q/A session
        • Wednesday, 9am-12pm, room TBA
    E-mail condor-admin@cs.wisc.edu