What’s new in Condor? Condor Week 2006

Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison [email protected] http://www.cs.wisc.edu/condor What’s new in Condor? Condor Week 2006


Page 1: What’s new in Condor? Condor Week 2006

Todd Tannenbaum
Computer Sciences Department
University of Wisconsin-Madison

[email protected]
http://www.cs.wisc.edu/condor

What’s new in Condor?
Condor Week 2006

Page 2: What’s new in Condor? Condor Week 2006


So Todd… where is v6.8?

Well, v6.7 has been a challenge…

Page 3: What’s new in Condor? Condor Week 2006


Page 4: What’s new in Condor? Condor Week 2006

Changes Per Condor Version

[Bar chart: Bugs Fixed and New Features per release, y-axis 0 to 60. Releases shown: 6.7.19, 6.7.16, 6.7.13, 6.7.10, 6.7.7, 6.7.3, 6.7.0, 6.6.10, 6.6.7, 6.6.4, 6.6.1, 6.5.4, 6.5.1, 6.4.7, 6.4.2, 6.3.3, 6.3.0, 6.2.0]

Page 5: What’s new in Condor? Condor Week 2006


Around since the 80’s

Page 6: What’s new in Condor? Condor Week 2006


Around since the 80’s

80’s Mullet Boy

Page 7: What’s new in Condor? Condor Week 2006


100 people surveyed! Favorite “ility” ?

Page 8: What’s new in Condor? Condor Week 2006

100 people surveyed! Favorite “ility” ?

Deployability!

Page 9: What’s new in Condor? Condor Week 2006

Existing Ports

• Digital UNIX 4.0: Alpha
• AIX 5.2 (clipped): PowerPC
• Tru64 5.1 (clipped): Alpha
• HP UNIX 10.20: PA RISC
• HP UNIX 11.00 (clipped using hpux10.20 32 bit): PA RISC
• Irix 6.5 (clipped): SGI
• Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 (clipped): Alpha
• Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3: Intel x86
• Linux 2.4.x (glibc 2.2) - Red Hat 8: Intel x86
• Linux 2.4.x (glibc 2.3) - Red Hat 9: Intel x86
• Enterprise Server 8.1: Intel Itanium
• Solaris 8: Sparc
• Solaris 9: Sparc
• Microsoft Windows 2000 or XP (clipped): Intel x86

Page 10: What’s new in Condor? Condor Week 2006

New Ports

› Introduced in v6.6.x
  • MacOSX (“clipped”): PowerPC
  • Debian Linux 3.1: Intel x86
  • Fedora Core 1: Intel x86
  • Red Hat Enterprise Linux 3: Intel x86
  • SuSE Linux Enterprise Server 8.1: Intel Itanium

› Introduced in v6.7.x
  • AIX 5.1 (“clipped”) on PowerPC
  • Fedora Core 2 on x86
  • Fedora Core 3 on x86
  • SuSE 8.0 (“clipped”) on AMD64
  • Solaris 10 (“clipped”) on Sparc
  • Scientific Linux (Release 303) on x86

› Still to be introduced in v6.7.x (before v6.8.0)
  • HPUX 11i 64-bit pa-risc
  • RHEL 4 on x86
  • “native” 64 bit AMD Linux

Sigh…

“Psilord” – The Condor porting doctor. Talk to him in person tomorrow.

Page 11: What’s new in Condor? Condor Week 2006

Porting Table

› See http://www.cs.wisc.edu/condor/porting/port_table.html

› Highlights
  • Almost every 32-bit Linux flavor as “full”
  • Every other Unix, MacOS and Windows available as “clipped”
  • Solaris 10 and HP-UX 11.x now “clipped”
  • FreeBSD 4 contribution from Yahoo!, added 5 and 6
  • X86_64 Linux: “full” running in the lab

Page 12: What’s new in Condor? Condor Week 2006

Backfill Jobs

› Execute machines will run a locally staged executable when otherwise idle.

› Currently designed for BOINC.

# Turn on backfill functionality, and use BOINC
ENABLE_BACKFILL = TRUE
BACKFILL_SYSTEM = BOINC

# Spawn a backfill job if we've been Unclaimed for more than 5 minutes
START_BACKFILL = $(StateTimer) > (5 * $(MINUTE))

# Evict a backfill job if the machine is busy (based on keyboard
# activity or cpu load)
EVICT_BACKFILL = $(MachineBusy)

Page 13: What’s new in Condor? Condor Week 2006

Joining Condor’s Einstein@Home Compute Team

› If you’re running BOINC backfill jobs in Condor and want to use your cycles to help another UW project, please join the Einstein@Home computation

› Join the “Condor Backfill” team:
  • http://einstein.phys.uwm.edu/team_display.php?teamid=5994
  • http://einstein.phys.uwm.edu/create_account_form.php?teamid=5994

Page 14: What’s new in Condor? Condor Week 2006

More “deployability”

› “Personal” Condor Support on Win32
  • LocalSystem not required

› MSI installer on Win32 (thanks Micron!)

› New tools
  • condor_cold_start and condor_cold_stop
  • Safe, dynamic Condor service deployment.
  • More info @ Research BOF 9am Rm219

Page 15: What’s new in Condor? Condor Week 2006


100 people surveyed! Favorite “ility” ?

Page 16: What’s new in Condor? Condor Week 2006

100 people surveyed! Favorite “ility” ?

Availability!

Page 17: What’s new in Condor? Condor Week 2006

Condor with Firewalls and NATs: GCB in v6.8.0!

[Diagram labels: Server app / GCB layer / TCP/IP; Client app / GCB layer / TCP/IP; Relay point; listen, accept, connect, translate]

Page 18: What’s new in Condor? Condor Week 2006

Job Progress continues if connection is interrupted

› Now for Vanilla, Java, and Grid universe jobs, Condor supports reestablishment of the connection between the submitting and executing machines.
  • If network outage between execute and submit machine
  • If submit machine restarts
  • Grid Universe was tricky…

› To take advantage of this feature, put the following line into your job’s submit description file:

job_lease_duration = <N seconds>

For example:

job_lease_duration = 1200
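The lease idea behind this setting can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Condor's implementation: the `JobLease` class, its method names, and the use of plain integer timestamps are all hypothetical. It only illustrates that a job can keep running through an outage as long as the lease is renewed before it runs out.

```python
class JobLease:
    """Toy model of a job lease (hypothetical names, not Condor code)."""

    def __init__(self, duration, now):
        # duration plays the role of job_lease_duration, in seconds
        self.duration = duration
        self.expires_at = now + duration

    def renew(self, now):
        # Called whenever the submit side successfully contacts the execute side
        self.expires_at = now + self.duration

    def expired(self, now):
        # Once the lease has run out, the execute side may give up on the job
        return now >= self.expires_at


lease = JobLease(duration=1200, now=0)
print(lease.expired(600))    # outage still shorter than the lease
lease.renew(600)             # connection re-established at t=600
print(lease.expired(1700))   # renewed lease is still valid
print(lease.expired(1800))   # no contact for 1200 s after the renewal
```

With the example value of 1200 seconds, an interruption shorter than 20 minutes leaves the lease intact in this model.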

Page 19: What’s new in Condor? Condor Week 2006

Job Progress continues if submit machine fails

› Condor can now support a submit machine “hot spare” (schedd failover)
  • If your submit machine A is down for longer than N minutes, a second machine B can take over
  • Requires shared filesystem between machines A and B

Page 20: What’s new in Condor? Condor Week 2006

Central Manager Failover

› Condor Central Manager has two services

› condor_collector
  • Now a list of collectors is supported

› condor_negotiator (matchmaker)
  • If it fails, election process, another takes over
  • Accounting state is periodically replicated
  • Contributed technology from Technion

Page 21: What’s new in Condor? Condor Week 2006


Reliability, cont.

› Time shifts

› Quill

› Closing windows of vulnerability

Page 22: What’s new in Condor? Condor Week 2006


100 people surveyed! Favorite “ility” ?

Page 23: What’s new in Condor? Condor Week 2006

100 people surveyed! Favorite “ility” ?

Lightweight?

Page 24: What’s new in Condor? Condor Week 2006

100 people surveyed! Favorite “ility” ?

Lightweight?

Page 25: What’s new in Condor? Condor Week 2006


100 people surveyed! Favorite “ility” ?

Page 26: What’s new in Condor? Condor Week 2006

100 people surveyed! Favorite “ility” ?

Functionality!

Page 27: What’s new in Condor? Condor Week 2006

Security

› Common Authentication Methods between Condor on Unix and Win32
  • Kerberos 1.4
      • Additional hopeful benefit: Authentication against MS Active Directory!
  • SSL
  • Password (shared secret)

› Starter only runs known executables

› More powerful, unified map file(s)

› GSI credentials delegated

Page 28: What’s new in Condor? Condor Week 2006

With Condor on Win32, it would be nice if …

› My jobs could access my files just like the condor_shadow can

› I didn’t have to tie my execute machines to a single account

› I didn’t have to run condor_store_cred from every machine where my credential is needed

(thank you Optena)

Page 29: What’s new in Condor? Condor Week 2006

The Windows CredD

› A centralized repository for user passwords

C:\>condor_store_cred add
Account: gquinn@CROW
Enter password:
Operation succeeded.

[Diagram labels: credd (stored passwords: y0urs, myp4sswd); “store password”; <password>]

Page 30: What’s new in Condor? Condor Week 2006

The Windows CredD

› Submit machines can use the CredD to impersonate the user in the shadow

[Diagram labels: credd (stored passwords: y0urs, myp4sswd); schedd; shadow; “fetch password”; <password>]

Page 31: What’s new in Condor? Condor Week 2006

The Windows CredD

› Execute machines can use the CredD to run jobs as the submitting user!

[Diagram labels: credd (stored passwords: y0urs, myp4sswd); starter; condor_exec.exe; “fetch password”; <password>]

Page 32: What’s new in Condor? Condor Week 2006

Running Jobs as Submitting User

› In submit file:

Run_job_as_owner = true

› In config file on submit and execute nodes:

CREDD_HOST = vault.cs.wisc.edu
STARTER_ALLOW_RUNAS_OWNER = True
CREDD_CACHE_LOCALLY = True

Page 33: What’s new in Condor? Condor Week 2006

Some Condor APIs

› Command Line tools
  • condor_submit, condor_q, etc
  • -format, -constraint, -xml

› Condor Perl Module

› Chirp

› Checkpoint Library API

› MW --- improved!

› DRMAA (Works w/ Win32, on SourceForge)

› Condor Grid ASCII Protocol (GAHP)

› Web Service Interface
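As a rough illustration of using the command-line tools as an API, the Python sketch below builds a condor_q invocation from the -constraint and -format options listed above. The helper name `condor_q_command` and the example attribute names are assumptions made for illustration; the sketch only assembles the argument list and does not run anything.

```python
def condor_q_command(constraint, fields):
    """Build an argument list for condor_q (hypothetical helper).

    constraint: a ClassAd expression selecting jobs
    fields: (printf-style format, attribute name) pairs, one per -format option
    """
    cmd = ["condor_q", "-constraint", constraint]
    for fmt, attr in fields:
        cmd += ["-format", fmt, attr]
    return cmd


# Example: show the job id of every job matching the constraint.
# Raw strings keep the backslash-n as typed, as it would be on a shell line.
cmd = condor_q_command(
    "JobStatus == 2",
    [(r"%d.", "ClusterId"), (r"%d\n", "ProcId")],
)
print(" ".join(cmd))
```

On a machine with Condor installed, the list could be handed to `subprocess.run(cmd, capture_output=True, text=True)` and the output parsed line by line.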

Page 34: What’s new in Condor? Condor Week 2006

DRMAA

› Distributed Resource Management Application API (DRMAA)
  • GGF Working Group
  • An API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems

› An API with C and Java bindings, not a protocol

› Scope
  • Does: job submission, monitoring, control, final status
  • Does not: file staging, reservations, security,

Page 35: What’s new in Condor? Condor Week 2006


Condor GAHP

› The Condor GAHP is a relatively low-level protocol based on simple ASCII messages through stdin and stdout

› Supports a rich feature set including two-phase commits, transactions, and optional asynchronous notification of events

Page 36: What’s new in Condor? Condor Week 2006

GAHP, cont.

Example:

R: $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
S: GRAM_PING 100 vulture.cs.wisc.edu/fork
R: E
S: RESULTS
R: E
S: COMMANDS
R: S COMMANDS GRAM_JOB_CANCEL GRAM_JOB_REQUEST GRAM_JOB_SIGNAL GRAM_JOB_STATUS GRAM_PING INITIALIZE_FROM_FILE QUIT RESULTS VERSION
S: VERSION
R: S $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
S: INITIALIZE_FROM_FILE /tmp/grid_proxy_554523.txt
R: S
S: GRAM_PING 100 vulture.cs.wisc.edu/fork
R: S
S: RESULTS
R: S 0
S: RESULTS
R: S 1
R: 100 0
S: QUIT
R: S
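A client talking this protocol has to split each line into fields. The Python sketch below assumes, based on the `NCSA\ CoG\ Gahpd` token in the example, that fields are separated by single spaces and that a backslash escapes a space inside a field. The function names are hypothetical, and the sketch covers only this splitting rule, not the full protocol.

```python
def split_gahp_line(line):
    """Split one GAHP-style line into fields, honoring backslash escapes."""
    fields, current, escaped = [], [], False
    for ch in line.rstrip("\r\n"):
        if escaped:
            current.append(ch)      # take the escaped character literally
            escaped = False
        elif ch == "\\":
            escaped = True          # next character is part of the field
        elif ch == " ":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def escape_gahp_field(field):
    """Escape backslashes and spaces so the field survives splitting."""
    return field.replace("\\", "\\\\").replace(" ", "\\ ")


reply = "S $GahpVersion: 1.0.0 Nov 26 2001 NCSA\\ CoG\\ Gahpd $"
print(split_gahp_line(reply))
```

The first field of a reply is the status flag (`S` or `E` in the example), so a caller would typically branch on `fields[0]` before looking at the rest.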

Page 37: What’s new in Condor? Condor Week 2006

Web Service Interfaces

› SOAP over http or https to the Condor daemons

› Use any language or platform (where you can find a decent SOAP library)

› Functionality Exposed in current release
  • Submit jobs
  • Retrieve job output
  • Remove/hold/release jobs
  • Query machine status (fetch ads from collector)
  • Query job status (fetch ads from the schedd)

Page 38: What’s new in Condor? Condor Week 2006

Getting machine status via SOAP (in Java with Axis)

locator = new CondorCollectorLocator();
collector = locator.getcondorCollector(new URL("http://machine:port"));
ads = collector.queryStartdAds("Memory>512");

Because we give you WSDL information you don’t have to write any of these functions.

Page 39: What’s new in Condor? Condor Week 2006

More Functionality changes..

› FINALLY, clean/consistent cross-platform quoting rules for arguments and environment variables (see condor_submit man page)

› Schedd can run HawkEye modules, just like the Startd
  • Enables monitoring on the submit machine

› condor_history : now faster than a snail, and cleans up droppings.

› DeferralTime, DeferralWindow
  • Coordinated starts

› BIND_ALL_INTERFACES in config file

› WANT_REMOTE_IO in job ClassAd

Page 40: What’s new in Condor? Condor Week 2006

ClassAd Functions in Condor!

› Conditionals
  • IfThenElse(condition,then,else)

› String functions
  • Strcat(), strcmp(), toUpper(), etc.

› StringList functions
  • Example of a “string list” (CSV style)
      • Mylist = “Joe, Jon, Jeff, Jim, Jake”
  • StrListContains(), StrListAppend(), StrListRemove(), etc.

› Others
  • Regular expressions, arithmetic, etc…
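To make the “string list” idea concrete, here is a small Python sketch of what such functions do with a comma-separated value. The Python function names are hypothetical stand-ins chosen to echo the slide; they are not the ClassAd implementation, and the exact ClassAd names and signatures should be taken from the Condor manual.

```python
def str_list_items(value):
    """Split a CSV-style string list into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def str_list_contains(value, member):
    return member in str_list_items(value)


def str_list_append(value, member):
    return ", ".join(str_list_items(value) + [member])


def str_list_remove(value, member):
    return ", ".join(item for item in str_list_items(value) if item != member)


def if_then_else(condition, then_value, else_value):
    """Conditional in the spirit of IfThenElse(condition,then,else)."""
    return then_value if condition else else_value


mylist = "Joe, Jon, Jeff, Jim, Jake"
print(str_list_contains(mylist, "Jeff"))
print(str_list_append(mylist, "Bob"))
print(str_list_remove(mylist, "Jon"))
print(if_then_else(str_list_contains(mylist, "Bob"), "member", "not a member"))
```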

Page 41: What’s new in Condor? Condor Week 2006

Accounting Groups and Group Quota Support

› Account Group (w/ CORE Feature Animation)

› Account Group Quota (inspiration CDF @ Fermi)
  • Sample Problem: Cluster w/ 500 nodes, Chemistry Dept purchased 100 of them, Chemistry users must always be able to use them
  • Could use Machine Rank…
      • but this ties to specific machines
  • Or could use new group support
      • Each group can be given a quota in config file
      • Job ads can specify group membership
      • Group quotas are satisfied first
      • Accounting by user and by group
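The “group quotas are satisfied first” rule can be illustrated with a toy allocator for the 500-node sample problem. This Python sketch makes several assumptions that are not from the slides: the function name `allocate_slots` is hypothetical, demand is given as a simple count of idle jobs per group, and leftover nodes are handed out in dictionary order rather than by Condor's real priority-based accounting.

```python
def allocate_slots(total_slots, quotas, demand):
    """Toy two-pass allocation: quotas first, then leftovers.

    quotas: group name -> guaranteed number of slots
    demand: group name -> number of slots the group's jobs could use
    """
    allocation = {}
    remaining = total_slots

    # Pass 1: satisfy each group's quota, up to what it actually needs.
    for group, quota in quotas.items():
        granted = min(quota, demand.get(group, 0), remaining)
        allocation[group] = granted
        remaining -= granted

    # Pass 2: hand leftover slots to anyone with unmet demand
    # (simple first-come order here, an assumption of this sketch).
    for group, wanted in demand.items():
        unmet = wanted - allocation.get(group, 0)
        extra = min(unmet, remaining)
        allocation[group] = allocation.get(group, 0) + extra
        remaining -= extra

    return allocation


# Chemistry bought 100 of the 500 nodes; physics could fill the whole cluster.
print(allocate_slots(500, {"chemistry": 100},
                     {"physics": 600, "chemistry": 150}))
```

Unlike Machine Rank, nothing here ties Chemistry to particular machines: the guarantee is a count of nodes, which is the point the slide makes.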

Page 42: What’s new in Condor? Condor Week 2006


100 people surveyed! Favorite “ility” ?

Page 43: What’s new in Condor? Condor Week 2006

100 people surveyed! Favorite “ility” ?

Universability!

Page 44: What’s new in Condor? Condor Week 2006

Grid Universe

› With new Grid Universe, always specify a ‘gridtype’. So the old “globus” Universe is now declared as:

universe = grid
gridtype = gt2

› Other gridtypes?
  • GT2 (Globus Toolkit 2)
  • GT3 (Globus Toolkit 3.2)
  • GT4 (Globus Toolkit 3.9.5+)
  • UNICORE
  • Nordugrid
  • PBS (OpenPBS, PBSPro – technology from INFN)
  • LSF (Platform LSF – technology from INFN)
  • CONDOR (thanks gLite!)

[Slide callouts: ‘Condor-G’, ‘Condor-C’]

Page 45: What’s new in Condor? Condor Week 2006

Other Grid Universe improvements

› Condor-G has support for credential refresh via the MyProxy Online Credential Management in NMI
  • http://grid.ncsa.uiuc.edu/myproxy (both GT2 and GT4)

› GT4 : we start a GridFTP server behind the scenes
  • GridFTP server bundled w/ Condor nowadays

› Some functionality present in Condor-G added to Condor-C
  • Forwarding of refreshed credentials (EGEE)
  • GSI authentication support
  • Cleaner ClassAd representation (URL)

Page 46: What’s new in Condor? Condor Week 2006

Parallel Universe

› Replaces the “MPI” universe

› Allows running arbitrary programs that need to gang-schedule multiple machines
  • MPICH, LAM, …
  • FT-MPICH (Seoul National Univ)
  • Great for testing environments

Page 47: What’s new in Condor? Condor Week 2006

Hey Jobs! We’re watching you!

› Local Universe
  • Just like Scheduler Universe, but there is a condor_starter
  • All advantages of the starter

[Diagram labels: Submit side: schedd, starter, job; Execute side: startd, starter, job; caption: “Hey, job, behave or else!”]

Page 48: What’s new in Condor? Condor Week 2006


100 people surveyed! Favorite “ility” ?

Page 49: What’s new in Condor? Condor Week 2006

100 people surveyed! Favorite “ility” ?

Scalability!

Page 50: What’s new in Condor? Condor Week 2006

Faster Negotiation

› SIGNIFICANT_ATTRIBUTES determined automatically
  • Job attributes AutoClusterId and AutoClusterAttributes
  • Rounding of Attributes

› Schedd uses non-blocking TCP connects to the startd

› Negotiator caching

› Collector Forks for queries

› More coming…
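The auto-clustering idea above can be sketched as grouping jobs by the values of their significant attributes, so a matchmaker only needs to consider one representative per group. The Python below is an illustration under assumptions, not Condor's algorithm: the function names, the job dictionaries, and the round-up-to-a-multiple rule standing in for “rounding of attributes” are all hypothetical.

```python
from collections import defaultdict


def round_up(value, step):
    """Round value up to the next multiple of step (assumed rounding rule)."""
    return -(-value // step) * step


def auto_cluster(jobs, significant_attributes, rounding=None):
    """Group job ids by the (optionally rounded) values of the given attributes."""
    rounding = rounding or {}
    clusters = defaultdict(list)
    for job in jobs:
        key = []
        for attr in significant_attributes:
            value = job.get(attr)
            if attr in rounding and value is not None:
                # Rounding lets nearly identical jobs share one cluster
                value = round_up(value, rounding[attr])
            key.append(value)
        clusters[tuple(key)].append(job["id"])
    return dict(clusters)


jobs = [
    {"id": "1.0", "ImageSize": 900, "Owner": "alice"},
    {"id": "1.1", "ImageSize": 950, "Owner": "alice"},
    {"id": "2.0", "ImageSize": 1800, "Owner": "alice"},
]
print(auto_cluster(jobs, ["ImageSize", "Owner"], rounding={"ImageSize": 1000}))
```

Without rounding, the three jobs would form three clusters; with it, the first two collapse into one, which is where the negotiation savings would come from in this model.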

Page 51: What’s new in Condor? Condor Week 2006

Scalability, cont.

› Knobs
  • GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE
  • GRIDMANAGER_MAX_PENDING_SUBMIT_PER_RESOURCE
  • GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE

› One instance of gridmanager handles multiple jobs (all from a given user)

› One instance of condor_dagman can run multiple dags
  • Is the Shadow next?

› Buffered I/O read on schedd restart (thanks Yahoo!)

Page 52: What’s new in Condor? Condor Week 2006

Quill

› Job ClassAds information mirrored into an RDBMS

› Both active jobs and historical jobs

› Benefits BOTH scalability and accessibility

[Diagram labels: Master; Startd …; Schedd; Job Queue log; Quill; RDBMS; Queue + History Tables]

Page 53: What’s new in Condor? Condor Week 2006


Version 6.9.x

Page 54: What’s new in Condor? Condor Week 2006

What’s brewing for after v6.8.0?

› More data, data, data
  • Stork is distributed now in v6.7.x, incl DAGMan support – next it is NeST’s turn.
  • NeST to manage Condor spool files, ckpt servers
      • GridFTP used to move the bits
  • Quill++ and CondorDB goodness

› Virtual Machines (and the future of Standard Universe)
  • Research BOF w/ Jaeyoung Moon, rm219 9am

Page 55: What’s new in Condor? Condor Week 2006

SOAP API

› First focus will be to finish interfaces used by all command-line tools
  • condor_userprio, condor_cod, …

› Explore message-based security
  • Ian Alderman’s work w/ signed ClassAd attributes

Page 56: What’s new in Condor? Condor Week 2006

Privilege Separation

› No more root in the Condor daemons!

› Instead, a small component will be responsible for privileged operations

› Initial exploratory work w/ GNU userv (Cambridge)

› Now focusing on integration w/ glexec (gLite / nikhef)

Page 57: What’s new in Condor? Condor Week 2006

“The Year of the Schedd”

› Schedd is juggling too many tasks
  • Break it down into smaller pieces, more modular

› Scalability
  • All non-blocking I/O
  • Hierarchy of schedds

› Schedd-on-the-side
  • “Scheduler booster”
  • Transform & delegate job classads to different grids
  • A “job router” for a grid

Page 58: What’s new in Condor? Condor Week 2006


Thank you!