Condor Tutorial for Administrators

INFN-Bologna, 6/28/99

Derek Wright
Computer Sciences Department
University of Wisconsin-Madison

wright@cs.wisc.edu

Conventions Used In This Presentation

A slide with an all-yellow background is the beginning of a new “chapter”
• The slides after it will describe each entry on the yellow slide in great detail

A Condor tool that users would use will be in red italics

A ClassAd attribute name will be in blue

A UNIX shell command or file name will be in courier font

What is Condor?

A system for “High-Throughput Computing”

Lots of jobs over a long period of time, not a short burst of “high-performance”

Condor manages both resources (machines) and resource requests (jobs)

Supports additional features for jobs that are re-linked with Condor libraries:
• checkpointing
• remote system calls

What’s Condor Good For?

Managing a large number of jobs
• You specify the jobs in a file and submit them to Condor, which runs them all and sends you email when they complete
• Condor maintains a persistent job queue
• Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion

What’s Condor Good For? (continued)

Managing a large number of machines
• Condor daemons run on all the machines in your pool and are constantly monitoring machine state

• You can query Condor for information about your machines

• Condor handles all background jobs in your pool with minimal impact on your machine owners

What is a Condor Pool?

“Pool” can be a single machine, or a group of machines

Determined by a “central manager” - the matchmaker and centralized information repository

Each machine runs various daemons to provide different services, either to the users who submit jobs, the machine owners, or the pool itself

The Condor Daemons

condor_master (controls everything else)

condor_startd (executing jobs)
• condor_starter (helper for starting jobs)

condor_schedd (submitting jobs)
• condor_shadow (submit-side helper)

condor_collector (only on Central Manager)

condor_negotiator (only on CM)

You only have to run the daemon(s) for the service(s) you want to provide

condor_master

Starts up all other Condor daemons

If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator

Checks the time stamps on the binaries it is configured to spawn, and if new binaries appear, the master will gracefully shut down the currently running version and start the new version

condor_master (cont’d)

Provides access to many remote administration commands:
• condor_reconfig
• condor_restart, condor_off, condor_on

Default server for many other commands:
• condor_config_val, etc.

Periodically runs condor_preen to clean up any files Condor might have left on the machine (the rest of the daemons clean up after themselves, as well)

condor_startd

Represents a machine to the Condor pool

Enforces the wishes of the machine owner (the owner’s “policy”)

Responsible for starting, suspending, and stopping jobs

Spawns the appropriate condor_starter, depending on the type of job

Provides other administrative commands (for example, condor_vacate)

condor_starter

Spawned by the condor_startd to handle all the details of starting and managing the job (for example, transferring the job’s binary to the executing machine or sending back exit status)

On SMP machines, you get one condor_starter per CPU

For PVM jobs, the starter also spawns a PVM daemon (condor_pvmd)

condor_schedd

Represents users to the Condor pool

Maintains persistent queue of jobs

Responsible for contacting available machines and spawning waiting jobs

Services most user commands:
• condor_submit
• condor_rm
• condor_q

condor_shadow

Represents the job on the submit machine

Services requests from “standard” jobs for “remote system calls”, including all file I/O

Is responsible for making decisions on behalf of the job (for example, where to store the checkpoint file)

There will be one condor_shadow process running on your submit machine for each currently running Condor job

condor_shadow (cont’d)

The shadow doesn’t put much load on your submit machine:
• Almost always blocked waiting for requests from the job or doing I/O
• Relatively small memory footprint

Still, you can limit the impact of the shadows on a given submit machine:
• They can be started by Condor with a “nice-level” that you configure (renice)
• Can put a limit on the total number of shadows running on a machine

condor_collector

Collects information from all other Condor daemons in the pool

Each daemon sends a periodic update called a “ClassAd” to the collector

Services queries for information:
• Queries from other Condor daemons
• Queries from users (condor_status)

condor_negotiator

Performs “matchmaking” in Condor

Gets information from the collector about all available machines and all idle jobs

Tries to match jobs with machines that will serve them

Both the job and the machine must satisfy each other’s requirements (this is called “2-way matching”)
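
The idea of 2-way matching can be illustrated with a small sketch. This is not Condor's matchmaking code: it assumes each ad is a plain Python dict with a hypothetical "requirements" predicate, and the attribute names and values (Arch, Memory, Owner, ImageSize) are chosen only for illustration.

```python
# Illustrative sketch only: ads are plain dicts, and "requirements" is a
# hypothetical predicate that inspects the other side's ad.
def two_way_match(job_ad, machine_ad):
    """Return True only if each ad satisfies the other's requirements."""
    job_ok = job_ad["requirements"](machine_ad)      # job accepts machine
    machine_ok = machine_ad["requirements"](job_ad)  # machine accepts job
    return job_ok and machine_ok


# Example ads (all attribute names and values are assumptions).
machine = {
    "Arch": "INTEL",
    "Memory": 128,
    "requirements": lambda job: job["Owner"] != "guest",
}
job = {
    "Owner": "wright",
    "ImageSize": 64,
    "requirements": lambda m: m["Arch"] == "INTEL" and m["Memory"] >= 64,
}
guest_job = dict(job, Owner="guest")
```

Here two_way_match(job, machine) succeeds, while guest_job is rejected because the machine's own requirements refuse it, even though the machine satisfies the job.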

Layout of a Personal Condor Pool

[Figure: a single Central Manager machine running the master, collector, negotiator, schedd, and startd daemons. Legend: ClassAd communication pathway; process spawned.]

Layout of a General Condor Pool

[Figure: a Central Manager (master, collector, negotiator, schedd, startd), one Submit-Only machine (master, schedd), two Execute-Only machines (master, startd), and two Regular Nodes (master, schedd, startd). Legend: ClassAd communication pathway; process spawned.]

What happens when you submit a job to Condor?

condor_submit contacts condor_schedd and adds job to the job queue

condor_schedd sends ClassAd to the condor_collector requesting a machine

condor_negotiator matches the request with an available machine

condor_schedd “claims” the machine and spawns a condor_shadow

What happens when you submit a job to Condor? (part 2)

condor_shadow contacts condor_startd and requests the appropriate condor_starter

condor_starter actually spawns application, and connects it to the condor_shadow

condor_startd monitors the machine and waits for commands

Either the application completes, or the condor_startd forces it to either suspend or vacate

Condor System Structure

Considerations for Installing a Condor Pool

What machine should be your central manager?

Does your pool have a shared file system?

Where should you install your Condor binaries and configuration files?

Where should you put the local directories for each machine?

Will you start the daemons as root or as some other user?

What machine should be your central manager?

The central manager (CM) is very important for the proper functioning of your pool

You want a machine that will be online all the time, or will be rebooted quickly if there is a problem

If the CM crashes, jobs that are currently matched will continue to run, but new jobs will not be matched

A good network connection helps

Does your pool have a shared file system?

A shared file system is essential if you wish to run “vanilla” jobs

It can also make administration of a large pool easier

NFS works better with Condor than AFS, since Condor does not manage AFS tokens (yet), though either one will work

Where should you install your binaries and configuration files?

Putting the config files on a shared file system makes administration much easier

Putting the binaries on a shared file system makes installing a new version easier, but it can be less stable (since problems with the network can cause daemons to crash)

condor_master on the local disk is a good compromise

Where should you put the local directories for each machine?

You need a fair amount of disk space in the spool directory for each condor_schedd (to hold the job queue and the binaries for each job submitted).

The execute directory is used by the condor_starter to hold the binary for any Condor job running on a machine

The log directory is used by all daemons… more space = more saved info

Will you start the daemons as root or some other user?

If you have root access, we recommend you start the daemons as root
• More secure
• Less confusion for users

If you don’t have root access, Condor will still work; users just have to take some extra steps to submit jobs

Can have “personal Condor” installed - only you can submit jobs

Basic Installation Procedure

1) Decide what version and parts of Condor to install and download them

2) Install the “release directory” - all the Condor binaries and libraries

3) Setup the Central Manager

4) (optional) Setup Condor on any other machines you wish to add to the pool

5) Spawn the Condor daemons

The Different Versions of Condor

We distribute two versions of Condor:
• Stable Series
– Heavily tested, recommended for use
– 2nd number of version string is even (6.0.3)
• Development Series
– Latest features, not necessarily well-tested
– 2nd number of version string is odd (6.1.8)
– Not recommended unless you know what you are doing and/or need a new feature
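
As a quick illustration of the numbering rule, the following sketch classifies a version string by the parity of its second number. The function name condor_series is hypothetical and not part of Condor; it only encodes the even/odd convention stated above.

```python
def condor_series(version):
    """Classify a version string such as "6.0.3" by its second number.

    Even second number -> stable series, odd -> development series.
    (Hypothetical helper; it only encodes the convention on this slide.)
    """
    parts = version.split(".")
    if len(parts) < 2 or not parts[1].isdigit():
        raise ValueError(f"unexpected version string: {version!r}")
    return "stable" if int(parts[1]) % 2 == 0 else "development"
```

With the two examples from the slide, condor_series("6.0.3") gives "stable" and condor_series("6.1.8") gives "development".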

Condor Versions (cont’d)

All daemons advertise a CondorVersion attribute in the ClassAd they publish

You can also view the version string by running ident on any Condor binary

All parts of Condor on a single machine should run the same version!

Machines in a pool can usually run different versions and communicate with each other

It will be made very clear when a version is incompatible with older versions

Downloading Condor

Go to http://www.cs.wisc.edu/condor/

Fill out the form and download the different pieces you need

Normally, you want the full stable release

There are also “contrib” modules for non-standard parts of Condor, or individual pieces of the development release that you might need (e.g. SMP support)

Distributed as compressed “tar” files

Once you download, unpack them

Install the Release Directory

In the directory where you unpacked the tar file, you’ll find a release.tar file with all the binaries and libraries

condor_install will install this as the release directory for you

In a pool with a shared release directory, you should run condor_install somewhere with write access to the shared directory

You need a separate release directory for each platform!

Setup the Central Manager

You must configure Condor specially on your central manager, so that it knows it needs to spawn the additional daemons

Easiest way to do this is by using condor_install

There’s a special option for setting up a central manager

Setup any other machines you wish to add to the pool

If you have a shared file system, once you run condor_install on your file server (and again on your central manager if it’s a separate machine) you can just run condor_init on any other machine you wish to add to your pool

Without a shared file system, you must run condor_install on each host

Spawn the Condor daemons

Once Condor is configured and set up, you just have to spawn the condor_master on each host to “start” Condor

You should start up Condor on the Central Manager first

The user you spawn the condor_master as makes a big difference: root vs. “condor” vs. another user

Introduction to Condor’s Configuration Files

Condor’s configuration is a concatenation of multiple files, in order - definitions in later files overwrite previous definitions

Layout and purpose of the different files:
• Global config file
• Other shared files
• Local config file
• Root config file (optional)

Global Config File

All shared settings across your entire pool

Found either in file pointed to with the CONDOR_CONFIG environment variable, /etc/condor/condor_config, or the home directory of the “condor” user

Most settings can be in this file

Only works as a “global” file if it is on a shared file system

Other shared files

You can configure a number of other shared config files:
• files to hold common settings to make it easier to maintain (for example, all policy expressions, which we’ll see later)
• platform-specific config files

Local config file

Any machine-specific settings
• local policy settings for a given owner
• different daemons to run (for example, on the Central Manager!)

Can either be on the local disk of each machine, or have separate files in a shared directory, each named by hostname

Root config file (optional)

You can specify a “root” config file, which is always processed after all other files

This allows root to specify certain settings which cannot be changed by another user (like the path to the Condor daemons)

Only useful if daemons are started as root but someone else has access to edit Condor’s config files

Basic syntax

# is a comment

A “\” at the end of a line is a line-continuation, so both lines are treated as one big entry

All names are case insensitive

“Macros” have the form:
• Attribute_Name = value

You reference other macros with:
• A = $(B)
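
To make these rules concrete, here is a minimal sketch of a reader for this macro syntax. It is not Condor's parser: the function names are hypothetical, it handles only "Name = value" macros, and it assumes that a later definition of the same name replaces an earlier one, as happens when config files are concatenated in order.

```python
import re


def logical_lines(text):
    """Join lines that end with a backslash into one logical entry."""
    buf = ""
    for raw in text.splitlines():
        piece = raw.strip()
        if piece.endswith("\\"):
            # Line continuation: keep accumulating.
            buf += piece[:-1].rstrip() + " "
            continue
        yield buf + piece
        buf = ""
    if buf:
        yield buf.strip()


def parse_config(text):
    """Collect macros; names are case insensitive, later values win."""
    macros = {}
    for line in logical_lines(text):
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and non-macro lines
        name, value = line.split("=", 1)
        macros[name.strip().lower()] = value.strip()
    return macros


def expand(macros, name, depth=0):
    """Return the macro's value with $(OTHER) references substituted."""
    if depth > 20:
        raise ValueError("macro references nested too deeply (cycle?)")
    value = macros[name.lower()]
    return re.sub(
        r"\$\(([^)]+)\)",
        lambda m: expand(macros, m.group(1).strip(), depth + 1),
        value,
    )
```

Passing the text of a global file followed by the text of a local file to parse_config mimics the concatenation described earlier: a name defined in both ends up with the local value. In this sketch, referencing an undefined macro raises KeyError.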

Macros vs. Expressions

“Expressions” have the form:
• Expression_Name : value

Expressions have special meaning to Condor (in particular, the policy expressions)

You will never need to add new expressions

You might want to add macros to advertise new properties about your machines to the Condor pool (for example, NetworkBandwidth)

Any questions?

Nothing is too basic

If I was unclear, you probably are not the only person who doesn’t understand, and the rest of the day will be even more confusing

Hands-On Exercise #1

Installing a “Personal” Condor Pool

Hands-On Exercise #1

Login to your machine as user “condor”

You will see two windows:
• Netscape, with instructions
• An xterm, where you execute commands

To begin, click on Personal Condor

Please follow the directions carefully

Any lines beginning with % are commands that you should execute in your xterm

If you accidentally exit Netscape, click on “Tutorial” in the Start menu

Lunch break

Please be back by 13:30

Welcome Back

Host/IP Security in Condor

You can configure each machine in your pool to allow or deny certain actions from different groups of machines:
• “read” access - querying information
– condor_status, condor_q, etc.
• “write” access - updating information
– condor_submit, adding a node to the pool, etc.
• “administrator” access
– condor_on, off, reconfig, restart...
• “owner” access
– Things a machine owner can do (vacate)

Setting up Host/IP-address Security in Condor (part 1)

To configure, you list what hosts are allowed or denied to perform each action
• If you list hosts that are allowed, everything else is denied
• If you list hosts that are denied, everything else is allowed
• If you list both, only hosts that are listed in “allow” but not in “deny” are allowed
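
The three rules above can be written down as a short decision function. This is a sketch rather than Condor's implementation: is_allowed is a hypothetical name, hosts are compared by exact name only, and the behaviour when neither list is given (allow everything) is an assumption that the slide does not cover.

```python
def is_allowed(host, allow=None, deny=None):
    """Apply the allow/deny rules to one host (exact-name matching only)."""
    allow = set(allow or [])
    deny = set(deny or [])
    if host in deny:
        return False          # an explicit deny always wins
    if allow:
        return host in allow  # an allow list denies everything else
    # Only a deny list (or, by assumption, no lists at all): allowed.
    return True
```

For example, with allow=["a", "b"] and deny=["b"], only host "a" is permitted, which is the “listed in allow but not in deny” case.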

Setting up Host/IP-address Security in Condor (part 2)

There are many possibilities for specifying which hosts are allowed or denied:
• Host names, domain names
• IP addresses, subnets
• Wildcards
– ‘*’ can be used anywhere (once) in a host name (for example, “infn-corsi*.corsi.infn.it”)
– ‘*’ can be used at the end of any IP address (e.g. “128.105.101.*” or “128.105.*”)
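
A single-wildcard match of this kind is easy to sketch. The function below is a hypothetical helper, not Condor code: it accepts at most one ‘*’, compares names without regard to case (an assumption based on host names being case insensitive), and treats an IP address simply as a string.

```python
def matches(pattern, host):
    """Match a host name or IP address against a pattern with at most one '*'."""
    if pattern.count("*") > 1:
        raise ValueError("only one '*' is supported in this sketch")
    pattern = pattern.lower()
    host = host.lower()
    if "*" not in pattern:
        return pattern == host
    # Split around the single wildcard and compare both ends.
    prefix, suffix = pattern.split("*")
    return (len(host) >= len(prefix) + len(suffix)
            and host.startswith(prefix)
            and host.endswith(suffix))
```

With the patterns from the slide, “infn-corsi*.corsi.infn.it” matches a host such as infn-corsi12.corsi.infn.it (a made-up example name), and “128.105.*” matches any address beginning with 128.105.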

Setting up Host/IP-address Security in Condor (part 3)

Can define values that affect all daemons:
• HOSTALLOW_WRITE, HOSTDENY_READ, HOSTALLOW_ADMINISTRATOR, etc.

Can define daemon-specific settings:
• HOSTALLOW_READ_SCHEDD, HOSTDENY_WRITE_COLLECTOR, etc.

Write access doesn’t automatically provide read access: you must grant both!

Example Host/IP Security Settings

HOSTALLOW_WRITE = *.infn.it

HOSTALLOW_ADMINISTRATOR = infn-corsi1*, \
    $(CONDOR_HOST), axpb07.bo.infn.it, \
    $(FULL_HOSTNAME)

HOSTDENY_ADMINISTRATOR = infn-corsi15

HOSTDENY_READ = *.gov, *.mil

HOSTDENY_ADMINISTRATOR_NEGOTIATOR = *

Hands-On Exercise #2

Configuring Host/IP Security

Hands-On Exercise #2

Please point your browser to the new instructions:
• Go back to the tutorial homepage
• Click on Host Security

Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm

If you exited Netscape, just click on “Tutorial” from your Start menu

Administering a Real Pool

Having a shared release directory is key

Viewing things with condor_status

Viewing things with condor_q

Having a shared release directory is key

Keep all of your config files in one place
• Allows you to have a real global config file, with common values across the whole pool
• Much easier to make changes (even for “local” config files in one shared directory)

Keep all of your binaries in one place
• Prevents having different versions accidentally left on different machines
• Easier to upgrade

Viewing things with condor_status

condor_status has lots of different options to display various kinds of info

Supports “-constraint” so you can only view ClassAds that match an expression you specify

Supports “-format” so you can get the data in whatever form you want (very useful for writing scripts)

View any kind of daemon ClassAd

Viewing things with condor_q

View the job queue

The “-long” option is useful to see the entire ClassAd for a given job

Also supports the “-constraint” option

Can view job queues on remote machines with the “-name” option

Hands-On Exercise #3

Setting up a Pool with a Shared Release Directory

Hands-On Exercise #3

Please point your browser to the new instructions:
• Go back to the tutorial homepage
• Click on Shared Release Directory

Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm

If you exited Netscape, just click on “Tutorial” from your Start menu

Advanced Installation Options

Spawning the Condor daemons automatically at reboot

“Full installation” of condor_compile

Advertising your own attributes in the machine ClassAd

Setting up Host/IP security in Condor (which we already talked about)

Customizing the startd policy

Spawning the Condor Daemons automatically at reboot

If you are running Condor as root, you probably want to have your boot scripts start the condor_master automatically

Provides more robust service, less manual work for the administrators

We provide a “SysV-style” init script:
• <release>/etc/examples/condor.boot

Exact details depend on your operating system platform

Why Perform a “Full Installation” of condor_compile?

condor_compile is used to re-link user jobs with the Condor libraries so they become “standard” jobs

By default, condor_compile only works with certain commands (gcc, g++, g77, cc, CC, f77, f90, ld)

With a “full-installation”, condor_compile will work with any command (in particular, “make”)

How to Perform a Full Installation of condor_compile:

Move your real ld binary, the “linker”, to “ld.real”
• The path to “ld” varies from platform to platform… though it’s usually “/bin/ld”

Install Condor’s “ld” script in its place

If condor_compile is used, our ld will do the Condor-specific magic

If not, our ld will just call the real ld and everything will work like normal

Advertising Your Own Attributes in the Machine ClassAd

Add new macro(s) to the config file
• This is usually done in the local config file
• Can name the macros anything, so long as the names don’t conflict with existing ones

Tell the condor_startd to include these other macros in the ClassAd it sends out
• Edit the STARTD_EXPRS macro to include the names of the macros you want to advertise (comma separated)

Hands-On Exercise #4

Defining Your Own Attributes in the Startd ClassAd

Hands-On Exercise #4

Please point your browser to the new instructions:
• Go back to the tutorial homepage
• Click on Local Startd Attributes

Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm

If you exited Netscape, just click on “Tutorial” from your Start menu

10 Minute Break

Questions are welcome….

Configuring the Startd Policy

Allows administrators or machine owners the power to control when and if Condor starts and stops jobs on a machine

Lots of flexibility: can base any policy expression on any attributes in the startd’s ClassAd, or the ClassAd of the currently running job

Many mechanisms available: suspending, checkpointing, hard kill, etc.


Basic Progression a Job Can Pass Through When an Owner Returns

No owner: job is running

The owner returns: job is suspended

• If the owner leaves again shortly, the job is resumed

The owner is still there: job is vacated

• soft-kill... do a checkpoint if possible

The vacate is taking too long:

• job is hard-killed (kill -9)


Introduction to the Policy Expressions

The policy expressions control the transitions between various “states” and “activities” a machine can be in

All expressions use boolean logic

It is common to define macros for complicated terms in your expressions to make them easier to read

Often, you only need to edit these macros to customize your policy


Machine States

[Diagram: the machine states Owner, Unclaimed, Matched, Claimed, and Preempting, with a “begin” marker showing where a machine starts]


Machine Activities

[Diagram: the machine states Owner, Unclaimed, Matched, Claimed, and Preempting, each containing one or more activities. Activity labels shown: Idle, Benchmarking, Busy, Suspended, Vacating, Killing]


The Policy Expressions

START

RANK

WANT_SUSPEND

SUSPEND

CONTINUE

PREEMPT

WANT_VACATE

KILL


Road Map of the Policy Expressions

[Flowchart: the expressions START, WANT_SUSPEND, SUSPEND, PREEMPT, WANT_VACATE, and KILL are connected by True/False branches that lead to the Vacating and Killing activities. The legend distinguishes expressions from activities.]


The START expression

The most important policy expression

This is the “requirements” expression for machines

Controls when Condor will start jobs

Can reference attributes of the job (such as its size or the user who submitted it)

A machine will only leave the Owner state if START evaluates to True (or Undefined)


Example Start Expressions

KeyboardIsIdle = (KeyboardIdle > (15 * $(MINUTE)))

CPUIsIdle = (LoadAvg - CondorLoadAvg < 0.3)

START : $(KeyboardIsIdle) && $(CPUIsIdle)

or

START : Owner == "wright" || Owner == "condor" || \
        ($(KeyboardIsIdle) && $(CPUIsIdle))

or

START : True


The RANK Expressions

Both machines and jobs can “rank” what they’re looking for

If a machine is claimed, it still advertises that it’s available

• Always looking for a higher-ranked job

• Will preempt the current job if a better one is available

Jobs can rank machines - in a large pool, users can prefer certain hosts


Using Machine RANK Expressions

The expression evaluates to a floating point number

• Use “+” instead of “&&”

• (X == Y) evaluates to 0 or 1

• Allows unlimited flexibility

Often used in large pools made up of individual groups, so that machines owned by one group will always run jobs submitted by the users in that group


Example Rank Expression

MachineOwner = (Owner == "wright")

Friend = (Owner == "tannenba" || \
          Owner == "ballard")

ResearchGroup = (Owner == "jbasney" || \
                 Owner == "raman")

Rank : Friend + ResearchGroup*10 + \
       MachineOwner*20

Startd_Exprs = $(Startd_Exprs), Friend, \
               MachineOwner, ResearchGroup


Example Rank Expression Explained

First, we define different groups of people that we’re interested in (Friend, ResearchGroup and MachineOwner)

Then, we define the Rank (it’s an expression, so we need to use “:”) to give different weights to each group

Finally, we add these new attributes to the list of attributes we publish so that Rank can be evaluated remotely


WANT_SUSPEND vs. SUSPEND

WANT_SUSPEND determines if the startd should even consider entering the Suspended activity:

• If WANT_SUSPEND is True, while a job is running, SUSPEND is checked, and if it evaluates to True, the job is suspended

• If WANT_SUSPEND is False, SUSPEND is never evaluated, and while the job is running, PREEMPT is checked
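
A minimal sketch of a policy that uses suspension, written with the “:” syntax for expressions used elsewhere in this tutorial. The helper macros KeyboardBusy and CPUBusy and their thresholds are assumptions chosen for illustration. KeyboardIdle, LoadAvg and CondorLoadAvg are the machine ClassAd attributes already used in the START examples.

```
# Illustrative helper macros (names and thresholds are assumptions)
# Omit MINUTE if your config file already defines it
MINUTE = 60
KeyboardBusy = (KeyboardIdle < $(MINUTE))
CPUBusy = ((LoadAvg - CondorLoadAvg) >= 0.5)

# Consider suspension, and suspend when the owner is active
WANT_SUSPEND : True
SUSPEND : $(KeyboardBusy) || $(CPUBusy)
```

Setting WANT_SUSPEND to False instead would skip SUSPEND entirely, and PREEMPT would be checked while the job runs.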


CONTINUE

Only evaluated while in the Suspended activity (WANT_SUSPEND must therefore be True)

If CONTINUE evaluates to True, the job is resumed and the machine goes back to the Busy activity
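
One possible CONTINUE expression, given only as a sketch. The 5 minute idle time and the 0.3 load threshold are example values, and CPUIsIdle is the same helper macro used in the START examples.

```
# Illustrative helper macros (thresholds are assumptions)
# Omit MINUTE if your config file already defines it
MINUTE = 60
CPUIsIdle = (LoadAvg - CondorLoadAvg < 0.3)

# Resume the suspended job once the owner has been away for 5 minutes
CONTINUE : $(CPUIsIdle) && (KeyboardIdle > (5 * $(MINUTE)))
```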


PREEMPT

Specifies when a machine enters the Preempting state

Must handle two cases (and usually has two separate terms in the expression):

• WANT_SUSPEND is True, and the job has been suspended longer than the owner wants

• WANT_SUSPEND is False, and the owner is using the machine again
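
The two cases above could be written as two terms along the following lines. This is a sketch under assumptions: the helper macro names and the 10 minute limit are invented for illustration, and it assumes the machine ClassAd exposes Activity, CurrentTime and EnteredCurrentActivity. Check the ClassAd your startd actually advertises before relying on these names.

```
# Illustrative helper macros (names and thresholds are assumptions)
# Omit MINUTE if your config file already defines it
MINUTE = 60
MaxSuspendTime = (10 * $(MINUTE))
ActivityTimer = (CurrentTime - EnteredCurrentActivity)
KeyboardBusy = (KeyboardIdle < $(MINUTE))

# Term 1: the job has been suspended for too long
# Term 2: the job is running (no suspension) and the owner is back
PREEMPT : ((Activity == "Suspended") && ($(ActivityTimer) > $(MaxSuspendTime))) || \
          ((Activity == "Busy") && $(KeyboardBusy))
```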


WANT_VACATE vs. KILL

WANT_VACATE is only evaluated when PREEMPT is True and the machine is entering the Preempting state

Determines if a vacate (checkpoint) is wanted, or if the job should be immediately hard killed

KILL is only evaluated if the job is checkpointing (WANT_VACATE was True)

If KILL evaluates to True, the job is hard-killed
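
A sketch of a policy that always tries a checkpoint first and gives up after a time limit. The macro names and the 10 minute limit are assumptions for illustration, and the timer assumes the machine ClassAd exposes CurrentTime and EnteredCurrentActivity.

```
# Illustrative helper macros (names and thresholds are assumptions)
# Omit MINUTE if your config file already defines it
MINUTE = 60
MaxVacateTime = (10 * $(MINUTE))
ActivityTimer = (CurrentTime - EnteredCurrentActivity)

# Always attempt a vacate (checkpoint) before killing
WANT_VACATE : True

# Hard-kill if the vacate has taken longer than the limit
KILL : $(ActivityTimer) > $(MaxVacateTime)
```

With WANT_VACATE set to False, the job would be hard-killed immediately when the machine enters the Preempting state.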


Every Possible Transition

[Diagram: every possible transition, labeled with the policy expressions START, WANT_SUSPEND, SUSPEND, CONTINUE, PREEMPT, WANT_VACATE, and KILL]


Final Notes on Startd Policy

Please read the Administrator’s Manual to Condor for a complete explanation of the previous diagram

• See the chapter on “Configuring the Startd Policy”

This is all pretty confusing and complex:

• If you have questions, please send them to [email protected]

• We can try to translate an English explanation of the policy you want into expressions for Condor


Hands-On Exercise #5: Customizing the Startd Policy Expressions


Hands-On Exercise #5

Please point your browser to the new instructions:

• Go back to the tutorial homepage

• Click on Startd Policy

Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm

If you exited Netscape, just click on “Tutorial” from your Start menu


When something goes wrong...

Looking at “condor_q -analyze”

Looking at the “UserLog”

Looking at the “ShadowLog”

Looking at the other daemons’ log files

Condor is a large, distributed system, so analyzing problems can be very difficult:

• We’ll give you the basics of where to begin

• If you can’t figure it out, send us email and we’ll be able to help you


Looking at condor_q -analyze

You specify a job or set of jobs you want to analyze

condor_q will try to figure out why the job isn’t running

The output is not as user-friendly as we’d like (though we’re working on it)

Good at finding errors in Requirements expressions set by users


Looking at the UserLog

When the user submits a job, she/he can specify a “UserLog” in their submit file

This will contain a record of if and where the job ran, if it checkpointed, if it was kicked off without a checkpoint, etc.

Very useful in figuring out where a job was running when it was having problems, and to monitor the progress of the job

Required by DAGMan and others


Looking at the ShadowLog

Of the log files generated by the Condor daemons, the ShadowLog usually has the most useful information when debugging a problem with a job

You often want to increase the “Debug Level” of the Shadow and increase the maximum size of the file to get more info:

SHADOW_DEBUG = D_SYSCALLS D_FULLDEBUG

MAX_SHADOW_LOG = 1000000


Analyzing the ShadowLog

Incorrect file permissions or files that were removed are the most common errors

Often useful to grep for a certain job ID

• grep "25\.3" ShadowLog | less

At the end of the log, you might find an entry that looks something like “ERROR:”

• While not always the most clear, these entries usually give a very good indication of the problem


Looking at the Other Daemons’ Log Files

If there is no ShadowLog, or no ShadowLog entries for the job with problems, you might have a problem even finding a match for the job

Look in the SchedLog to see if there are errors communicating with the Negotiator

Check for host permission problems

Look at the NegotiatorLog on the CM: is it even negotiating jobs at all?


Analyzing the Logs

All daemons can display more debugging:

• D_FULLDEBUG and D_COMMAND

Can also get timestamps in the logs that include seconds with D_SECONDS, which can help pinpoint a problem

Logs will often rotate quickly with heavy debugging output, so increase MAX_*_LOG as much as your disk space allows
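
For example, to get more detail from the schedd and the negotiator, settings along these lines could be added to the config file. This is a sketch that follows the SHADOW_DEBUG and MAX_SHADOW_LOG pattern shown earlier; the log sizes are arbitrary example values, so pick ones that fit your disk.

```
# Heavier debugging for the schedd (example values)
SCHEDD_DEBUG = D_FULLDEBUG D_COMMAND D_SECONDS
MAX_SCHEDD_LOG = 4000000

# Heavier debugging for the negotiator on the CM (example values)
NEGOTIATOR_DEBUG = D_FULLDEBUG D_COMMAND D_SECONDS
MAX_NEGOTIATOR_LOG = 4000000
```

The daemons need to re-read their configuration before the new debug levels take effect.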

Unfortunately, Condor’s logs are still primarily useful only to the developers

We’re working on changing that


Hands-On Exercise #6: Examining the Logs


Hands-On Exercise #6

Please point your browser to the new instructions:

• Go back to the tutorial homepage

• Click on Examining the Logs

Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm

If you exited Netscape, just click on “Tutorial” from your Start menu


Obtaining Condor

Condor can be downloaded from the Condor web site at: http://www.cs.wisc.edu/condor

Complete Users and Administrators manual available: http://www.cs.wisc.edu/condor/manual

Contracted Support is available

Questions? Email: [email protected]