Post on 12-Jan-2016
Installing and Managing a Large Condor Pool
Derek Wright
Computer Sciences Department
University of Wisconsin-Madison
wright@cs.wisc.edu
www.cs.wisc.edu/condor
2
Talk Outline
What is Condor and why is it good for large clusters?
• The Condor Daemons (the sys admin view)
• A look at the UW-Madison Computer Science Condor Pool and Cluster
• Some other features of Condor that help for big pools
• Future work
3
What is Condor?
A system of daemons and tools that harness desktop machines and commodity computing resources for High Throughput Computing
• Large numbers of jobs over long periods of time
• Not High Performance Computing, which is short bursts of lots of compute power
4
What is Condor? (Cont’d)
Condor matches jobs with available machines using “ClassAds”
• “Available machines” can be:
– Idle desktop workstations
– Dedicated clusters
– SMP machines
Can also provide checkpointing and process migration (if you re-link your application against our library)
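To make the matching idea concrete, here is a minimal Python sketch of two-way matchmaking. It is not Condor's ClassAd language or its negotiator algorithm: the attribute names (Arch, OpSys, Memory, ImageSize), the use of Python functions as "Requirements", and the greedy first-fit pairing are all assumptions made for illustration.

```python
def matches(job_ad, machine_ad):
    """Two-way match: each side's Requirements must accept the other ad."""
    return job_ad["Requirements"](machine_ad) and machine_ad["Requirements"](job_ad)


def matchmake(jobs, machines):
    """Greedily pair each job with the first free machine that matches it."""
    free = list(machines)
    pairs = []
    for job in jobs:
        for machine in free:
            if matches(job, machine):
                pairs.append((job["Name"], machine["Name"]))
                free.remove(machine)  # a machine runs one job in this sketch
                break
    return pairs


# Hypothetical machine ads: an idle desktop and an SMP machine.
machines = [
    {"Name": "desk1", "Arch": "INTEL", "OpSys": "LINUX", "Memory": 128,
     "Requirements": lambda job: job["ImageSize"] <= 128},
    {"Name": "smp1", "Arch": "SUN4u", "OpSys": "SOLARIS", "Memory": 1024,
     "Requirements": lambda job: True},
]

# Hypothetical job ads, each stating what it needs from a machine.
jobs = [
    {"Name": "job1", "ImageSize": 64,
     "Requirements": lambda m: m["Arch"] == "INTEL" and m["OpSys"] == "LINUX"},
    {"Name": "job2", "ImageSize": 512,
     "Requirements": lambda m: m["Memory"] >= 512},
]

pairs = matchmake(jobs, machines)
```

The point of the sketch is that both sides advertise attributes and both sides get a say: the machine owner's policy is checked as well as the job's needs.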
5
What’s Condor Good For?
Managing a large number of jobs
• You specify the jobs in a file and submit them to Condor, which runs them all and sends you email when they complete
• Mechanisms to help you manage huge numbers of jobs (1000’s), all the data, etc.
• Condor can handle inter-job dependencies (DAGMan)
6
What’s Condor Good For? (cont’d)
Managing a large number of machines
• Condor daemons run on all the machines in your pool and are constantly monitoring machine state
• You can query Condor for information about your machines
• Condor handles all background jobs in your pool with minimal impact on your machine owners
7
Why is Condor Good for Large Clusters?
Fault-Tolerance at all levels of Condor
• Even “dedicated” resources should be treated like they might disappear at any minute (Condor has been doing this since 1985… we’ve got a lot of experience)
• Checkpointing jobs (when possible) makes scheduling a lot easier, and ensures forward progress
Eases monitoring
8
Condor on Large Clusters (cont’d)
Manages ALL your resources and jobs under one system
• Easier for users and administrators
Easy to install and use
• No queues to configure or choose from
It’s developed by former system administrators (all the full-time staff)
It’s free (that scales really well)
9
What is a Condor Pool?
“Pool” can be a single machine or a group of machines
Determined by a “central manager” - the matchmaker and centralized information repository
Each machine runs various daemons to provide different services, either to the users who submit jobs, the machine owners, or the pool itself
10
Talk Outline
• What is Condor and why is it good for large clusters?
The Condor Daemons (the sys admin view)
• A look at the UW-Madison Computer Science Condor Pool and Cluster
• Some other features of Condor that help for big pools
• Future work
11
The Condor Daemons
condor_master       Administrator Agent
condor_collector    Centralized Repository of ClassAds
condor_negotiator   Performs Matchmaking
condor_startd       Resource Agent (Machine)
condor_schedd       User Agent (Jobs)
condor_starter      Monitors/Manages a Job Process
condor_shadow       Handles Remote System Calls, Intra-Job Resource Management
condor_dagman       Manages Inter-Job Dependencies
condor_eventd       Pool-Wide Events
12
Layout of a Personal Condor Pool
[Diagram: a single Central Manager machine running the master, collector, negotiator, schedd, and startd. Legend: ClassAd communication pathway; process spawned.]
13
Layout of a General Condor Pool
[Diagram: a Central Manager (master, collector, negotiator, schedd, startd), two Regular Nodes (master, schedd, startd), two Execute-Only nodes (master, startd), and one Submit-Only node (master, schedd). Legend: ClassAd communication pathway; process spawned.]
14
condor_master daemon
Starts up all other Condor daemons
If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator
Checks the time stamps on the binaries it is configured to spawn, and if new binaries appear, the master will gracefully shut down the currently running version and start the new version
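The time-stamp check can be pictured with a small Python sketch. This is not the condor_master's code: the function name, the dictionary of last-seen modification times, and the fake binary used in the demo are all hypothetical, and a real master would follow a detected change with a graceful shutdown and restart of that daemon.

```python
import os
import tempfile


def changed_binaries(paths, last_seen):
    """Return the paths whose mtime is newer than the one recorded earlier.

    last_seen maps path -> mtime and is updated in place on every call.
    """
    changed = []
    for path in paths:
        mtime = os.stat(path).st_mtime
        if path in last_seen and mtime > last_seen[path]:
            changed.append(path)
        last_seen[path] = mtime
    return changed


# Demo with a stand-in "daemon binary" in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    binary = os.path.join(tmp, "fake_daemon")
    with open(binary, "w") as f:
        f.write("v1")

    seen = {}
    first = changed_binaries([binary], seen)   # first sighting, nothing to restart
    newer = seen[binary] + 10
    os.utime(binary, (newer, newer))           # simulate installing a new binary
    second = changed_binaries([binary], seen)  # now reported as changed
```

Polling modification times like this is what lets an administrator upgrade a machine simply by dropping new binaries in place.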
15
condor_master (cont’d)
Provides access to many remote administration commands:
• condor_reconfig
• condor_restart, condor_off, condor_on
Default server for many other commands:
• condor_config_val, etc.
Periodically runs condor_preen to clean up any files Condor might have left on the machine (the rest of the daemons clean up after themselves, as well)
16
condor_collector
Collects information from all other Condor daemons in the pool
Each daemon sends a periodic update called a “ClassAd” to the collector
Services queries for information:
• Queries from other Condor daemons
• Queries from users (condor_status)
Can store historical pool data
18
condor_eventd
Administrators specify events in a config file (similar to a crontab, but not exactly):
• Date and time
• What kind of event (currently, only “shutdown” is supported)
• What machines the event affects (ClassAd constraint)
19
condor_eventd (cont’d)
When an event is approaching, EventD will wake up and query the condor_collector for all machines that match the constraint
EventD then knows how big all the jobs are that are currently running on the affected nodes, network bandwidth to the nearest checkpoint servers, etc.
EventD plans evictions to allow the most computation w/o flooding the net
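The slides do not describe EventD's actual planning algorithm, so the following Python sketch only illustrates the trade-off. It assumes a single shared link to one checkpoint server, one checkpoint transfer at a time (so the network is never flooded), and a hypothetical RunningJob record holding the image size. Working backward from the shutdown time, it evicts the largest image first, which keeps the total lost compute time as small as possible under those assumptions.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    name: str
    image_mb: float  # size of the checkpoint image in megabytes


def plan_evictions(jobs, deadline_s, bandwidth_mb_per_s):
    """Return (job name, eviction start time) pairs in eviction order.

    Transfers are laid out back to back so the last one ends exactly at
    the deadline; the smallest image takes the last slot.
    """
    plan = []
    t = deadline_s
    for job in sorted(jobs, key=lambda j: j.image_mb):
        t -= job.image_mb / bandwidth_mb_per_s  # time this checkpoint needs
        plan.append((job.name, t))
    plan.reverse()  # earliest eviction first
    return plan


jobs = [RunningJob("A", 100.0), RunningJob("B", 50.0)]
plan = plan_evictions(jobs, deadline_s=100.0, bandwidth_mb_per_s=10.0)
```

With a 10 MB/s link and a shutdown at t = 100 s, job A (100 MB) is evicted at t = 85 s and job B (50 MB) at t = 95 s, so both keep computing for as long as the link allows.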
20
Talk Outline
• What is Condor and why is it good for large clusters?
• The Condor Daemons (the sys admin view)
A look at the UW-Madison Computer Science Condor Pool and Cluster
• Some other features of Condor that help for big pools
• Future work
21
Large Condor Pools in HEP and Government Research
UW-Madison CS (~750 nodes)
INFN (~270 nodes)
CERN/Chorus (~100 nodes)
NASA Ames (~330 nodes)
NCSA (~200 nodes)
22
Layout of the UW-Madison Pool
[Diagram: Central Manager; Dedicated Linux Cluster (~200 cpus); Instructional Computer Labs (~225 cpus); Desktop Workstations (~325 cpus); two Checkpoint Servers; Dedicated Scheduler; EventD; flocking to other pools; submit-only machines at other sites.]
23
Composition of the UW/CS Cluster
Current cluster: 100 Dual XEON 550MHz with 1 gig of RAM (tower cases)
New nodes being installed: 150 Dual 933MHz Pentium III, 36 nodes w/ 2 gigs of RAM, the rest w/ 1 gig (2U racks)
100 Mbit Switched Ethernet to nodes
Gigabit Ethernet to the file servers and checkpoint server
24
Composition of the rest of the UW/CS Pool
Instructional Labs
• 60 Intel/Linux
• 60 Sparc/Solaris
• 105 Intel/NT
“Desktop Workstations”
• Includes 12 and 8-way Ultra E6000s, other SMPs, and real desktops, etc.
Central Manager - 600MHz Pentium III running Solaris, 512 Megs RAM
25
Talk Outline
• What is Condor and why is it good for large clusters?
• The Condor Daemons (the sys admin view)
• A look at the UW-Madison Computer Science Condor Pool and Cluster
Some other features of Condor that help for big pools
• Future work
26
Condor’s Configuration
Condor’s configuration is a concatenation of multiple files, in order - definitions in later files overwrite previous definitions
Layout and purpose of the different files:
• Global config file
• Other shared files
• Local config file
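The "later files win" rule can be sketched in a few lines of Python. This is a simplified stand-in for Condor's config parser: it handles only NAME = value lines and comments (no macro substitution or line continuation), and the host name and daemon lists are example values, not settings taken from the talk.

```python
def parse_config(text):
    """Parse NAME = value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            settings[name.strip()] = value.strip()
    return settings


def load_layers(*texts):
    """Merge config texts in order; later definitions overwrite earlier ones."""
    merged = {}
    for text in texts:
        merged.update(parse_config(text))
    return merged


global_cfg = """
# shared by the whole pool
CONDOR_HOST = cm.example.edu
DAEMON_LIST = MASTER, STARTD, SCHEDD
"""

local_cfg = """
# this machine is the central manager, so it runs extra daemons
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, STARTD, SCHEDD
"""

config = load_layers(global_cfg, local_cfg)
```

Here the local file redefines only the daemon list; every other setting still comes from the global file.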
27
Global Config File
All shared settings across your entire pool
Found either in file pointed to with the CONDOR_CONFIG environment variable, /etc/condor/condor_config, or the home directory of the “condor” user
Most settings can be in this file
Only works as a “global” file if it is on a shared file system (HIGHLY recommended for large sites!)
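A rough Python sketch of that lookup follows. It assumes the three locations are tried in the order listed and that the file in the “condor” user's home directory is also named condor_config; neither detail is stated on the slide, and the real behavior when CONDOR_CONFIG points at a missing file may differ. The env, condor_home, and exists parameters are hypothetical hooks that keep the function easy to test.

```python
import os
import posixpath


def find_global_config(env, condor_home, exists=os.path.isfile):
    """Return the first candidate global config file that exists, else None."""
    candidates = []
    if env.get("CONDOR_CONFIG"):
        candidates.append(env["CONDOR_CONFIG"])
    candidates.append("/etc/condor/condor_config")
    # Assumed file name inside the condor user's home directory.
    candidates.append(posixpath.join(condor_home, "condor_config"))
    for path in candidates:
        if exists(path):
            return path
    return None


# Demo with a fake file system in which only the home-directory copy exists.
fake_files = {"/home/condor/condor_config"}
found = find_global_config({}, "/home/condor", exists=fake_files.__contains__)
```

On a real machine the call would be find_global_config(os.environ, condor_home) with the default exists check.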
28
Other shared files
You can configure a number of other shared config files:
• files to hold common settings to make it easier to maintain (for example, all policy expressions, which we’ll see later)
• platform-specific config files
29
Local config file
Any machine-specific settings
• local policy settings for a given owner
• different daemons to run (for example, on the Central Manager)
Can either be on the local disk of each machine, or have separate files in a shared directory, each named by hostname
For large sites: keep them all on AFS or NFS, and in CVS, if possible
30
Daemon-specific configuration
You can also change daemon-specific settings with condor_config_val
Use the “-set” option for persistent changes, or “-rset” for memory-resident only
Used by the EventD
Can be used by other entities for various remote-administration tasks
31
Advertising Your Own Attributes in the Machine ClassAd
Add new macro(s) to the config file
• This is usually done in the local config file
• Can name the macros anything, so long as the names don’t conflict with existing ones
Tell the condor_startd to include these other macros in the ClassAd it sends out
• Edit the STARTD_EXPRS macro to include the names of the macros you want to advertise (comma separated)
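As a rough model of what the startd does with STARTD_EXPRS, the Python sketch below copies the named macros from a config dictionary into a machine ad. Only STARTD_EXPRS comes from the slide; the macro names HAS_MATLAB, BUILDING, and UNADVERTISED, the dictionary representation of the config and the ad, and the function name are invented for illustration.

```python
def machine_classad(config, base_ad):
    """Copy every macro named in STARTD_EXPRS from the config into the ad."""
    ad = dict(base_ad)
    raw = config.get("STARTD_EXPRS", "")
    names = [name.strip() for name in raw.split(",") if name.strip()]
    for name in names:
        if name in config:
            ad[name] = config[name]
    return ad


# Hypothetical local config file contents, already parsed into a dict.
local_config = {
    "HAS_MATLAB": "True",
    "BUILDING": '"CS"',
    "UNADVERTISED": "42",  # defined, but not listed in STARTD_EXPRS
    "STARTD_EXPRS": "HAS_MATLAB, BUILDING",
}

ad = machine_classad(local_config, {"Name": "desk1"})
```

Only the macros listed in STARTD_EXPRS appear in the ad, so jobs could then require, for example, a machine that advertises HAS_MATLAB.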
32
Host/IP Security in Condor
You can configure each machine in your pool to allow or deny certain actions from different groups of machines:
• “read” access - querying information
– condor_status, condor_q, etc
• “write” access - updating information
– condor_submit, adding a node to the pool, etc
• “administrator” access
– condor_on, off, reconfig, restart...
• “owner” access
– Things a machine owner can do (vacate)
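The allow/deny idea can be sketched in Python as below. The policy layout, the host names, and the rule that a deny entry always beats an allow entry are assumptions chosen for the example; they are not Condor's configuration syntax or its exact evaluation order.

```python
from fnmatch import fnmatch

# Hypothetical per-machine policy: host name patterns for each access level.
POLICY = {
    "read": {"allow": ["*.example.edu"], "deny": []},
    "write": {"allow": ["*.cs.example.edu"], "deny": ["lab13.cs.example.edu"]},
    "administrator": {"allow": ["cm.cs.example.edu"], "deny": []},
    "owner": {"allow": ["desk1.cs.example.edu"], "deny": []},
}


def is_allowed(policy, level, host):
    """Deny wins; otherwise the host must match an allow pattern."""
    rules = policy[level]
    if any(fnmatch(host, pattern) for pattern in rules["deny"]):
        return False
    return any(fnmatch(host, pattern) for pattern in rules["allow"])


can_query = is_allowed(POLICY, "read", "pc7.physics.example.edu")
can_submit = is_allowed(POLICY, "write", "lab13.cs.example.edu")
can_admin = is_allowed(POLICY, "administrator", "cm.cs.example.edu")
```

In this example any campus machine may query the pool, one lab machine is blocked from submitting, and only the central manager may issue administrative commands.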
33
The Different Versions of Condor
We distribute two versions of Condor:
• Stable Series
– Heavily tested, recommended for use
– 2nd number of version string is even (6.2.0)
• Development Series
– Latest features, not necessarily well-tested
– 2nd number of version string is odd (6.3.0)
– Not recommended unless you know what you are doing and/or need a new feature
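The even/odd rule is easy to check mechanically. The Python helper below is a small illustration; it assumes a bare dotted version such as "6.2.0" rather than the full version string a daemon publishes, and the function name is made up.

```python
def release_series(version):
    """Classify a dotted version by the parity of its second number."""
    second = int(version.split(".")[1])
    return "stable" if second % 2 == 0 else "development"


stable_example = release_series("6.2.0")
development_example = release_series("6.3.0")
```

A check like this could be used in an install script to warn before a development-series release goes onto production machines.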
34
Condor Versions (cont’d)
All daemons advertise a CondorVersion attribute in the ClassAd they publish
You can also view the version string by running ident on any Condor binary
In general, all parts of Condor on a single machine should run the same version
Machines in a pool can usually run different versions and communicate with each other
It will be made very clear when a version is incompatible with older versions
35
Talk Outline
• What is Condor and why is it good for large clusters?
• The Condor Daemons (the sys admin view)
• A look at the UW-Madison Computer Science Condor Pool and Cluster
• Some other features of Condor that help for big pools
Future work
36
Future Work
User Authentication and Authorization
• Have Kerberos and X.509 authentication in beta mode already
• Will integrate w/ Condor tools to get rid of Host/IP authorization and move to user-based authorization
• Will enable encrypted channels to securely move data (including AFS tokens)
37
Future Work (cont’d)
Digitally Signed Binaries
• Condor Team will digitally sign binaries we release
• condor_master will only spawn new daemons if they are properly signed
More interesting dedicated scheduling
Condor RPMs
Addressing scalability
38
Obtaining Condor
Condor can be downloaded from the Condor web site at: http://www.cs.wisc.edu/condor
Complete Users and Administrators manual available: http://www.cs.wisc.edu/condor/manual
Contracted Support is available
Questions? Email: condor-admin@cs.wisc.edu