Condor Week Summary
March 14-16, 2005
Madison, Wisconsin
Overview
• Annual meeting at UW-Madison.
• About 80 participants at this year’s meeting.
• Participants come from universities, research labs and industry.
• Single plenary sessions with talks from users and developers.
Overview
• Topics ranged from basic to advanced.
• Selected highlights in today’s talk.
• Slides from this year’s talks can be found at http://www.cs.wisc.edu/condor/CondorWeek2005
CondorWeek Topics
• distributed computing and Condor
• data handling and Condor
• 3rd party contributions to Condor
• reports from the field
• Condor roadmap
Condor Grids (by Alan De Smet)
• Various alternatives for accessing remote computing resources (distributed computing, flocking, Globus/Condor-G, Condor-C, etc.).
• Discussed pros and cons of each approach (ACF uses Globus/Condor-G).
Condor-G Status and News
• Globus Toolkit 2 is stable
• Globus Toolkit 3 is supported
– But we think most people are moving to…
• Globus Toolkit 4 in progress
– GT4 beta works now in Condor 6.7.6
– Condor will officially support soon after official GT4 release.
Glidein (by Dan Bradley)
• You have access to a cluster running some other batch system.
• You want Condor features, such as
– queue management
– matchmaking
– checkpoint migration
What Does Glidein Do?
• Installation and setup of Condor.
– May be done remotely.
• Launching Condor.
– Through Condor-G submission to Globus.
– Or you run the startup script however you like.
Condor and DBMS (by Jeff Naughton)
• Premise: A running Condor system is awash in data:
– Operational data
– Historical data
– User data
• DBMS technology can help capture, organize, manage, archive, and query this data.
Three potential levels of involvement
1. Passively collect and organize data, expose it through DB query interfaces.
2. Move/extend some data-related portions of Condor to DBMS (Condor writes to and reads from DBMS)
3. Provide services to help users manage their data.
Why do this?
• For Condor administrators
– Easier to analyze and troubleshoot;
– Easier to audit;
– Easier to explore current and past system status and behavior.
Our projects and plans
• Quill: Transparently provide a DBMS query interface to job_queue and history data. [ready to deploy!]
• CondorDB: Transparently captures and provides interface to critical data from all Condor daemons. [status: partial prototype working in our own “sandbox”]
Quill
• Job ClassAds information mirrored into an RDBMS
• Both active jobs and historical jobs
• Benefits BOTH scalability and accessibility
[Diagram: Master, Startd, Schedd, Job Queue log, Quill, RDBMS (Queue + History Tables)]
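The idea of mirroring job information into a relational database and querying it with SQL can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module; the table layout, column names, and sample jobs are assumptions made for the example and are not Quill's actual schema.

```python
import sqlite3

def build_job_db(jobs):
    """Mirror job records into an in-memory SQL table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        "cluster_id INTEGER, owner TEXT, status TEXT, historical INTEGER)"
    )
    conn.executemany("INSERT INTO jobs VALUES (?, ?, ?, ?)", jobs)
    conn.commit()
    return conn

def jobs_per_owner(conn, historical):
    """Count jobs per owner among active (0) or historical (1) jobs."""
    cursor = conn.execute(
        "SELECT owner, COUNT(*) FROM jobs "
        "WHERE historical = ? GROUP BY owner ORDER BY owner",
        (historical,),
    )
    return dict(cursor.fetchall())

# Made-up sample data: (cluster_id, owner, status, historical)
SAMPLE_JOBS = [
    (1, "alice", "running", 0),
    (2, "alice", "idle", 0),
    (3, "bob", "running", 0),
    (4, "bob", "completed", 1),
]

conn = build_job_db(SAMPLE_JOBS)
print(jobs_per_owner(conn, historical=0))
```

Keeping active and historical jobs in one queryable store is what lets a single SQL statement answer questions that would otherwise need separate queue and history tools.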
Longer-term plans
• Tight integration of DBMS technology and Condor [status: thinking hard!].
• DBMS-inspired data management services to help Condor users manage their own data. [status: thinking really hard!]
Stork (by Tevfik Kosar)
• Condor tool for data movement.
• First available in v. 6.7.6. Will be included in next stable release (6.8.0).
• Prototypes deployed at various sites.
[Figure: data-intensive applications and their data volumes]
• Bioinformatics: BLAST
• High Energy Physics: LHC
• Astronomy: LSST, 2MASS, SDSS, DPOSS, GSC-II, WFCAM, VISTA, NVSS, FIRST, GALEX, ROSAT, OGLE, ...
• Educational Technology: WCER EVP
• Data volumes shown in the figure: 500 TB/year; 2-3 PB/year; 11 PB/year; 20 TB - 1 PB/year
Stork: Data Placement Scheduler
• First scheduler specialized for data movement/placement.
• De-couples data placement from computation.
• Understands the characteristics and semantics of data placement jobs.
• Can make smart scheduling decisions for reliable and efficient data placement.
http://www.cs.wisc.edu/condor/stork
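To make the de-coupling of data placement from computation concrete, here is a minimal sketch in which stage-in and stage-out are separate steps that can be retried independently of the compute step. The function names, the retry policy, and the simulated failures are all assumptions for illustration; this is not Stork's interface or its actual scheduling logic.

```python
def run_with_retries(transfer, max_attempts=3):
    """Run a data placement step, retrying on failure.

    Returns the number of attempts used; re-raises after the last attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            transfer()
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise

def make_flaky_transfer(failures_before_success):
    """Build a hypothetical transfer that fails a fixed number of times first."""
    state = {"remaining": failures_before_success}

    def transfer():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise OSError("simulated transfer failure")

    return transfer

def run_workflow(stage_in, compute, stage_out):
    """Stage data in, compute, stage data out; log attempts per step."""
    log = [("stage_in", run_with_retries(stage_in))]
    result = compute()
    log.append(("compute", 1))
    log.append(("stage_out", run_with_retries(stage_out)))
    return result, log

result, log = run_workflow(
    make_flaky_transfer(2),   # stage-in fails twice, then succeeds
    lambda: 42,               # stand-in for the real computation
    make_flaky_transfer(0),   # stage-out succeeds immediately
)
print(result, log)
```

Because the transfers are their own steps, a failed stage-in is retried without re-running or even starting the computation.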
Stork can also:
• Allocate/de-allocate (optical) network links
• Allocate/de-allocate storage space
• Register/un-register files to Meta Data Catalog
• Locate physical location of a logical file name
• Control concurrency levels on storage servers
Storage Management (by Jeff Weber)
• NeST (Network Storage Technology) is another project at UW-Madison.
• To be coupled to Condor and Stork.
• No stable release available yet.
Overview of NeST
• NeST: Network Storage Technology
• Lightweight: Configuration and installation can be performed in minutes.
• Multi-protocol: Supports Chirp, GridFTP, NFS, HTTP
– Chirp is NeST’s internal protocol
• Secure: GSI authentication
• Allocation: NeST negotiates “mini storage contracts” between users and server.
Why storage allocations?
• Users need both temporary storage, and long-term guaranteed storage.
• Administrators need a storage solution with configurable limits and policy.
• Administrators will benefit from NeST’s autonomous reclamations of expired storage allocations.
Storage allocations in NeST
• Lot – abstraction for storage allocation with an associated handle
– Handle is used for all subsequent operations on this lot
• Client requests lot of a specified size and duration. Server accepts or rejects client request.
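The lot idea (a client asks for a size and a duration, the server accepts or rejects, and expired lots are reclaimed) can be sketched with a toy in-memory server. The class name, method names, integer handles, and explicit `now` argument are assumptions chosen for the example; they do not reflect NeST's real protocol or API.

```python
import itertools

class LotServer:
    """Toy storage server granting 'lots' (size + duration) from a fixed capacity."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.lots = {}  # handle -> (size_mb, expiry_time)
        self._handles = itertools.count(1)

    def used_mb(self):
        """Total space held by lots that have not been reclaimed."""
        return sum(size for size, _ in self.lots.values())

    def reclaim_expired(self, now):
        """Drop every lot whose duration has run out."""
        expired = [h for h, (_, expiry) in self.lots.items() if expiry <= now]
        for handle in expired:
            del self.lots[handle]

    def request_lot(self, size_mb, duration_s, now):
        """Return a handle if the request fits, or None if it is rejected."""
        self.reclaim_expired(now)
        if self.used_mb() + size_mb > self.capacity_mb:
            return None
        handle = next(self._handles)
        self.lots[handle] = (size_mb, now + duration_s)
        return handle

server = LotServer(capacity_mb=100)
first = server.request_lot(60, duration_s=10, now=0)    # accepted
second = server.request_lot(60, duration_s=10, now=5)   # rejected: only 40 MB free
third = server.request_lot(60, duration_s=10, now=10)   # accepted: first lot expired
print(first, second, third)
```

In this sketch reclamation happens lazily on each new request; a real server could just as well reclaim on a timer.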
Condor and SRM (by Derek Wright)
• Coordinate computation and data movement with Condor.
• Condor ClassAd hook (STARTD_CRON_JOBS) queries DRM for files in cache and publishes them in the ClassAd for each node.
• FSM keeps track of all files required by jobs in the system and contacts HRM if required files are missing.
• Regular Condor matchmaking schedules jobs where files exist.
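The scheduling idea above, placing jobs on nodes that already hold their files, can be sketched as a simple set comparison. The data structures and the function name are assumptions for illustration; real Condor matchmaking evaluates ClassAd expressions and also accounts for slots, priorities, and ranks, none of which are modeled here.

```python
def match_jobs_to_nodes(jobs, node_caches):
    """Match each job to a node whose cache holds all of the job's files.

    jobs: dict of job name -> set of required file names
    node_caches: dict of node name -> set of cached file names
    Returns (matches, unmatched): matches maps job -> node, and unmatched
    lists jobs whose files would first have to be staged somewhere.
    """
    matches = {}
    unmatched = []
    for job, needed in jobs.items():
        # Pick the first node (in name order) that caches every needed file.
        node = next(
            (name for name, cached in sorted(node_caches.items())
             if needed <= cached),
            None,
        )
        if node is None:
            unmatched.append(job)
        else:
            matches[job] = node
    return matches, unmatched

# Made-up example data
node_caches = {"node1": {"a.dat", "b.dat"}, "node2": {"c.dat"}}
jobs = {"job1": {"a.dat"}, "job2": {"c.dat"}, "job3": {"d.dat"}}

matches, unmatched = match_jobs_to_nodes(jobs, node_caches)
print(matches, unmatched)
```

The unmatched list corresponds to the case where missing files have to be requested before the job can run.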
3rd party contributions to Condor
• High availability features (Technion Institute).
• Privilege separation in Condor (Univ. of Cambridge).
• Optimizing Condor throughput (CORE Feature Animation).
• Web interface to Condor (University College London).
Current Condor Pool
[Diagram: one Central Manager (Collector, Negotiator) with multiple Startd and Schedd nodes]
Highly Available Condor Pool
[Diagram: a Highly Available Central Manager made up of one Active and two Idle Central Managers, with multiple Startd and Schedd nodes]
Highly Available Central Manager
• Our solution - Highly Available Central Manager
– Automatic failure detection
– Transparent failover to backup matchmaker (no global configuration change for the pool entities)
– “Split brain” reconciliation after network partitions
– State replication between active and backups
– No changes to Negotiator/Collector code
What is privilege separation?
• Isolation of those parts of the code that run at different privilege levels
• No privilege separation:
[Diagram: root, Condor daemons, Condor job]
• Privilege separation:
[Diagram: root, Condor daemons, Condor job]
Throughput Optimization (CORE Feature Animation)
Performance Before => After:
• Removed Groups: 6 => 5.5 min
• Significant Attributes: 5.5 => 3 min
• Schedd Algorithm: 3 => 1.5 min
• Separate Servers: 1.5 => 0.6 min
• Cycle delay: 0.6 => 0.33 min
• Server Loads: <1 Middleware, <2 Central Manager
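As a quick check of what the before/after numbers add up to, the following sketch computes the speedup of each step and the cumulative speedup from the first "before" value to the last "after" value. The figures are copied from the list above; the variable and function names are only illustrative.

```python
# (optimization step, minutes before, minutes after), taken from the slide
STEPS = [
    ("Removed Groups", 6.0, 5.5),
    ("Significant Attributes", 5.5, 3.0),
    ("Schedd Algorithm", 3.0, 1.5),
    ("Separate Servers", 1.5, 0.6),
    ("Cycle delay", 0.6, 0.33),
]

def speedup(before, after):
    """Ratio of the time before a change to the time after it."""
    return before / after

def overall_speedup(steps):
    """Speedup from the first 'before' value to the last 'after' value."""
    return steps[0][1] / steps[-1][2]

for name, before, after in STEPS:
    print(f"{name}: {speedup(before, after):.2f}x")
print(f"Overall: {overall_speedup(STEPS):.1f}x")
```

Taken together, the steps reduce the reported time from 6 minutes to 0.33 minutes, roughly an 18-fold improvement.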
Web Service Interface to Condor
• Facilitate the development of third-party applications capable of interacting with Condor (remotely).
– E.g. build a higher-level, application-specific scheduler that submits jobs to multiple Condor pools based on application semantics
– These can be built using a wide range of languages/SOAP packages
– BirdBath has been tested on:
• Java (Apache Axis, XSUL)
• Python (ZSI)
• C# (.Net)
• C/C++ (gSOAP)
• Condor accessible from platforms where its command-line tools are not supported/installed
Condor Plans (by Todd Tannenbaum)
• Condor 6.8.0 (stable series) available in May 2005.
• Fail-over, persistence and other features.
• Improved scalability and accessibility (APIs, Grid middleware, Web-based interfaces, etc.).
• Grid universe and security improvements.
• Condor can now transfer job data files larger than 2 GB in size.
– On all platforms that support 64-bit file offsets
• Real-time spooling of stdout/err/in in any universe incl VANILLA
– Real-time monitoring of job progress
• Condor Installer on Win32 uses MSI (thanks Micron!)
• condor_transfer_data (DZero)
• STARTD_VM_EXPRS (INFN)
• condor_vacate_job tool
• condor_status -negotiator
BAM! More tasty Condor goodness!
And More…
• New startd policy expression MaxJobRetirementTime.
– specifies the maximum amount of time (in seconds) that the startd is willing to wait for a job to finish on its own when the startd needs to preempt the job
• -peaceful option to condor_off, condor_restart
• noop_job = True
• Preliminary support for the Tool Daemon Protocol (TDP)
– TDP goal is to provide a generic way for scheduling systems (daemons) to interact with monitoring tools.
– Users can specify a "tool" that should be spawned alongside their regular Condor job.
– On Linux, ability to allow a monitoring tool to attach with ptrace() before the job's main() function is called.
Hey Jobs! We’re watching you!
• condor_starter enforces limits
– Starter is already monitoring many job characteristics (image size, cpu usage, etc.)
– Threshold expressions
  • Use more resources than you said you would, and BAM!
• Local Universe
– Just like Scheduler Universe, but there is a condor_starter
– All advantages of the starter
[Diagram: Submit side: schedd, starter, job. Execute side: startd, starter, job.]
Hey, job, behave or else!
ClassAd Improvements in Condor!
• Conditionals
– IfThenElse(condition,then,else)
• String functions
– strcat(), strcmp(), toUpper(), etc.
• StringList functions
– Example of a “string list” (CSV style)
  • Mylist = “Joe, Jon, Jeff, Jim, Jake”
– StrListContains(), StrListAppend(), StrListRemove(), etc.
• Others
– Type test, some math functions
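To show the kind of behavior these additions suggest, here is a small Python emulation of a conditional and of the CSV-style string list operations. It is only a sketch of the intended semantics as described above; the Python function names are made up, and the real ClassAd functions may differ in argument order, case sensitivity, and handling of undefined values.

```python
def if_then_else(condition, then_value, else_value):
    """Emulate a conditional: pick one of two values."""
    return then_value if condition else else_value

def str_list_items(string_list):
    """Split a comma-separated 'string list' into trimmed items."""
    return [item.strip() for item in string_list.split(",") if item.strip()]

def str_list_contains(string_list, item):
    """True if item is one of the entries in the string list."""
    return item in str_list_items(string_list)

def str_list_append(string_list, item):
    """Return a new string list with item added at the end."""
    return ", ".join(str_list_items(string_list) + [item])

def str_list_remove(string_list, item):
    """Return a new string list with every occurrence of item removed."""
    return ", ".join(i for i in str_list_items(string_list) if i != item)

my_list = "Joe, Jon, Jeff, Jim, Jake"
print(str_list_contains(my_list, "Jeff"))
print(str_list_append(my_list, "Jan"))
print(str_list_remove(my_list, "Jon"))
print(if_then_else(str_list_contains(my_list, "Bob"), "member", "guest"))
```

A typical use would be a policy expression that treats a job differently depending on whether its owner appears in such a list.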
Accounting Groups and Group Quota Support
• Account Group (w/ CORE Feature Animation)
• Account Group Quota (inspiration CDF @ Fermi)
– Sample Problem: Cluster w/ 500 nodes, Chemistry Dept purchased 100 of them, Chemistry users must always be able to use them
– Could use Machine Rank…
  • but this ties to specific machines
– Or could use new group support
  • Each group can be given a quota in config file
  • Job ads can specify group membership
  • Group quotas are satisfied first
  • Accounting by user and by group
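The "group quotas are satisfied first" rule can be illustrated with the 500-node sample problem above. The sketch below is a simplified two-pass allocation under stated assumptions: quotas never exceed the cluster size, and leftover nodes are handed out in the order groups are listed. The actual negotiator applies user priorities and fair share, which are not modeled here, and the group names and demand figures are made up.

```python
def allocate(total_nodes, quotas, demand):
    """Satisfy group quotas first, then hand out leftover nodes.

    quotas: dict of group -> guaranteed number of nodes
    demand: dict of group -> number of nodes currently wanted
    Returns a dict of group -> nodes allocated.
    """
    if sum(quotas.values()) > total_nodes:
        raise ValueError("quotas exceed the number of nodes")

    # Pass 1: every group gets up to its quota.
    allocation = {
        group: min(wanted, quotas.get(group, 0))
        for group, wanted in demand.items()
    }
    free = total_nodes - sum(allocation.values())

    # Pass 2: leftover nodes go to groups that still want more (assumed order).
    for group, wanted in demand.items():
        extra = min(wanted - allocation[group], free)
        allocation[group] += extra
        free -= extra
    return allocation

# Sample problem: 500 nodes, Chemistry is guaranteed 100 of them.
result = allocate(
    total_nodes=500,
    quotas={"chemistry": 100},
    demand={"physics": 600, "chemistry": 150},
)
print(result)
```

Unlike a Machine Rank approach, the guarantee here is a count of nodes rather than a claim on specific machines.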
Improved Scalability
• Much faster negotiation
– SIGNIFICANT_ATTRIBUTES determined automatically
– Schedd uses non-blocking TCP connects to the startd
– Negotiator caching
– Collector Forks for queries
– More…
What’s brewing for after v6.8.0?
• More data, data, data
– Stork distributed w/ v6.8.0, incl DAGMan support
– NeST manage Condor spool files, ckpt servers
– Stork used for Condor job data transfers
• Virtual Machines (and the future of Standard Universe)
• Condor and Shibboleth (with Georgetown Univ)
• Least Privilege Security Access (with U of Cambridge)
• Dynamic Temporary Accounts (with EGEE, Argonne)
• Leverage Database Technology (with UW DB group)
• ‘Automatic’ Glideins (NMI Nanohub – Purdue, U of Florida)
• Easier Updates
• New ClassAds (integration with Optena)
• Hierarchical Matchmaking
Can I commit this to CVS??