Integrated Workload Management for Beowulf Clusters
Bill DeSalvo – April 14, 2004
© Platform Computing Inc. 2003
What We’ll Cover
Platform LSF Family of Products
What is Platform LSF HPC
Key Features & Benefits
How it Works
Q&A
What is the Platform LSF Family of Products?
What Problems Are We Solving?
Solve large, grand challenge, complex problems by optimizing the placement of workload in High Performance Computing environments
Platform LSF HPC
Intelligent, policy-driven high performance computing (HPC) workload processing
Parallel & sequential batch workload management for High Performance Computing (HPC)
Includes patent-pending topology-based scheduling
Intelligently schedules parallel batch jobs
Virtualizes resources
Prioritizes service levels based on policies
Based on Platform LSF:
Standards-based, OGSI-compliant, grid-enabled solution
Commercial production quality product
Platform Customers
Platform LSF HPC
Platform LSF HPC for AlphaServer SC
Platform LSF HPC for IBM
Platform LSF HPC for Linux
Platform LSF HPC for SGI
Platform LSF HPC for Cray
Extensive Hardware Support
HP
HP AlphaServer SC
HP XC
HP Superdome
HP-UX 11i
SGI
SGI IRIX
SGI TRIX
SGI Altix, SGI Propack
IBM
IBM RS/6000 AIX
IBM SP2/SP3
Linux
IA-64 systems with RedHat
Intel, AMD 32-bit systems with Linux kernel
Sun
SUN Solaris
High Performance Interconnects
Myrinet with GM
Quadrics QsNet
SGI NUMAflex, SGI NUMAlink
IBM SP Switch
Platform LSF HPC – Linux Support
HP
HP XC Systems running Unlimited Linux
HP Itanium 2 systems running Linux 2.4.x kernel, glibc 2.2 with RMS on Quadrics QsNet/Elan3
HP Alpha/AXP systems running Linux 2.4.x kernel, glibc 2.2.x with RMS on Quadrics QsNet/Elan3
Linux
IA-64 systems, Kernel 2.4.x, compiled with glibc 2.2.x, tested on RedHat 7.3
x86 systems:
Kernel 2.2.x, compiled with glibc 2.1.x, tested on Debian 2.2, OpenLinux 2.4, RedHat 6.2 and 7.0, SuSE 6.4 and 7.0, TurboLinux 6.1
Kernel 2.4.x, compiled with glibc 2.1.x, tested on RedHat 7.x and 8.0, SuSE 7.0, and RedHat Linux Advanced Server 2.1
Clustermatic Linux 3.0 Kernel 2.4.x, compiled with glibc 2.2.x, tested on RedHat 8.0
Scyld Linux, Kernel 2.4.x, compiled with glibc 2.2.x.
SGI
SGI Altix systems running Linux Kernel 2.4.x compiled with glibc 2.2.x and SGI Propack 2.2 and higher
Key Features and Benefits Platform LSF HPC
Key Features
Optimized Application, System and Hardware Performance
Enhanced Accounting, Auditing & Control
Commercial Grade System Scalability & Reliability
Extensive Hardware Support
Comprehensive, Extensible and Standards-based Security
Key Features – Platform LSF HPC
Optimized Application, System and Hardware Performance
Enhanced Accounting, Auditing & Control
Commercial Grade System Scalability & Reliability
Comprehensive, Extensible and Standards-based Security
Adaptive Interconnect Performance Optimization
Scheduling that takes advantage of unique interconnect properties
IBM SP Switch at the POE software level
RMS on AlphaServer SC (Quadrics)
SGI topology hardware graph
Out-of-the-box functionality without any customization required
Generic Parallel Job Launcher
Generic support for all different types of Parallel Job Launchers
LAM/MPI, MPICH-GM, MPICH-P4, POE, SCALI, CHAMPION PRO, etc.
Customizable for any vendor or publicly available parallel solution
Control - ensuring no jobs can escape the workload management system
Integrated out-of-the-box Parallel Launcher Support
Full integration with IRIX MPI and array session daemon
Full integration with SGI MPI for Linux
Full integration with Sun HPC ClusterTools, providing full MPI control, accounting and integration with Sun's Prism debugger
Vendor MPI libraries provide better performance than open source libraries
Vendor MPI library full support
Vendor integration supported by Platform
Seamless control and accounting
HPC Workload Scheduling
Dynamic load balancing supporting heterogeneous workloads
IBM SP switch aware scheduling
Scheduling of parallel jobs
Number of CPUs, min/max, node span
Backfill on processor & memory
Processor & memory reservation
Topology aware scheduling
Exclusive scheduling
Advance Reservation
Fairshare, Preemption
Accounting
High Performing, Open, Scalable Architecture
Scalable scheduler architecture
Modularized, support for over 500,000 active jobs per cluster
More than 2,000 multi-processor hosts per cluster
Process 5x more work & achieve 100% utilization
Scale with business growth
External executable support
Collect information from multiple external resources to track site specific local and global resources
Extends out-of-the-box capabilities to manage additional resources and customer application execution
Differentiation
Multiple vs single external resource collector
Job Groups
Organize jobs into higher level work units - hierarchical tree
Easy to manage and control work to increase user productivity by reducing complexity
OGSI compliance
Future-proof & protect grid investment using standards-based solutions, interoperate with third-party systems
Intelligent Scheduling Policies
Fairshare (User & Project-based)
Ensure job resources are used for the right work
Guarantees that resource allocations among users and projects are met
Co-ordinate access to the right number of resources for different users and projects according to pre-defined shares
Differentiation
Hierarchical & guaranteed
Policy-based Preemption
Maximizes throughput of high priority critical work based on priority and load conditions
Prevents starvation of lower priority work
Differentiation
Platform LSF supports multiple preemption policies
Goal-oriented SLA driven policies
Based on customer SLA driven goals: Deadline, Velocity, Throughput
Guarantees projects are completed on time
Reduces projects and administration costs
Provides visibility into the progress of projects
Allows the admin to focus on “what work and when” needs to be done, not “how” the resources are to be allocated
[Diagram: Intelligent Scheduler modules: Fairshare, Preemption, Resource Reservation, Advance Reservation, SLA Scheduling (Service Level Agreement), MultiCluster, License Scheduling, Plugin Schedulers, Other Scheduling Modules]
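The share-based idea behind the fairshare policy on this slide can be sketched in a few lines. This is an illustration only, not Platform LSF's implementation: the priority rule (configured shares divided by recent usage plus one) and every name below are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """A user or project competing for resources (hypothetical structure)."""
    name: str
    shares: int            # configured share of the cluster
    used_cpu_hours: float  # recent consumption


def dynamic_priority(acct: Account) -> float:
    """Assumed rule: more shares raise priority, more recent usage lowers it."""
    return acct.shares / (1.0 + acct.used_cpu_hours)


def next_to_run(accounts):
    """Pick the account whose pending work should be dispatched next."""
    return max(accounts, key=dynamic_priority)


accounts = [
    Account("projA", shares=60, used_cpu_hours=90.0),
    Account("projB", shares=40, used_cpu_hours=10.0),
]
print(next_to_run(accounts).name)  # projB: fewer shares, but far less recent usage
```

Because priority falls as an account consumes resources, a heavy user cannot hold the cluster indefinitely even with a larger share.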
Advanced Self-Management
Flexible, Comprehensive Resource Definitions
Resources defined on a node basis across an entire cluster or subset of the nodes in a cluster
Auto-detectable or user defined resources
Adaptive membership – nodes join and leave Platform LSF clusters dynamically and automatically without administration effort
Dynamic or static resources
Job Level Exception Management
Exception-based error detection to take automatic, configurable, corrective actions
Increased job reliability & predictability
Improved visibility on job and system errors & reduced administration overhead and costs
Automatic Job Migration and Requeue
Automatically migrate and requeue jobs based on policies in the event of host or network failures
Reduce user and administrator overhead in managing failures & reduce risk of running critical workloads
Master Scheduler Failover
Automatically fail over to another host if the master host is unavailable
Continuous scheduling service and execution of jobs & eliminate manual intervention
Backfill
Policy configured at the queue level and applies to all jobs in a queue
Smaller sequential jobs are ‘backfilled’ behind larger parallel jobs
Improves hardware utilization
Users provided with an accurate time when their job will start
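A minimal sketch of the backfill decision, assuming the scheduler knows each job's run limit and the time at which the larger parallel job is reserved to start. The function name, parameters and time unit (minutes) are hypothetical and are not taken from Platform LSF.

```python
def can_backfill(now, run_limit, job_slots, free_slots, reserved_start):
    """Return True if a small job can use idle slots without delaying
    the larger job that holds a reservation.

    All times are minutes on a common clock (an assumption of this sketch).
    """
    fits = job_slots <= free_slots
    finishes_in_time = now + run_limit <= reserved_start
    return fits and finishes_in_time


# A parallel job is reserved to start at t=60; 4 slots are idle until then.
print(can_backfill(now=0, run_limit=30, job_slots=1, free_slots=4, reserved_start=60))  # True
print(can_backfill(now=0, run_limit=90, job_slots=1, free_slots=4, reserved_start=60))  # False
```

The same reserved start time is what allows the owner of the large job to be given a start estimate.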
Key New Features & Benefits Platform LSF V6.0
Feature Overview
OGSI Compliance
Goal-Oriented SLA-Driven Scheduling
License-Aware Scheduling
Job-Level Exception Management (Self Management Enhancement)
Job Group Support
Other Scheduling Enhancements
Queue-Based Fairshare
User Fairshare by Queue Priority
Job Starvation Prevention plug-in
Feature Overview (Cont.)
HPC Enhancements
Dynamic ptile Enforcement
Resource Requirement Specification for Advance Reservation
Thread Limit Enforcement
General Parallel Support
Parallel Job Size Scheduling
Job Limit Enhancements
Non-normalized Job Run Limit
Resource Allocation Limit Display
Administration and Diagnostics
Scheduler Dynamic Debug
Administrator Action Messages
Goal-Oriented SLA-Driven Scheduling
What is it?
A new scheduling policy.
Unlike current scheduling policies based on configured shares or limits, SLA-driven scheduling is based on customer provided goals:
Deadline based goal: Specify the deadline for a group of jobs.
Velocity based goal: Specify the number of jobs running at any one time.
Throughput based goal: Specify the number of finished jobs per hour.
This scheduling policy works on top of queues and host partitions.
Benefits
Guarantees projects are completed on time according to explicit SLA definitions.
Provides visibility into the progress of projects to see how well projects are tracking to SLAs
Allows the admin to focus on “what work and when” needs to be done, not “how” the resources are to be allocated.
Guarantees service level deliveries to the user community, reduces the risks of projects and administration cost.
Use case
Problem: We need to finish all simulation jobs before 15:00.
Solution: Configure a deadline service class in the lsb.serviceclasses file.
Begin ServiceClass
NAME=simulation
PRIORITY=100
GOALS = [deadline timeWindow (13:00-15:00)]
DESCRIPTION = A simple deadline demo
End ServiceClass
Submitting and monitoring jobs
$ bsub -sla simulation -W 10 -J "A[1-50]" mySimulation
$ date; bsla
Wed Aug 20 14:00:16 EDT 2003
SERVICE_CLASS_NAME: simulation
GOAL: DEADLINE ACTIVE_WINDOW: (13:00-15:00)
STATUS: Active:Ontime
DEAD_LINE: (Wed Aug 20 15:00)
ESTIMATED_FINISH_TIME: (Wed Aug 20 14:30)
Optimum Number of Running Jobs: 5
NJOBS   PEND   RUN   SSUSP   USUSP   FINISH
50      25     5     0       0       20
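One way to think about a deadline goal is as a lower bound on concurrency: how many jobs must run at once so that the unfinished work fits into the time left. The sketch below is an assumption for illustration, not the algorithm Platform LSF uses. It treats the `-W 10` run limit as each job's run time and rounds the snapshot time to 14:00.

```python
import math
from datetime import datetime


def min_running_jobs(unfinished_jobs, run_minutes, now, deadline):
    """Smallest number of concurrent jobs that completes the work by the deadline.

    Hypothetical helper: assumes every job takes exactly run_minutes.
    """
    minutes_left = (deadline - now).total_seconds() / 60
    if minutes_left <= 0:
        raise ValueError("deadline has already passed")
    return math.ceil(unfinished_jobs * run_minutes / minutes_left)


# 25 pending + 5 running jobs, 10 minutes each, one hour before the deadline.
needed = min_running_jobs(
    unfinished_jobs=30,
    run_minutes=10,
    now=datetime(2003, 8, 20, 14, 0),
    deadline=datetime(2003, 8, 20, 15, 0),
)
print(needed)  # 5
```

Under these assumptions the bound agrees with the "Optimum Number of Running Jobs: 5" line in the bsla output above.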
Job-Level Exception Management (Self Management Enhancement)
What is it?
Platform LSF can monitor jobs and hosts for exception conditions and take action accordingly.
Benefits
Increased reliability of job execution
Improved visibility on job and system errors
Reduced administration overhead and costs
How it works
Platform LSF V6 handles the following exceptions:
“Job eating” machine (or “black-hole” machine): for some reason, jobs keep exiting abnormally on a machine (e.g. no processes, mount daemon dies, etc.)
Job underrun (job run time less than configured minimum time)
Job overrun (job run time more than configured maximum time)
Job idle (job runs without its CPU usage increasing).
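The three job-level exceptions listed above amount to simple threshold checks. The following sketch is illustrative only: the function, its parameters and the choice to evaluate underrun only for finished jobs are assumptions, not Platform LSF behavior.

```python
def job_exceptions(run_minutes, cpu_seconds_delta, finished,
                   underrun_min=None, overrun_max=None):
    """Return the exception names that apply to one job.

    run_minutes        elapsed run time of the job
    cpu_seconds_delta  CPU time consumed since the previous check
    finished           True once the job has exited
    underrun_min / overrun_max  configured thresholds in minutes (None = not set)
    """
    found = []
    if finished and underrun_min is not None and run_minutes < underrun_min:
        found.append("underrun")
    if not finished and overrun_max is not None and run_minutes > overrun_max:
        found.append("overrun")
    if not finished and cpu_seconds_delta == 0:
        found.append("idle")
    return found


print(job_exceptions(200, 12.0, finished=False, overrun_max=180))  # ['overrun']
print(job_exceptions(1, 0.5, finished=True, underrun_min=2))       # ['underrun']
print(job_exceptions(30, 0.0, finished=False))                     # ['idle']
```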
Job-Level Exception Management (Self Management Enhancement) (Cont.)
Use Case 1:
Requirement: If more than 30 jobs have exited on a host in the past 5 minutes, I want LSF to close that machine, then notify me and tell me the machine name.
Solution:
Configure host exceptions (EXIT_RATE in lsb.hosts).
Begin Host
HOST_NAME MXJ EXIT_RATE # Keywords
Default ! 6
End Host
Configure JOB_EXIT_RATE_DURATION = 5 in lsb.params (the default value is 10 minutes).
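The numbers in this use case suggest that EXIT_RATE is read as exits per minute, averaged over the configured duration (6 per minute over 5 minutes gives 30 exits). The sliding-window monitor below is a sketch under that assumption. The class and its methods are hypothetical and do not reflect how Platform LSF tracks exits internally.

```python
from collections import deque


class ExitRateMonitor:
    """Flags a host once exits in the window exceed rate * duration."""

    def __init__(self, exit_rate_per_minute, duration_minutes):
        self.limit = exit_rate_per_minute * duration_minutes
        self.window_seconds = duration_minutes * 60
        self.exits = deque()  # timestamps (seconds) of recent job exits

    def record_exit(self, t):
        """Record a job exit at time t; return True if the host should be closed."""
        self.exits.append(t)
        # Drop exits that have fallen out of the sliding window.
        while self.exits and self.exits[0] <= t - self.window_seconds:
            self.exits.popleft()
        return len(self.exits) > self.limit


monitor = ExitRateMonitor(exit_rate_per_minute=6, duration_minutes=5)
# 31 exits, one every 5 seconds: the 31st pushes the host over the 30-exit limit.
results = [monitor.record_exit(t) for t in range(0, 155, 5)]
print(results.index(True) + 1)  # 31
```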
Job-Level Exception Management (Self Management Enhancement) (Cont.)
Use Case 2:
Requirement: If any job runs more than 3 hours, I want LSF to notify me and tell me the jobID.
Solution:
Configure job exceptions (lsb.queues)
Begin Queue
…
JOB_OVERRUN = 3*60 # run time in minutes
End Queue
Job Starvation Prevention Plug-in
What is it?
An external scheduler plug-in that lets users define their own equation for job priority
Benefits
Low priority work is guaranteed to run after ‘waiting’ for a specified time ensuring that the job does not wait forever (i.e. starvation).
How it works
By default, the scheduler provides the following calculation
Job priority = A * q_priority * MIN(1, int(wait_time / T0)) * (B * requested_processors + MAX(C * wait_time * (1 + 1/run_time), D) + E * requested_memory)
where A, B, C, D and E are coefficients and T0 is the grace period. By default, run_time is infinite.
Admin can define different coefficients for each queue with the following format:
MANDATORY_EXTSCHED=JOBWEIGHT[A=val1; B=val2; …]
Job Starvation Prevention Plug-in
Use Case:
Requirement: Jobs in the lowest-priority queue should wait no more than 10 hours.
Solution: Suppose the highest-priority queue has PRIORITY = 100 and the lowest-priority queue has PRIORITY = 20. Configure the following in the lowest-priority queue:
MANDATORY_EXTSCHED=JOBWEIGHT[A=1;B=0;C=10;D=1;E=0;T0=0.1]
After waiting 10 hours, a job in the lowest-priority queue will have higher priority than jobs in the highest-priority queue.
Note: The formula for calculating job weight is open source and customers can customize it.
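The default job-weight formula can be written out directly to see how the coefficients interact. The sketch below transcribes the formula from the previous slide. The function name is hypothetical, and measuring wait_time and T0 in hours is an assumption, since the slides do not state units.

```python
import math


def job_priority(q_priority, wait_time, requested_processors, requested_memory,
                 A, B, C, D, E, T0, run_time=math.inf):
    """Job weight per the slide's formula; run_time defaults to infinity."""
    # The grace factor is 0 until the job has waited at least T0, then 1.
    grace = min(1, int(wait_time / T0))
    weight = (B * requested_processors
              + max(C * wait_time * (1 + 1 / run_time), D)
              + E * requested_memory)
    return A * q_priority * grace * weight


# Coefficients from the use case, applied to a job in the PRIORITY = 20 queue.
coeffs = dict(A=1, B=0, C=10, D=1, E=0, T0=0.1)
print(job_priority(20, 10, 4, 0, **coeffs))    # 2000.0 after a 10-hour wait
print(job_priority(20, 0.05, 4, 0, **coeffs))  # 0.0 while still inside the grace period
```

With these coefficients the weight grows linearly with wait time, which is what lets a long-waiting job eventually outrank newer work. How it compares with a job in the highest-priority queue depends on that queue's coefficients, which the slide does not give.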
Resource Requirement Specification For Advance Reservation
What is it?
Enables users to select hosts for an advance reservation based on resource requirements.
Benefit
More flexibility in reserving host slots for mission-critical jobs.
How it works
The brsvadd command supports a select string:
brsvadd -R "select[type==LINUX]" -n 4 -u xwei -b 10:00 -e 12:00
Key Features – Platform LSF HPC
Enhanced Accounting, Auditing & Control
Optimized Application, System and Hardware Performance
Commercial Grade System Scalability & Reliability
Comprehensive, Extensible and Standards-based Security
Job Termination Reasons
Accounting log with detailed audit & error information for every job in the system
Indicates why a job was terminated
Distinguishes between an abnormal termination and a termination caused by Platform LSF HPC
Key Features – Platform LSF HPC
Optimized Application, System and Hardware Performance
Enhanced Accounting, Auditing & Control
Comprehensive, Extensible and Standards-based Security
Commercial Grade System Scalability & Reliability
Enterprise Proven
Running on several of the top 10 supercomputers in the world on the “TOP500” (#2,4,5,6)
More than 250,000 licenses in use spanning 1,500 customer sites
Scales to over 100 clusters, 200,000 CPUs and 500,000 active jobs per cluster
11+ years experience in distributed & grid computing
Risk free investment – proven solution
Commercial production quality
Key Features – Platform LSF HPC
Optimized Application, System and Hardware Performance
Enhanced Accounting, Auditing & Control
Commercial Grade System Scalability & Reliability
Comprehensive, Extensible and Standards-based Security
Comprehensive, Extensible, Standards-based Security
Scalable scheduler architecture
Multiple scheduler plug-in API support
External executable support
Web GUI
Open source components
Risk free investment – proven solution
Commercial grade
Scalability and flexibility as a business grows
How It Works Platform LSF HPC
Fault Tolerance via Master Election
[Diagram: Host 1 runs the master LIM, sbd, mbd and mbsched; Host i through Host N each run a slave LIM and sbd. Labels: “Am I master?”, “master announcement”, “exchange load info”]
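A master election of this kind can be reduced to a deterministic rule that every host evaluates on its own. The sketch below assumes an ordered list of master candidates in which the first reachable host wins. That rule, the function name and the host names are assumptions for illustration, since the slide does not spell out the election criteria.

```python
def elect_master(candidates, alive):
    """Return the first candidate host that is currently reachable, or None.

    candidates  ordered list of hosts allowed to become master
    alive       set of hosts that answered the latest load-info exchange
    """
    for host in candidates:
        if host in alive:
            return host
    return None


candidates = ["host1", "host2", "host3"]
print(elect_master(candidates, {"host1", "host2", "host3"}))  # host1
print(elect_master(candidates, {"host2", "host3"}))           # host2 takes over
```

Because each host applies the same rule to the same ordered list, the "Am I master?" question has a single answer without a central coordinator.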
Virtual Server Technology
LIM: Collects & centralizes status of all resources in cluster
RES: Transparent remote task execution
[Diagram labels: Master LIM; slave LIMs; ELIM; RES daemons; Cluster APIs. Load information: free memory, idle time, disk I/O rate, free swap space, number of CPUs, host status, custom status. Also shown: System Monitor, Workload Management, Admin Tools]
Executing Work
Chooses the best available resource to process the job
[Diagram labels: Clients; Jobs (Gaussian Distribution Job, Computational Chemistry Job, Protein Modeling Job, BLAST Sequence Job); MBD; master LIM; slave LIMs; ELIM; SBDs]
Grid-enabled, Scalable Architecture
Open, modular plug-in schedulers scale with the growth of your business
Scheduler Framework
The framework hides the complexity of interacting with core services.
The Resource Broker is responsible for collecting resource information from other core services.
Minimize the inter-dependencies between scheduling policies
Maximize extensibility through the plug-in scheduler module stack
[Diagram: Scheduler Framework, Scheduler Modules, Resource Broker]
The Four Scheduling Phases
1. Pre-Processing
2. Matching / Limits
3. Order / Allocation
4. Post-Processing
Pre-Selected Jobs
Scheduling Decisions/Job Control Decisions
Localized setup
• Prioritize jobs and allocate resources
• Match eligible resources to nodes
• Allocation adjustments
Multiple Scheduling Modules
[Diagram: the Internal Module and Add-on Modules 1 through N each implement the same four phases: Pre-Processing, Matching / Limits, Order / Allocation, Post-Processing]
• Vendor specific matching policies (without changing the existing scheduler)
• Support for external scheduler
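The module stack can be pictured as a list of objects that all expose the same four phase hooks, with the framework calling each hook on every module in turn. The sketch below is a toy model of that idea. The class names, hook names and the dictionary-based job and host records are all hypothetical and are not the Platform LSF plug-in API.

```python
class SchedulerModule:
    """Base module: every phase defaults to a no-op."""
    def pre_process(self, jobs, hosts): pass
    def match(self, job, hosts): return hosts
    def order(self, jobs): return jobs
    def post_process(self, decisions): return decisions


class InternalModule(SchedulerModule):
    def match(self, job, hosts):   # slot limits
        return [h for h in hosts if h["free_slots"] >= job["slots"]]
    def order(self, jobs):         # highest priority first
        return sorted(jobs, key=lambda j: -j["priority"])


class LinuxOnlyModule(SchedulerModule):
    """Add-on matching policy; the internal module is left unchanged."""
    def match(self, job, hosts):
        return [h for h in hosts if h["type"] == "LINUX"]


def schedule(jobs, hosts, modules):
    for m in modules:                       # 1. Pre-Processing
        m.pre_process(jobs, hosts)
    eligible = {}
    for job in jobs:                        # 2. Matching / Limits
        candidates = list(hosts)
        for m in modules:
            candidates = m.match(job, candidates)
        eligible[job["id"]] = candidates
    ordered = list(jobs)                    # 3. Order / Allocation
    for m in modules:
        ordered = m.order(ordered)
    decisions = []
    for job in ordered:
        for host in eligible[job["id"]]:
            if host["free_slots"] >= job["slots"]:
                host["free_slots"] -= job["slots"]
                decisions.append((job["id"], host["name"]))
                break
    for m in modules:                       # 4. Post-Processing
        decisions = m.post_process(decisions)
    return decisions


hosts = [{"name": "sun1", "type": "SOLARIS", "free_slots": 4},
         {"name": "lin1", "type": "LINUX", "free_slots": 2}]
jobs = [{"id": 1, "slots": 2, "priority": 10},
        {"id": 2, "slots": 2, "priority": 50}]
decisions = schedule(jobs, hosts, [InternalModule(), LinuxOnlyModule()])
print(decisions)  # [(2, 'lin1')]
```

Job 2 is placed first because of its higher priority. Job 1 stays pending because the add-on module restricts it to the Linux host, which is then full.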
Maui Integration
[Diagram labels: MBD; SCH_FM; MAUI Plugin; Event Handle (wait until GO event); MAUI Scheduler; RMGetInfo; Pre-processing; Order jobs; Post-Processing; UIProcessClients; QueueScheduleSJobs; QueueScheduleRJobs; QueueScheduleIJobs; QueueBackFill; “Job, Host, Res Info”; “Decisions and ack”; “Sync”]
Linux-specific Solutions
Controlling an MPI job
On a distributed system (Linux cluster) there are many problems to address:
1. Job launch across multiple nodes
2. Gather resource usage while job executes
3. Propagate signals
4. Job “clean-up” to eliminate “dangling” MPI processes
5. Comprehensive job accounting
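Several of the problems above (launch, signal propagation, clean-up and accounting) can be shown in miniature on a single POSIX host. The sketch below is a local stand-in, not a cluster launcher: it starts each task in its own session so that a signal sent to the task's process group also reaches any processes the task spawned. All function names are hypothetical. Launching on remote nodes, which the workload manager's own daemons would handle, is out of scope here.

```python
import os
import resource
import signal
import subprocess
import sys


def launch(n_tasks, argv):
    """Start n_tasks copies of argv, each as the leader of a new session."""
    return [subprocess.Popen(argv, start_new_session=True) for _ in range(n_tasks)]


def propagate(tasks, sig):
    """Deliver sig to the whole process group of every task still running."""
    for task in tasks:
        if task.poll() is None:
            try:
                os.killpg(task.pid, sig)  # pgid equals pid for a session leader
            except ProcessLookupError:
                pass  # the task exited between poll() and killpg()


def clean_up(tasks, grace_seconds=5.0):
    """Wait for every task; force-kill groups that outlive the grace period."""
    codes = []
    for task in tasks:
        try:
            codes.append(task.wait(timeout=grace_seconds))
        except subprocess.TimeoutExpired:
            os.killpg(task.pid, signal.SIGKILL)
            codes.append(task.wait())
    return codes


tasks = launch(2, [sys.executable, "-c", "import time; time.sleep(60)"])
propagate(tasks, signal.SIGTERM)
exit_codes = clean_up(tasks)
usage = resource.getrusage(resource.RUSAGE_CHILDREN)  # accounting for reaped children
print(exit_codes, usage.ru_utime + usage.ru_stime)
```

A negative exit code means the task was terminated by that signal number, which is the kind of detail a job accounting record needs to capture.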
“Traditional” MPI sequence
[Diagram labels: submit; Resource manager; Job script; Job launcher; mpirun; a.out]
Platform LSF HPC for Linux - MPICH-GM
[Diagram labels: bsub; mbatchd; sbatchd; Job script; pam; mpirun; gmmpirun_wrapper; TS; res; PIM; a.out]
Platform LSF HPC for Linux/Myrinet - Generic PJL
[Diagram labels: Submission host; Master Host; Execution Host H1; H2; bsub; lsblib; master LIM; LIM; PIM; MBD; MBSCHD; SBD; SBD child; pam; PJL wrapper; PJL; Root res; TaskStarter; a.out: process 1; a.out: process 2. Queues: high, med, hpc_queue. Annotations: “Signals and rusage collection”, “Hostname & pid”]
Platform LSF HPC for Linux/Myrinet - MPICH_GM
[Diagram labels: Submission host; Master Host; Execution Host H1; H2; bsub; lsblib; esub (set LSF_PJL_TYPE to mpich_gm); elim (report resource availability); master LIM; LIM; PIM; MBD; MBSCHD; SBD; SBD child; pam; mpirun.lsf; gmmpirun_wrapper; mpirun.ch_gm; rsh; Root res; TaskStarter; a.out: process 1; a.out: process 2. Queues: high, med, hpc_queue. Annotations: “Signals and rusage collection”, “Hostname & pid”]
Platform LSF HPC for Linux/Myrinet - LAM/MPI
[Diagram labels: Submission host; Master Host; Execution Host H1; H2; bsub; lsblib; esub (set LSF_PJL_TYPE to lammpi); elim (report resource availability); master LIM; LIM; PIM; MBD; MBSCHD; SBD; SBD child; pam; mpirun.lsf; lammpirun_wrapper; mpirun; lamd; Root res; TaskStarter; a.out: process 1; a.out: process 2. Queues: high, med, hpc_linux. Annotations: “Signals and rusage collection”, “Hostname & pid”]
Platform LSF HPC for Linux/Myrinet - Scali MPI
[Diagram labels: Submission host; Master Host; Execution Host H1; H2; bsub; lsblib; master LIM; LIM; PIM; MBD; MBSCHD; SBD; SBD child; pam; Scali MPI wrapper; mpimon; mpid; mpisubmon; Root res; TaskStarter; a.out: process 1; a.out: process 2. Queues: high, med, low. Annotations: “Signals and rusage collection”, “Hostname & pid”]
Platform LSF HPC for Linux/QsNet
[Diagram labels: Submission host; Master Host; LSF Execution host / RMS node n0; Node n1; Node n2; bsub; lsblib; master LIM; LIM; PIM; MBD; MBSCHD; SBD; SBD child - exec() res; Res - rms_run(); RLA; RMS plugin; Job's Allocation; User Job. Queues: high, med, low]
Scyld Beowulf Integration
• Scyld Beowulf handles the systems management challenge effectively
  • No OS to distribute / synchronize
  • Central point of control from master
  • Single process space makes it appear as a large SMP
• Platform integrates with Scyld, treating the cluster as an SMP and allocating resources
  • Integrate with mpirun, mpprun or bpsh to start tasks
  • Collect resource usage from BProc
  • Collect load information via BProc APIs
  • Single user interface across Scyld & non-Scyld environments
Platform LSF HPC for Linux/BProc
[Diagram labels: Submission host; Master Host; BProc Front-end Node; Computing Nodes; H3; bsub; lsblib; esub (modify submission options); Job file; master LIM; LIM; PIM; MBD; MBSCHD; SBD; SBD child - exec() res; Res; bpsh/mpirun; allocated nodes; User Job Processes. Queues: high, med, low. Step markers: 1A, 1B, 1C, 2, 3, 4, 5, 6B, 6C]
More info at:
• www.platform.com/customers
• www.platform.com/barriers
Q & A