Introduction to IBM LoadLeveler Batch Scheduling System
J. Skovira, 5/05, v1



Agenda

- Batch Scheduling Basics
- LoadLeveler basics
- LoadLeveler configuration
- Basic Commands
  - Job submission
  - Job cancellation
  - Job monitoring
- Job command files
- Advanced Functions
- Questions and Answers


Who Needs a Job Scheduler?

Single Machine
- Job 1, Job 2, …, Job N
- OS multi-tasks single CPU: time-shared scheduling

HPC Machine
- Many machines and users: more jobs
- User 1: Job 1, Job 2, …, Job N
- User 2: Job 1, Job 2, …, Job N
- User 3: Job 1, Job 2, …, Job N
- Parallel dimension
- User may impact a distant job

Scheduler runs jobs according to:
- Scheduling theory
- Site-defined policy


Scheduling Terms

HPC Cluster

Resource manager

Scheduler

Start jobs on specific resources at specific times

Job Queue

Job 1, Job 2, Job 3, …

Batch Scheduler


More Tasks for User?

- Job Command File is a small set of job directives
- Job Command files can be “borrowed” from samples
- Simple Command files take predefined defaults
- Experienced users may enhance command files

Application Code

Job Meta Data

Once control is handed to the job, scheduler is out of the way


LoadLeveler Components

IBM Cluster:
- LoadLeveler Central Manager: Negotiator daemon
- Schedd machines
- Worker nodes: Startd daemon
- High Performance Switch


Priority and Scheduling

Jobs arrive:
- from different users
- at different times
- in different job classes
- with different priorities

Job    Nodes  Time
Job A      8     2
Job B     12     1
Job C     10     1
Job D      4     1
Job E      4     5

LoadLeveler sorts the job queue: Job E, Job A, Job C, Job D, Job B

LoadLeveler schedules the jobs in queue order
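The sorting step can be pictured with a small sketch. The priority values and arrival numbers below are invented for illustration (the slide does not say how the priorities are derived); the sketch only shows a queue being ordered by priority, with earlier arrival breaking ties, so that jobs can then be taken in queue order.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    time: int
    priority: int  # hypothetical value, not taken from the slide
    arrival: int   # hypothetical arrival sequence number


def sort_queue(jobs):
    """Order jobs by highest priority first; earlier arrival wins ties."""
    return sorted(jobs, key=lambda j: (-j.priority, j.arrival))


jobs = [
    Job("A", 8, 2, priority=80, arrival=1),
    Job("B", 12, 1, priority=20, arrival=2),
    Job("C", 10, 1, priority=50, arrival=3),
    Job("D", 4, 1, priority=50, arrival=4),
    Job("E", 4, 5, priority=90, arrival=5),
]
queue_order = [j.name for j in sort_queue(jobs)]
print(queue_order)
```

With these made-up priorities the queue comes out in the order E, A, C, D, B.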


Reservation vs Backfill

Reservation (standard) Scheduling
- Top job waits a short time for resources to free
- Defer if not available

Backfill
- Top job starts if it can
- If there are not enough resources, compute when they will become available and which resources the job will use
- Backfill jobs onto available nodes

Backfill is superior for parallel machines
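The backfill idea can be sketched in a few lines. This is a generic, simplified backfill loop written for illustration, not LoadLeveler's actual implementation; the function name `backfill_schedule`, the 16-node cluster size, and the use of requested time as exact runtime are all assumptions. The job list reuses the (nodes, time) values from the slides' job table.

```python
def backfill_schedule(jobs, total_nodes):
    """Return {name: start time} for jobs given as (name, nodes, runtime) in queue order."""
    if any(nodes > total_nodes for _, nodes, _ in jobs):
        raise ValueError("a job requests more nodes than the cluster has")
    queue = list(jobs)
    running = []  # (end_time, nodes) for jobs currently running
    starts = {}
    now = 0
    while queue:
        free = total_nodes - sum(n for _, n in running)
        # Top job starts if it can.
        while queue and queue[0][1] <= free:
            name, nodes, runtime = queue.pop(0)
            starts[name] = now
            running.append((now + runtime, nodes))
            free -= nodes
        if not queue:
            break
        # Top job is blocked: compute when enough nodes will be free for it.
        top_nodes = queue[0][1]
        avail = free
        for end, nodes in sorted(running):
            avail += nodes
            if avail >= top_nodes:
                shadow = end
                break
        # Nodes left over at that time after the top job takes its share.
        extra = free + sum(n for e, n in running if e <= shadow) - top_nodes
        # Backfill later jobs that fit now and do not delay the top job.
        for job in queue[1:]:
            name, nodes, runtime = job
            ends_in_time = now + runtime <= shadow
            if nodes <= free and (ends_in_time or nodes <= extra):
                starts[name] = now
                running.append((now + runtime, nodes))
                free -= nodes
                if not ends_in_time:
                    extra -= nodes
                queue.remove(job)
        # Advance the clock to the next job completion.
        now = min(end for end, _ in running)
        running = [(e, n) for e, n in running if e > now]
    return starts


# (name, nodes, time) from the slides' job table; the cluster size is assumed.
jobs = [("A", 8, 2), ("B", 12, 1), ("C", 10, 1), ("D", 4, 1), ("E", 4, 5)]
starts = backfill_schedule(jobs, total_nodes=16)
print(starts)
```

Under these assumptions, jobs D and E start at time 0 next to job A, and job B still starts at time 2, when job A's nodes free up.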


Backfill

Job Queue

Job    Nodes  Time
Job A      8     2
Job B     12     1
Job C     10     1
Job D      4     1
Job E      4     5


Job Command File Basics

Command file contains job “directives”

Basic items include:
- Shell
- Class
- Input/output directories
- Notification control
- Queue keyword

2 ways to specify job executable:
- Executable keyword
- Script invocation after the queue keyword

Application Code

Job Command File


Basic Job Command File

#!/bin/ksh
# @ class = demo
# @ queue
perlspin2 > /tmp


More Job Command File Keywords

Requirements allow you to select:
- I/O directives
- Node requirements
- Wallclock limit
- Locally defined requirements
- Etc…

notification: controls what LL sends about the job (from never to always)

notify_user: tells LL where to send job info (an email address)


Serial Job Command File

#!/bin/ksh
# @ error = ./out/job2.$(jobid).err
# @ output = ./out/job2.$(jobid).out
# @ wall_clock_limit = 180
# @ class = demo
# @ notification = complete
# @ notify_user = [email protected]
# @ queue
perlspin2


Communication on the System

Each node has a connection to the high-performance switch

There are 2 ways to use the switch:
- IP mode: "unlimited" channels, slower communication performance
- User space mode: limited number of channels, faster than IP mode

Can be selected in job command file


Parallel Job Command File Keywords

- node: How many nodes your job requires
- tasks_per_node: How many tasks will run on each node
- network: How your job will communicate
- wall_clock_limit: An estimate of how long your job runs


The Network Keyword

network.protocol = network_type, usage, mode

protocol: MPI, LAPI, PVM

network_type: sn_single or sn_all for switch adapter

usage: shared or not_shared

mode: IP, US

An example:

# @ network.MPI = sn_single, shared, us
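As a rough illustration of how the four fields combine, the sketch below assembles a network directive line and rejects values that are not listed on this slide. The helper name `network_directive` is hypothetical, and the allowed-value sets are copied from the slide only; a real installation may accept other values.

```python
# Allowed values as listed on the slide (an installation may support others).
PROTOCOLS = {"MPI", "LAPI", "PVM"}
NETWORK_TYPES = {"sn_single", "sn_all"}
USAGES = {"shared", "not_shared"}
MODES = {"IP", "US"}


def network_directive(protocol, network_type, usage, mode):
    """Build a '# @ network.<protocol> = type,usage,mode' job command file line."""
    if protocol not in PROTOCOLS:
        raise ValueError(f"unknown protocol: {protocol!r}")
    if network_type not in NETWORK_TYPES:
        raise ValueError(f"unknown network_type: {network_type!r}")
    if usage not in USAGES:
        raise ValueError(f"unknown usage: {usage!r}")
    # The slides write the mode both as "US" and "us", so compare case-insensitively.
    if mode.upper() not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return f"# @ network.{protocol} = {network_type},{usage},{mode}"


line = network_directive("MPI", "sn_single", "shared", "US")
print(line)
```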


Parallel Job Command File

#!/bin/ksh
# @ job_type = parallel
# @ node = 1
# @ tasks_per_node = 4
# @ error = ./out/job3.$(jobid).err
# @ output = ./out/job3.$(jobid).out
# @ wall_clock_limit = 05:00
# @ class = demo
# @ notification = complete
# @ notify_user = [email protected]
# @ network.MPI = sn_all,shared,us
# @ queue
poe perlspin2


Basic LoadLeveler Commands

llsubmit – submits a job to LoadLeveler

llcancel – cancels a submitted job

llq – queries the status of jobs in the job queue

llstatus – queries the status of machines in the cluster


llq example

v01n08:/u/skoviraj $ llsubmit mybasic.cmd
llsubmit: The job "v01n08.vendor.pok.ibm.com.205" has been submitted

v01n08:/u/skoviraj $ llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
v01n08.204.0             skoviraj   11/11 22:29 R   50 No_Class     v01n02
v01n08.205.0             skoviraj   11/11 22:30 R   50 No_Class     v01n02
v01n08.203.0             skoviraj   11/11 22:28 I   50 No_class

3 job steps in queue, 1 waiting, 0 pending, 2 running, 0 held


llstatus example

v01n08:/u/skoviraj/suspender1.0/suspender_stuff $ llstatus v01n02
Name                      Schedd InQ Act Startd Run LdAvg Idle Arch  OpSys
v01n02.vendor.pok.ibm.com Avail    0   0 Run      1 0.00  9999 R6000 AIX43

v01n08:/u/skoviraj/suspender1.0/suspender_stuff $ llstatus | more
Name                      Schedd InQ Act Startd Run LdAvg Idle Arch  OpSys
v01n01.vendor.pok.ibm.com Avail    0   0 Idle     0 0.05  9999 R6000 AIX43
v01n02.vendor.pok.ibm.com Avail    0   0 Run      1 0.00  9999 R6000 AIX43
v01n03.vendor.pok.ibm.com Avail    0   0 Idle     0 0.00  9999 R6000 AIX43
v01n04.vendor.pok.ibm.com Avail    0   0 Idle     0 0.00  9999 R6000 AIX43
v01n05.vendor.pok.ibm.com Avail    0   0 Idle     0 0.02  9999 R6000 AIX43
v01n06.vendor.pok.ibm.com Avail    0   0 Idle     0 0.05  9999 R6000 AIX43
v01n07.vendor.pok.ibm.com Avail    1   0 Idle     0 0.06   155 R6000 AIX43
v01n08.vendor.pok.ibm.com Avail    1   0 Idle     0 0.00    83 R6000 AIX43
v01n09.vendor.pok.ibm.com Avail    0   0 Idle     0 0.00  9999 R6000 AIX43


llctl Examples

llctl -h hostname command

Useful Commands:

reconfig - Forces all daemons to reread the configuration files.

start - Starts the LoadLeveler daemons on the specified machine.

stop - Stops the LoadLeveler daemons on the specified machine.

Commands sometimes used:

flush - Terminates running jobs on this machine, places jobs in idle

recycle - Stops all LoadLeveler daemons and restarts them.


llctl Example

drain [schedd|startd [classlist |allclasses]]

With no options:
(1) no more LoadLeveler jobs can begin running on this machine,
(2) no more LoadLeveler jobs can be submitted through this machine.

When you issue drain schedd, the following happens:
(1) the schedd machine accepts no more LoadLeveler jobs for submission,
(2) jobs in the Starting or Running state in the queue are allowed to continue running,
(3) jobs in the Idle state in the schedd queue are drained.

When you issue drain startd, the following happens:
(1) the startd machine accepts no more LoadLeveler jobs to be run,
(2) jobs already running on the startd machine are allowed to complete.


More LoadLeveler Commands

llclass - returns information about available classes

llprio - changes the user priority of a job step


llclass Example

v60n129:/u/skoviraj $ llclass -l X_Class
=============== Class X_Class ===============
Name: X_Class
Priority: 0
Exclude_Users:
Include_Users:
Exclude_Groups:
Include_Groups:
Admin:
NQS_class: F
NQS_submit:
NQS_query:
Max_processors: -1
Maxjobs: -1
Resource_requirement:
Class_comment:
Class_ckpt_dir:
Ckpt_limit: undefined, undefined
Wall_clock_limit: 11+13:46:39, 11+13:46:39 (999999 seconds, 999999 seconds)
Job_cpu_limit: undefined, undefined

v60n129:/u/skoviraj $ llclass
Name            MaxJobCPU      MaxProcCPU     Free  Max   Description
                d+hh:mm:ss     d+hh:mm:ss     Slots Slots
--------------- -------------- -------------- ----- ----- ---------------------
inter_class     undefined      undefined        192   192
X_Class         undefined      undefined        192   192
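The Wall_clock_limit line in the llclass -l output shows the same limit twice, as d+hh:mm:ss and as seconds. The arithmetic linking the two can be checked with a short sketch; the helper name `format_limit` is made up for this example.

```python
def format_limit(seconds):
    """Format a number of seconds as d+hh:mm:ss."""
    days, rem = divmod(seconds, 86400)   # 86400 seconds per day
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}+{hours:02d}:{minutes:02d}:{secs:02d}"


limit = format_limit(999999)
print(limit)  # 11+13:46:39
```

So 999999 seconds is 11 days, 13 hours, 46 minutes and 39 seconds, which matches the value printed by llclass.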


llprio Example

v01n07:/u/skoviraj/suspender1.0/suspender_stuff $ llq
Id                       Owner      Submitted   ST PRI Class        Running On
v01n07.137.0             skoviraj   11/11 22:51 I   50 No_class
1 job steps in queue, 1 waiting, 0 pending, 0 running, 0 held

v01n07:/u/skoviraj/suspender1.0/suspender_stuff $ llprio -p 100 v01n07.137.0
llprio: Priority command has been sent to the central manager.

v01n07:/u/skoviraj/suspender1.0/suspender_stuff $ llq
Id                       Owner      Submitted   ST PRI Class        Running On
v01n07.137.0             skoviraj   11/11 22:51 I  100 No_class
1 job steps in queue, 1 waiting, 0 pending, 0 running, 0 held


Advanced Topics

Job Preemption

Job Checkpointing

Submit filter

LoadLeveler APIs (data access, scheduling)

Workload Manager (WLM) integration

Advance Reservation

Consumable resource control


Job Suspension

1. 4 Node job runs
2. 4 Node job suspended
3. 16 way job runs
4. 16 way job completes
5. 4 way job restarts


Job Checkpoint

1. 4 Node job runs
2. 4 Node job checkpoints and ends
3. 4 Node job state saved (GPFS)
4. 16 way job runs
5. 16 way job completes
6. 4 way job restarts from saved state


Submit Filter

use strict;
use warnings;

my $NetKey = 0;                     # Has this job step set a network keyword?
while (my $value = <STDIN>) {
    chomp($value);
    if ( $value =~ /network/ ) {    # If we find the network keyword....
        $NetKey = 1;                # remember it!
    }
    if ( $value =~ /queue/ ) {      # If at the end of LL keywords for this job step...
        if ( !$NetKey ) {           # if no network keyword...
            # Add one which uses the switch
            print "# @ network.MPI = sn_all,not_shared,US\n";
        }
        $NetKey = 0;                # Reset network keyword memory
    }
    print "$value\n";               # Copy a single LL cmd file line to new cmd file
}


Tips for Efficient Job Processing

Assumptions:
- One task per CPU
- Classes configured

Get your job to the TOP of the queue:
- Short run
- Small number of nodes
- Use IP communication over the switch
- Priority?
- Submit during low use periods (evening)

These are FREE! All of the tips above (except priority) impact no other job.


More Tips for Efficient Job Processing

Allow your job to run as QUICKLY as possible:

Balance node operations

Keep data entirely in physical memory

Use processors of similar types (system admin?)

Use distributed data load and store

Profile your applications for efficient compiler use

This could be an entirely new presentation!


Questions and Answers