© 2007 IBM Corporation IBM Global Engineering Solutions IBM Blue Gene/P LoadLeveler Blue Gene Support Enci Zhong LoadLeveler Development


© 2007 IBM Corporation

IBM Global Engineering Solutions

IBM Blue Gene/P

LoadLeveler Blue Gene Support

Enci Zhong, LoadLeveler Development


IBM Blue Gene/P System Administration

Interaction with Blue Gene

[Diagram: jobs are submitted to LoadLeveler. LoadLeveler uses the Blue Gene Bridge API to get resource and job data, finds resources for jobs and defines partitions, and runs each job through Blue Gene mpirun.]


LoadLeveler Daemons

Service Node
    Master: LoadL_master
    Central Manager: LoadL_negotiator

Front End Node
    Master: LoadL_master
    Schedd: LoadL_schedd
    Startd/Starter: LoadL_startd, LoadL_starter

[Diagram labels: Jobs, Blue Gene mpirun, Blue Gene Bridge API]


LoadLeveler Configuration

Service Node: /etc/LoadL.cfg, LoadL_config.local

Front End Node: /etc/LoadL.cfg, LoadL_config.local

Both nodes: LoadL_config, LoadL_admin


LoadL_config

SCHEDULER_TYPE = BACKFILL

NEGOTIATOR_CYCLE_DELAY = 10

VM_IMAGE_ALGORITHM = FREE_PAGING_SPACE_PLUS_FREE_REAL_MEMORY

BG_ENABLED = true

BG_CACHE_PARTITIONS = true

BG_MIN_PARTITION_SIZE = 32

CM_CHECK_USERID = false

BG_ALLOW_LL_JOBS_ONLY = false
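The settings above are plain `KEYWORD = value` lines, so they are easy to sanity-check before LoadLeveler is started. The following is a minimal Python sketch, not part of LoadLeveler: the helper names `parse_config` and `check_bg_settings` are hypothetical, and the checks only compare against the values shown on this slide.

```python
def parse_config(text):
    """Parse 'KEYWORD = value' lines into a dict (hypothetical helper)."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().upper()] = value.strip()
    return settings


def check_bg_settings(settings):
    """Return a list of differences from the Blue Gene values on the slide."""
    problems = []
    if settings.get("BG_ENABLED", "false").lower() != "true":
        problems.append("BG_ENABLED is not true")
    if settings.get("SCHEDULER_TYPE", "").upper() != "BACKFILL":
        problems.append("SCHEDULER_TYPE differs from the slide (BACKFILL)")
    if not settings.get("BG_MIN_PARTITION_SIZE", "").isdigit():
        problems.append("BG_MIN_PARTITION_SIZE is not a number")
    return problems


SAMPLE = """
SCHEDULER_TYPE = BACKFILL
NEGOTIATOR_CYCLE_DELAY = 10
BG_ENABLED = true
BG_CACHE_PARTITIONS = true
BG_MIN_PARTITION_SIZE = 32
"""

print(check_bg_settings(parse_config(SAMPLE)) or "Blue Gene settings match the slide")
```

An empty list from `check_bg_settings` means the Blue Gene keywords match the values in this presentation; it says nothing about the rest of the configuration.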


LoadL_admin

<mySN>: type = machine
        central_manager = true

<myFEN>: type = machine
         central_manager = false
         schedd_host = true    # Allow jobs to be submitted from the SN

small: type = class
       include_bg = R00-M0

row1: type = class
      include_bg = R1

medium: type = class
        exclude_bg = R0 R1


LoadL_config.local

Front End Node

START_DAEMONS = TRUE
SCHEDD_RUNS_HERE = True
STARTD_RUNS_HERE = True
MAX_STARTERS = 60
CLASS = small(10) row1(20) medium(30) large(10)

Service Node

START_DAEMONS = TRUE
SCHEDD_RUNS_HERE = FALSE
STARTD_RUNS_HERE = FALSE

Note: mpirun is run on the FEN and it doesn’t use a lot of resources and thus many mpirun processes can share the same FEN.


Before Starting LoadLeveler on Blue Gene/P

Standalone mpirun must work

Add userid loadl to the bgpadmin group

/usr/lib64/libdb2.so must exist

In the login profile of userid loadl, add

export BRIDGE_CONFIG_FILE=/bgsys/drivers/ppcfloor/bin/bridge.config

export DB_PROPERTY=/bgsys/drivers/ppcfloor/bin/db.properties.tpl

The two files, or local copies of them, must be readable by userid loadl

Note: LoadLeveler needs to be restarted after Blue Gene driver or database updates, etc.
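Part of the checklist above can be verified mechanically. The sketch below is a hypothetical helper, not an IBM tool: `preflight` only checks that libdb2.so exists and that the two exported variables point at readable files. It does not test standalone mpirun or membership in the bgpadmin group.

```python
import os


def preflight(env, libdb2="/usr/lib64/libdb2.so"):
    """Return a list of unmet prerequisites from the slide's checklist."""
    problems = []
    if not os.path.exists(libdb2):
        problems.append(f"{libdb2} does not exist")
    for var in ("BRIDGE_CONFIG_FILE", "DB_PROPERTY"):
        path = env.get(var)
        if not path:
            problems.append(f"{var} is not set")
        elif not os.access(path, os.R_OK):
            problems.append(f"{var}={path} is not readable by this user")
    return problems


# Run as userid loadl so the readability check applies to that user.
for problem in preflight(os.environ):
    print("FAIL:", problem)
```

Because readability is checked for the calling user, the result is only meaningful when the script runs under the loadl account with its login profile loaded.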


Starting LoadLeveler

llctl start                     on both the FEN and SN
llstatus                        look for “Blue Gene is present”
llstatus -b
    Name  Base Partitions  c-nodes   InQ  Run
    BGP   4x4x2            32x32x16  0    0
llstatus -B all                 show all base partitions
llstatus -P <partition_name>
llstatus -b -l                  show more BG resources
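The two shapes in the `llstatus -b` sample describe the same machine in different units, and the arithmetic can be checked directly. This is a small illustrative sketch: `shape_size` is a hypothetical helper, and the 512 compute nodes per base partition is the Blue Gene/P midplane size, which also matches the "Size Allocated: 512, Shape Allocated: 1x1x1" job shown later in this deck.

```python
from math import prod

C_NODES_PER_BP = 512  # compute nodes in one Blue Gene/P base partition


def shape_size(shape):
    """Multiply out an 'XxYxZ' shape string, e.g. '4x4x2' gives 32."""
    return prod(int(n) for n in shape.lower().split("x"))


base_partitions = shape_size("4x4x2")   # base partitions in the sample output
c_nodes = shape_size("32x32x16")        # compute nodes in the sample output
print(base_partitions, c_nodes, c_nodes // base_partitions)
```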


LoadLeveler Job Command File

# @ job_name = myjob
# @ comment = "BG Job by Size"
# @ error = $(home)/output/$(job_name).$(jobid).err
# @ output = $(home)/output/$(job_name).$(jobid).out
# @ environment = COPY_ALL;
# @ wall_clock_limit = 00:20:00
# @ notification = error
# @ notify_user = $(user)@us.ibm.com
# @ job_type = bluegene
# @ bg_size = 32
# @ queue
/usr/bin/mpirun -exe /bgtest/hello.rts -verbose 1


Blue Gene Job Keywords

Mutually exclusive (one must be specified)
    bg_size           number of compute nodes
    bg_shape          1x2x4, number of BPs in the x, y, z directions
    bg_partition      specify a predefined partition

Optional
    bg_connection     MESH, TORUS, PREFER_TORUS
    bg_rotate         True or False
    bg_requirements   c-node memory
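The keyword rules above can be checked before a job command file is submitted. The following Python sketch is an illustration only: `read_keywords` and `check_bluegene_keywords` are hypothetical helpers, and they assume the `# @ keyword = value` directive layout used in the sample job command file in this deck.

```python
import re

SIZE_KEYWORDS = {"bg_size", "bg_shape", "bg_partition"}
CONNECTIONS = {"MESH", "TORUS", "PREFER_TORUS"}
DIRECTIVE = re.compile(r"^#\s*@\s*(\w+)\s*=\s*(.*)$")


def read_keywords(job_command_file_text):
    """Collect '# @ keyword = value' directives into a dict."""
    keywords = {}
    for line in job_command_file_text.splitlines():
        match = DIRECTIVE.match(line.strip())
        if match:
            keywords[match.group(1).lower()] = match.group(2).strip()
    return keywords


def check_bluegene_keywords(keywords):
    """Return a list of problems; an empty list means the rules hold."""
    problems = []
    chosen = SIZE_KEYWORDS.intersection(keywords)
    if len(chosen) != 1:
        problems.append("specify exactly one of bg_size, bg_shape, bg_partition")
    connection = keywords.get("bg_connection")
    if connection is not None and connection.upper() not in CONNECTIONS:
        problems.append("bg_connection must be MESH, TORUS or PREFER_TORUS")
    return problems


JOB = """# @ job_type = bluegene
# @ bg_size = 32
# @ queue
"""
print(check_bluegene_keywords(read_keywords(JOB)))
```

The check is meant for steps with `job_type = bluegene`; lines without an `=` sign, such as `# @ queue`, are ignored.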


Submit a Job

llsubmit <my_job_command_file>
llq
llq -b               show Blue Gene specific info
llq -s <step_id>     show why the job step remains idle


Partition Size and I/O Nodes

I/O Nodes/BP = 4, partition size >= 128
I/O Nodes/BP = 8, partition size >= 64/128
I/O Nodes/BP = 16, partition size >= 32
I/O Nodes/BP = 32, partition size >= 16/32

Only Blue Gene/P allows partition sizes 16, 64 and 256

A LoadLeveler-defined partition cannot be smaller than BG_MIN_PARTITION_SIZE

1 rack has two Base Partitions (BP)


Mixed I/O Nodes Ratio

One rack has 16 I/O Nodes/BP

Other racks have 4 I/O Nodes/BP

A job that asks for 32 compute nodes will only be run on the rack with 16 I/O Nodes/BP

A job that asks for 128 compute nodes can be run on any rack

BG_MIN_PARTITION_SIZE=16 -> 32 actual

BG_MIN_PARTITION_SIZE=128 -> 128 actual
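The relationship between I/O node ratio and partition size can be worked through in a few lines. This sketch rests on stated assumptions: every partition needs at least one I/O node, a Blue Gene/P base partition has 512 compute nodes, and "actual" on the slide means the configured minimum raised to the hardware limit. The rack names and function names are hypothetical. Where the earlier slide lists two values (64/128, 16/32), the sketch reproduces the smaller one.

```python
C_NODES_PER_BP = 512  # compute nodes in one Blue Gene/P base partition


def smallest_partition(io_nodes_per_bp):
    """Smallest partition a BP can host, given one I/O node per partition."""
    return C_NODES_PER_BP // io_nodes_per_bp


def racks_for_job(bg_size, racks):
    """Names of racks whose I/O node ratio allows a partition of bg_size."""
    return [name for name, ion in racks.items()
            if bg_size >= smallest_partition(ion)]


def actual_minimum(bg_min_partition_size, racks):
    """Configured minimum, but never below the smallest hardware limit."""
    hardware_min = min(smallest_partition(ion) for ion in racks.values())
    return max(bg_min_partition_size, hardware_min)


# Hypothetical rack names mapped to I/O nodes per BP, as in the mixed-ratio slide.
RACKS = {"R00": 16, "R01": 4, "R02": 4}

print(racks_for_job(32, RACKS), racks_for_job(128, RACKS))
print(actual_minimum(16, RACKS), actual_minimum(128, RACKS))
```

With these inputs a 32-node job fits only on the rack with 16 I/O nodes per BP, a 128-node job fits on any rack, and the two BG_MIN_PARTITION_SIZE settings give 32 and 128, matching the slide.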


Unconnected I/O Nodes

Each BP has 16 I/O Nodes (ION)

One rack has all 16 IONs/BP connected

Other racks have only 4 of them connected

Must set
    max_psets_per_bp=4 in db.properties file
    BG_MIN_PARTITION_SIZE=128

Dynamically created partitions only use 4 IONs per BP

Predefined partitions (through mmcs_db_console or the navigator) can use more IONs and be smaller


Advance Reservation

In LoadL_admin, add
    loadl: type = user
           max_reservations = 10

llmkres -t 14:00 -d 300 -c 1024
llmkres -t 12/18 08:00 -d 60 -f my_jcf

In LoadL_config, can add
    MAX_RESERVATIONS = 20 (default 10)


Advance Reservation

Reserve for maintenance

Reserve for special workload

Allow other users or groups to use

Allow a reservation to be automatically cancelled if no more jobs can run

Allow extra resources to be shared when all special jobs for the reservation start to run


Advance Reservation

More resources are needed by TORUS than by MESH

Reservation made through bg_partition reserves exactly the same resources as the predefined partition

Reservation made through bg_size or bg_shape can reserve more resources to allow smaller jobs to run inside the reservation


Fair Share Scheduling

Share resources “fairly” according to resource entitlement and usage

In LoadL_config, specify
    FAIR_SHARE_TOTAL_SHARES = 1000
    FAIR_SHARE_INTERVAL = 720

llfs                     to show shares allocated and used
llfs -s/-r <file>/-r     to save/restore/reset the fair share data


Fair Share Scheduling

It’s all about job priority!

SYSPRIO must be specified to enable Fair Share Scheduling
    very flexible

NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL must be positive

In LoadL_admin, specify fair_shares values for some or all users/groups

All users can run jobs even if fair_shares=0
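As a rough illustration of how fair_shares values relate to FAIR_SHARE_TOTAL_SHARES, the sketch below assumes that a user's or group's entitlement is simply the ratio of the two. The user names and share values are hypothetical, and the sketch does not model how SYSPRIO turns entitlement and usage into a job priority.

```python
FAIR_SHARE_TOTAL_SHARES = 1000  # value from the LoadL_config example


def entitlement(fair_shares, total_shares=FAIR_SHARE_TOTAL_SHARES):
    """Fraction of the resources a user or group is entitled to (assumed ratio)."""
    return fair_shares / total_shares


# Hypothetical fair_shares values for three users in LoadL_admin.
users = {"alice": 100, "bob": 250, "carol": 0}
for name, shares in users.items():
    print(f"{name}: {entitlement(shares):.1%}")
```

A user with fair_shares=0, like the hypothetical carol here, has no entitlement but can still run jobs, as the slide notes.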


A Mixed LoadLeveler Cluster

A Blue Gene system can be in the same cluster with other AIX or Linux machines

The Central Manager must be run on the service node of the Blue Gene system

Only one Blue Gene system can be in a LoadLeveler cluster

Job classes can be used to separate Blue Gene FENs, Linux and AIX machines

End users can submit all jobs the same way


Multicluster Support

In LoadL_admin, add Multicluster definitions

################################# MULTICLUSTER DEFINITIONS #################################
BGL: type = cluster
     outbound_hosts = bglfen3
     inbound_hosts = bglfen3
     local = true

BGP1: type = cluster
      outbound_hosts = dd1sys1fen1
      inbound_hosts = dd1sys1fen1

BGP2: type = cluster
      outbound_hosts = dd2sys1fen2
      inbound_hosts = dd2sys1fen2

Three separate clusters form a Multicluster environment


Multicluster Support

From one cluster, a user can submit jobs to any other cluster
    llsubmit -X BGP1 my_job_command_file

From one cluster, a user can query jobs in any other cluster
    llq -X BGP2


Runtime Environment Available to Prologs and Epilogs

In LoadL_config, add
    JOB_PROLOG = /bgtest/bg_job_prolog.sh

#!/bin/ksh
name=`basename $0 .sh`
echo "$LOADL_BG_PARTITION $LOADL_BG_SIZE $LOADL_BG_CONNECTION $LOADL_BG_BPS $LOADL_BG_IONODES `date` $LOADL_STEP_OWNER $LOADL_STEP_ID $LOADL_STEP_CLASS " > /tmp/$name.$LOADL_STEP_ID.log

cat /tmp/bg_job_prolog.bgpdd1sys1.rchland.ibm.com.2.0.log
LL07111910011602 512 MESH R20-M1 N00-J00,N04-J00,N08-J00,N12-J00 Mon Nov 19 10:01:16 CST 2007 ezhong bgpdd1sys1.rchland.ibm.com.2.0 high
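The same record can be produced from any language that can read the environment. The sketch below is a hypothetical Python counterpart of the ksh prolog: it uses only the LOADL_* variable names that appear in the script above, the helper names are illustrative, and the timestamp format only approximates the default `date` output.

```python
import time

# Environment variables named in the ksh prolog, in the same order.
BEFORE_DATE = ("LOADL_BG_PARTITION", "LOADL_BG_SIZE", "LOADL_BG_CONNECTION",
               "LOADL_BG_BPS", "LOADL_BG_IONODES")
AFTER_DATE = ("LOADL_STEP_OWNER", "LOADL_STEP_ID", "LOADL_STEP_CLASS")


def prolog_line(env, now=None):
    """Build the space-separated record that the ksh prolog echoes."""
    stamp = time.strftime("%a %b %d %H:%M:%S %Z %Y", time.localtime(now))
    fields = [env.get(var, "") for var in BEFORE_DATE]
    fields.append(stamp)
    fields.extend(env.get(var, "") for var in AFTER_DATE)
    return " ".join(fields)


def log_path(env, name="bg_job_prolog", directory="/tmp"):
    """Mirror /tmp/$name.$LOADL_STEP_ID.log from the ksh script."""
    return f"{directory}/{name}.{env.get('LOADL_STEP_ID', '')}.log"


# Sample values taken from the log line shown above.
SAMPLE_ENV = {
    "LOADL_BG_PARTITION": "LL07111910011602",
    "LOADL_BG_SIZE": "512",
    "LOADL_BG_CONNECTION": "MESH",
    "LOADL_BG_BPS": "R20-M1",
    "LOADL_BG_IONODES": "N00-J00,N04-J00,N08-J00,N12-J00",
    "LOADL_STEP_OWNER": "ezhong",
    "LOADL_STEP_ID": "bgpdd1sys1.rchland.ibm.com.2.0",
    "LOADL_STEP_CLASS": "high",
}
print(log_path(SAMPLE_ENV))
print(prolog_line(SAMPLE_ENV))
```

In a real prolog the dictionary would be `os.environ` and the line would be written to the file returned by `log_path`.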


Blue Gene Job Info from llq

# llq

Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bgpdd1sys1.9.0           ezhong     11/21 10:29 R  50  high         bgpdd1sys1

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted

# llq -b

Id                       Owner      Submitted   LL BG PT Partition        Size
________________________ __________ ___________ __ __ __ ________________ ______
bgpdd1sys1.9.0           ezhong     11/21 10:29 R     FR LL07112110294409 512

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted

# llq -f %id %BB %BS %PT %BG %dd %st

Step Id                  Partition        Size   PT BG Disp. Date  ST
------------------------ ---------------- ------ -- -- ----------- --
bgpdd1sys1.9.0           LL07112110294409 512    FR    11/21 10:29 R

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted


Blue Gene Job Info from llq

# llq -l

=============== Job Step bgpdd1sys1.rchland.ibm.com.9.0 ===============
...
            Step Type: Blue Gene
       Size Requested: 512
       Size Allocated: 512
      Shape Requested:
      Shape Allocated: 1x1x1
     Wiring Requested: MESH
     Wiring Allocated: MESH
               Rotate: True
     Blue Gene Status:
     Blue Gene Job Id:
  Partition Requested:
  Partition Allocated: LL07112110294409
   BG Partition State: FREE
      BG Requirements:
...


Multiple Top Dogs

During a dispatching cycle, resources are reserved for the highest-priority jobs (top dogs), and other jobs are backfilled around them.

In LoadL_config, set MAX_TOP_DOGS = <number>

In LoadL_admin, set max_top_dogs = <number>

Default number of top dogs is 1.


Top Dog Query

A sample Data Access API program
    /opt/ibmll/LoadL/full/samples/lldata_access/topdog.c

> make
/usr/bin/g++ -m64 -g -I. -I/opt/ibmll/LoadL/full/include -c -o topdog.o topdog.c
/usr/bin/g++ -m64 -g -I. -I/opt/ibmll/LoadL/full/include -o topdog topdog.o -m64 -L. -L/usr/lib64 -lllapi -lpthread -ldl

> ./topdog
Step                           Owner      q_sysprio  Estimated Start Time
------------------------------ ---------- ---------- ------------------------
bgpsys6.rchland.ibm.com.56.0   ezhong     50000      Thu Jun 21 17:50:32 2007
bgpsys6.rchland.ibm.com.56.1   ezhong     50000      Thu Jun 21 18:00:19 2007
bgpsys6.rchland.ibm.com.55.2   ezhong     50000      Thu Jun 21 17:50:19 2007
bgpsys6.rchland.ibm.com.55.3   ezhong     50000      Thu Jun 21 17:50:32 2007
===== The top dogs were considered for scheduling at Thu Jun 21 17:40:43 2007


More about job priority

q_sysprio in the llq -l output is used by the LoadLeveler Central Manager for scheduling

Set in LoadL_config
    SYSPRIO_THRESHOLD_TO_IGNORE_STEP = integer
    Jobs with lower q_sysprio won’t be scheduled to run

llmodify -s <q_sysprio> <step_id>    (admin-only command option)
    Assigns a fixed priority that won’t be changed by priority recalculation


LoadLeveler Download Sites

For the initial download (including the license information)
    https://www14.software.ibm.com/webapp/iwm/web/preLogin.do?source=BGL-BLUEGENE
    https://www14.software.ibm.com/webapp/iwm/web/preLogin.do?source=BGP-BLUEGENEP
    Those pages are password protected.

For the updates
    http://www14.software.ibm.com/webapp/set2/sas/f/lodleveler/home.html
    open for everyone.

For LoadLeveler documentation
    http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.infocenter.doc/library.html


Installing LoadLeveler for Blue Gene/P

File sets needed
    IBMJava2-142-ppc64-JRE-1.4.2-5.0.ppc64.rpm
    LoadL-full-license-SLES10-PPC64-3.4.2.1-0.ppc64.rpm
    LoadL-full-SLES10-PPC64-3.4.2.1-0.ppc64.rpm

From the directory with the filesets:
    rpm -ihv LoadL-full-license-SLES10-PPC64-3.4.2.1-0.ppc64.rpm
    /opt/ibmll/LoadL/sbin/install_ll -y -d .


Installing LoadLeveler for Blue Gene/L

Please see Chapter 10 of the IBM Redbook “IBM System Blue Gene Solution: Configuring and Maintaining Your Environment”
    http://www.redbooks.ibm.com/abstracts/sg247352.html