© 2007 IBM Corporation
IBM Global Engineering Solutions
IBM Blue Gene/P
LoadLeveler Blue Gene Support
Enci Zhong, LoadLeveler Development
IBM Blue Gene/P System Administration
Interaction with Blue Gene
[Diagram: LoadLeveler and Blue Gene. Through the Blue Gene Bridge API, LoadLeveler gets resource and job data, finds resources for jobs, and defines partitions. Through Blue Gene mpirun, it runs the submitted jobs.]

LoadLeveler Daemons
[Diagram: daemons on each node]
Service Node
  Master           LoadL_master
  Central Manager  LoadL_negotiator
Front End Node
  Master           LoadL_master
  Schedd           LoadL_schedd
  Startd/Starter   LoadL_startd, LoadL_starter
Also shown: jobs, Blue Gene mpirun, Blue Gene Bridge API

LoadLeveler Configuration
[Diagram: configuration files on each node]
Service Node:    /etc/LoadL.cfg, LoadL_config.local
Front End Node:  /etc/LoadL.cfg, LoadL_config.local
Common to both:  LoadL_config, LoadL_admin

LoadL_config
SCHEDULER_TYPE = BACKFILL
NEGOTIATOR_CYCLE_DELAY = 10
VM_IMAGE_ALGORITHM = FREE_PAGING_SPACE_PLUS_FREE_REAL_MEMORY
BG_ENABLED = true
BG_CACHE_PARTITIONS = true
BG_MIN_PARTITION_SIZE = 32
CM_CHECK_USERID = false
BG_ALLOW_LL_JOBS_ONLY = false

LoadL_admin
<mySN>:  type = machine
         central_manager = true
<myFEN>: type = machine
         central_manager = false
         schedd_host = true    # Allow jobs to be submitted from the SN
small:   type = class
         include_bg = R00-M0
row1:    type = class
         include_bg = R1
medium:  type = class
         exclude_bg = R0 R1

LoadL_config.local
Front End Node
  START_DAEMONS = TRUE
  SCHEDD_RUNS_HERE = True
  STARTD_RUNS_HERE = True
  MAX_STARTERS = 60
  CLASS = small(10) row1(20) medium(30) large(10)
Service Node
  START_DAEMONS = TRUE
  SCHEDD_RUNS_HERE = FALSE
  STARTD_RUNS_HERE = FALSE
Note: mpirun runs on the FEN and uses few resources, so many mpirun processes can share the same FEN.

Before Starting LoadLeveler on Blue Gene/P
Standalone mpirun must work
Add userid loadl to the bgpadmin group
/usr/lib64/libdb2.so must exist
In the login profile of userid loadl, add
  export BRIDGE_CONFIG_FILE=/bgsys/drivers/ppcfloor/bin/bridge.config
  export DB_PROPERTY=/bgsys/drivers/ppcfloor/bin/db.properties.tpl
The two files, or their local copies, must be readable by userid loadl
Note: LoadLeveler needs to be restarted after Blue Gene driver or database updates, etc.

Starting LoadLeveler
llctl start                    on both the FEN and SN
llstatus                       look for “Blue Gene is present”
llstatus -b
  Name  Base Partitions  c-nodes   InQ  Run
  BGP   4x4x2            32x32x16  0    0
llstatus -B all                show all base partitions
llstatus -P <partition_name>
llstatus -b -l                 show more BG resources
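The Base Partitions and c-nodes columns in the llstatus -b row are x-by-y-by-z dimensions, so the machine totals follow from multiplying them out. A minimal Python sketch, assuming the five-column layout shown on the slide; the function name and dictionary keys are illustrative and not part of LoadLeveler:

```python
from math import prod


def parse_llstatus_b(line):
    """Parse one data row of `llstatus -b` as it appears on the slide.

    Assumed layout: Name, Base Partitions (XxYxZ), c-nodes (XxYxZ), InQ, Run.
    """
    name, bps, cnodes, in_q, running = line.split()
    bp_dims = tuple(int(n) for n in bps.split("x"))
    cnode_dims = tuple(int(n) for n in cnodes.split("x"))
    return {
        "name": name,
        "base_partitions": prod(bp_dims),
        "compute_nodes": prod(cnode_dims),
        "jobs_in_queue": int(in_q),
        "jobs_running": int(running),
    }


info = parse_llstatus_b("BGP 4x4x2 32x32x16 0 0")
print(info["base_partitions"], info["compute_nodes"])  # 32 16384
```

For the sample row this gives 32 base partitions and 16384 compute nodes, that is, 512 compute nodes per base partition.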

LoadLeveler Job Command File
# @ job_name = myjob
# @ comment = "BG Job by Size"
# @ error = $(home)/output/$(job_name).$(jobid).err
# @ output = $(home)/output/$(job_name).$(jobid).out
# @ environment = COPY_ALL;
# @ wall_clock_limit = 00:20:00
# @ notification = error
# @ notify_user = $(user)@us.ibm.com
# @ job_type = bluegene
# @ bg_size = 32
# @ queue
/usr/bin/mpirun -exe /bgtest/hello.rts -verbose 1

Blue Gene Job Keywords
Mutually exclusive (one must be specified)
  bg_size          number of compute nodes
  bg_shape         1x2x4, number of BPs in the x, y, z directions
  bg_partition     specify a predefined partition
Optional
  bg_connection    MESH, TORUS, PREFER_TORUS
  bg_rotate        True or False
  bg_requirements  c-node memory
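The keyword rules above can be checked before a job is submitted. The following Python sketch builds a small job command file and enforces that exactly one of bg_size, bg_shape, and bg_partition is given. The function is a hypothetical helper, not a LoadLeveler API; the keywords and the mpirun line are copied from the slides.

```python
def build_job_command_file(job_name, executable, wall_clock_limit,
                           bg_size=None, bg_shape=None, bg_partition=None,
                           bg_connection=None):
    """Return the text of a Blue Gene job command file.

    Exactly one of bg_size, bg_shape and bg_partition must be given,
    mirroring the "mutually exclusive (one must be specified)" rule.
    """
    selectors = {"bg_size": bg_size, "bg_shape": bg_shape,
                 "bg_partition": bg_partition}
    chosen = {k: v for k, v in selectors.items() if v is not None}
    if len(chosen) != 1:
        raise ValueError(
            "specify exactly one of bg_size, bg_shape, bg_partition")
    if bg_connection not in (None, "MESH", "TORUS", "PREFER_TORUS"):
        raise ValueError("bg_connection must be MESH, TORUS or PREFER_TORUS")

    keyword, value = next(iter(chosen.items()))
    lines = [
        f"# @ job_name = {job_name}",
        f"# @ wall_clock_limit = {wall_clock_limit}",
        "# @ job_type = bluegene",
        f"# @ {keyword} = {value}",
    ]
    if bg_connection is not None:
        lines.append(f"# @ bg_connection = {bg_connection}")
    lines.append("# @ queue")
    # Same mpirun invocation as in the sample job command file.
    lines.append(f"/usr/bin/mpirun -exe {executable} -verbose 1")
    return "\n".join(lines) + "\n"


jcf = build_job_command_file("myjob", "/bgtest/hello.rts", "00:20:00",
                             bg_size=32)
print(jcf)
```

Passing both bg_size and bg_shape, or none of the three, raises ValueError instead of producing a file that mixes mutually exclusive keywords.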

Submit a Job
llsubmit <my_job_command_file>
llq
llq -b             show Blue Gene specific info
llq -s <step_id>   show why the job step remains idle

Partition Size and I/O Nodes
  I/O Nodes/BP = 4,  partition size >= 128
  I/O Nodes/BP = 8,  partition size >= 64/128
  I/O Nodes/BP = 16, partition size >= 32
  I/O Nodes/BP = 32, partition size >= 16/32
Only Blue Gene/P allows partition sizes 16, 64 and 256
LoadLeveler-defined partition size cannot be smaller than BG_MIN_PARTITION_SIZE
1 rack has two Base Partitions (BP)
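The limits in this table follow from a partition needing at least one I/O node: a base partition of 512 compute nodes (16384 c-nodes over 32 base partitions in the llstatus -b sample) divided by the number of I/O nodes per BP. The Python sketch below reproduces the table under two assumptions that the slide does not state outright: that the paired values such as 64/128 mean Blue Gene/P versus Blue Gene/L, and that the allowed size lists are as written in the code.

```python
COMPUTE_NODES_PER_BP = 512  # 16384 c-nodes / 32 BPs in the llstatus -b sample

# Partition sizes up to one base partition. The slide only states that
# 16, 64 and 256 are Blue Gene/P specific; the full lists are an assumption.
ALLOWED_SIZES = {
    "P": (16, 32, 64, 128, 256, 512),
    "L": (32, 128, 512),
}


def min_partition_size(io_nodes_per_bp, system="P"):
    """Smallest allowed partition that still contains one I/O node."""
    nodes_per_io_node = COMPUTE_NODES_PER_BP // io_nodes_per_bp
    for size in ALLOWED_SIZES[system]:
        if size >= nodes_per_io_node:
            return size
    raise ValueError("no allowed partition size is large enough")


def effective_min_partition_size(io_nodes_per_bp, bg_min_partition_size,
                                 system="P"):
    """Apply BG_MIN_PARTITION_SIZE on top of the hardware limit."""
    return max(min_partition_size(io_nodes_per_bp, system),
               bg_min_partition_size)


for ion in (4, 8, 16, 32):
    print(ion, min_partition_size(ion, "P"), min_partition_size(ion, "L"))
```

With these assumptions the loop prints 128/128, 64/128, 32/32 and 16/32 for 4, 8, 16 and 32 I/O nodes per BP, matching the four rows of the table.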

Mixed I/O Nodes Ratio
One rack has 16 I/O Nodes/BP
Other racks have 4 I/O Nodes/BP
A job that asks for 32 compute nodes will run only on the rack with 16 I/O Nodes/BP
A job that asks for 128 compute nodes can run on any rack
BG_MIN_PARTITION_SIZE=16 gives an actual minimum of 32
BG_MIN_PARTITION_SIZE=128 gives an actual minimum of 128
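The mixed-ratio case can be worked through numerically. This Python sketch assumes 512 compute nodes per base partition and one I/O node per partition at minimum; the rack names are hypothetical, and the functions only illustrate the arithmetic, not how LoadLeveler selects resources internally.

```python
COMPUTE_NODES_PER_BP = 512  # assumed size of one base partition


def smallest_partition(io_nodes_per_bp):
    """Smallest partition that still contains one I/O node (Blue Gene/P)."""
    return COMPUTE_NODES_PER_BP // io_nodes_per_bp


def eligible_racks(job_size, racks):
    """Names of racks that can host a partition of job_size compute nodes."""
    return [name for name, io_nodes in racks.items()
            if smallest_partition(io_nodes) <= job_size]


def actual_min_partition_size(racks, bg_min_partition_size):
    """Smallest partition the system can create given BG_MIN_PARTITION_SIZE."""
    best = min(smallest_partition(n) for n in racks.values())
    return max(best, bg_min_partition_size)


# Hypothetical machine: one rack with 16 I/O nodes per BP, two with 4.
racks = {"R0": 16, "R1": 4, "R2": 4}
print(eligible_racks(32, racks))               # ['R0']
print(eligible_racks(128, racks))              # ['R0', 'R1', 'R2']
print(actual_min_partition_size(racks, 16))    # 32
print(actual_min_partition_size(racks, 128))   # 128
```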

Unconnected I/O Nodes
Each BP has 16 I/O Nodes (ION)
One rack has all 16 IONs/BP connected
Other racks have only 4 of them connected
Must set
  max_psets_per_bp=4 in the db.properties file
  BG_MIN_PARTITION_SIZE=128
Dynamically created partitions only use 4 IONs per BP
Predefined partitions (through mmcs_db_console or the navigator) can use more IONs and be smaller

Advance Reservation
In LoadL_admin, add
  loadl: type = user
         max_reservations = 10
llmkres -t 14:00 -d 300 -c 1024
llmkres -t 12/18 08:00 -d 60 -f my_jcf
In LoadL_config, can add
  MAX_RESERVATIONS = 20 (default 10)

Advance Reservation
Reserve for maintenance
Reserve for a special workload
Allow other users or groups to use the reservation
Allow a reservation to be automatically cancelled if no more jobs can run
Allow extra resources to be shared when all special jobs for the reservation start to run

Advance Reservation
More resources are needed by TORUS than by MESH
A reservation made through bg_partition reserves exactly the same resources as the predefined partition
A reservation made through bg_size or bg_shape can reserve more resources to allow smaller jobs to run inside the reservation

Fair Share Scheduling
Share resources “fairly” according to resource entitlement and usage
In LoadL_config, specify
  FAIR_SHARE_TOTAL_SHARES = 1000
  FAIR_SHARE_INTERVAL = 720
llfs                   to show shares allocated and used
llfs -s/-r <file>/-r   to save/restore/reset the fair share data

Fair Share Scheduling
It’s all about job priority!
SYSPRIO must be specified to enable Fair Share Scheduling (very flexible)
NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL must be positive
In LoadL_admin, specify fair_shares values for some or all users/groups
All users can run jobs even if fair_shares=0
A Mixed LoadLeveler Cluster
A Blue Gene system can be in the same cluster with other AIX or Linux machines
The Central Manager must be run on the service node of the Blue Gene system
Only one Blue Gene system can be in a LoadLeveler cluster
Job classes can be used to separate Blue Gene FENs, Linux and AIX machines
End users can submit all jobs the same way

Multicluster Support
In LoadL_admin, add Multicluster definitions
################################# MULTICLUSTER DEFINITIONS #################################
BGL:  type = cluster
      outbound_hosts = bglfen3
      inbound_hosts = bglfen3
      local = true
BGP1: type = cluster
      outbound_hosts = dd1sys1fen1
      inbound_hosts = dd1sys1fen1
BGP2: type = cluster
      outbound_hosts = dd2sys1fen2
      inbound_hosts = dd2sys1fen2
Three separate clusters form a Multicluster environment

Multicluster Support
From one cluster, a user can submit jobs to any other cluster
  llsubmit -X BGP1 my_job_command_file
From one cluster, a user can query jobs in any other cluster
  llq -X BGP2

Runtime Environment Available to Prologs and Epilogs
In LoadL_config, add
  JOB_PROLOG = /bgtest/bg_job_prolog.sh

#!/bin/ksh
name=`basename $0 .sh`
echo "$LOADL_BG_PARTITION $LOADL_BG_SIZE $LOADL_BG_CONNECTION $LOADL_BG_BPS $LOADL_BG_IONODES `date` $LOADL_STEP_OWNER $LOADL_STEP_ID $LOADL_STEP_CLASS " > /tmp/$name.$LOADL_STEP_ID.log

cat /tmp/bg_job_prolog.bgpdd1sys1.rchland.ibm.com.2.0.log
LL07111910011602 512 MESH R20-M1 N00-J00,N04-J00,N08-J00,N12-J00 Mon Nov 19 10:01:16 CST 2007 ezhong bgpdd1sys1.rchland.ibm.com.2.0 high

Blue Gene Job Info from llq
# llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bgpdd1sys1.9.0           ezhong     11/21 10:29 R  50  high         bgpdd1sys1

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted

# llq -b
Id                       Owner      Submitted   LL BG PT Partition        Size
________________________ __________ ___________ __ __ __ ________________ ______
bgpdd1sys1.9.0           ezhong     11/21 10:29 R     FR LL07112110294409 512

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted

# llq -f %id %BB %BS %PT %BG %dd %st
Step Id                  Partition        Size   PT BG Disp. Date  ST
------------------------ ---------------- ------ -- -- ----------- --
bgpdd1sys1.9.0           LL07112110294409 512    FR    11/21 10:29 R

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted

Blue Gene Job Info from llq
# llq -l
=============== Job Step bgpdd1sys1.rchland.ibm.com.9.0 ===============
...
          Step Type: Blue Gene
     Size Requested: 512
     Size Allocated: 512
    Shape Requested:
    Shape Allocated: 1x1x1
   Wiring Requested: MESH
   Wiring Allocated: MESH
             Rotate: True
   Blue Gene Status:
   Blue Gene Job Id:
Partition Requested:
Partition Allocated: LL07112110294409
 BG Partition State: FREE
    BG Requirements:
...

Multiple Top Dogs
Resources are reserved for the highest priority jobs (top dogs) during a dispatching cycle, and other jobs are backfilled around them.
In LoadL_config, set MAX_TOP_DOGS = <number>
In LoadL_admin, set max_top_dogs = <number>
Default number of top dogs is 1.

Top Dog Query
A sample Data Access API program
  /opt/ibmll/LoadL/full/samples/lldata_access/topdog.c

> make
/usr/bin/g++ -m64 -g -I. -I/opt/ibmll/LoadL/full/include -c -o topdog.o topdog.c
/usr/bin/g++ -m64 -g -I. -I/opt/ibmll/LoadL/full/include -o topdog topdog.o -m64 -L. -L/usr/lib64 -lllapi -lpthread -ldl

> ./topdog
Step                           Owner      q_sysprio  Estimated Start Time
------------------------------ ---------- ---------- ------------------------
bgpsys6.rchland.ibm.com.56.0   ezhong     50000      Thu Jun 21 17:50:32 2007
bgpsys6.rchland.ibm.com.56.1   ezhong     50000      Thu Jun 21 18:00:19 2007
bgpsys6.rchland.ibm.com.55.2   ezhong     50000      Thu Jun 21 17:50:19 2007
bgpsys6.rchland.ibm.com.55.3   ezhong     50000      Thu Jun 21 17:50:32 2007
===== The top dogs were considered for scheduling at Thu Jun 21 17:40:43 2007

More about job priority
q_sysprio in the llq -l output is used by the LoadLeveler Central Manager for scheduling
Set in LoadL_config
  SYSPRIO_THRESHOLD_TO_IGNORE_STEP = integer
  Jobs with lower q_sysprio won’t be scheduled to run
llmodify -s <q_sysprio> <step_id>   (admin-only command option)
  Assigns a fixed priority that won’t be changed by priority recalculation

LoadLeveler Download Sites
For the initial download (including the license information)
  https://www14.software.ibm.com/webapp/iwm/web/preLogin.do?source=BGL-BLUEGENE
  https://www14.software.ibm.com/webapp/iwm/web/preLogin.do?source=BGP-BLUEGENEP
  Those pages are password protected.
For the updates
  http://www14.software.ibm.com/webapp/set2/sas/f/lodleveler/home.html
  Open for everyone.
For LoadLeveler documentation
  http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.infocenter.doc/library.html

Installing LoadLeveler for Blue Gene/P
File sets needed
  IBMJava2-142-ppc64-JRE-1.4.2-5.0.ppc64.rpm
  LoadL-full-license-SLES10-PPC64-3.4.2.1-0.ppc64.rpm
  LoadL-full-SLES10-PPC64-3.4.2.1-0.ppc64.rpm
From the directory with the filesets:
  rpm -ihv LoadL-full-license-SLES10-PPC64-3.4.2.1-0.ppc64.rpm
  /opt/ibmll/LoadL/sbin/install_ll -y -d .

Installing LoadLeveler for Blue Gene/L
Please see Chapter 10 of the IBM Redbook “IBM System Blue Gene Solution: Configuring and Maintaining Your Environment”
  http://www.redbooks.ibm.com/abstracts/sg247352.html