Working with Oracle/Sun Grid Engine
Uploaded by dan-morrill
WORKING WITH SUN/ORACLE GRID ENGINE IN UBUNTU 12.10
CIS 210 February 2013
What is “Grid Engine”?
Sun/Oracle Grid Engine is:
A quick and easy way to set up a multi-cluster system using existing hardware.
Oracle describes Grid Engine as the most widely deployed workload management solution in the industry, with unmatched scalability. On top of a rich set of advanced scheduling capabilities and the flexibility to adapt to any computing environment and application workload, Oracle says Grid Engine offers comprehensive support for the cloud computing model.
How to Install
Via webappl.blogspot.com: http://webappl.blogspot.com/2011/05/install-sun-grid-engine-sge-on-ubuntu.html
Install SGE on master node:
mpiuser@ub0:~$ sudo apt-get install gridengine-client gridengine-common gridengine-master gridengine-qmon gridengine-exec
# Remove gridengine-exec from the list if the master node is not supposed to run jobs.
# During the installation, we need to set the cluster CELL name (such as 'default').
Install SGE on other nodes:
mpiuser@ub1:~$ sudo apt-get install gridengine-client gridengine-exec
# The CELL name is set the same as that of the master node.
Set SGE_ROOT and SGE_CELL
Set SGE_ROOT and SGE_CELL environment variables:
$SGE_ROOT refers to the installation path of SGE.
$SGE_CELL is the cell name, which is 'default' on our machines.
Edit /etc/profile and /etc/bash.bashrc and add the following two lines:
export SGE_ROOT=/var/lib/gridengine   # this is the path on our machines
export SGE_CELL=default
Source the script:
source /etc/profile
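The edit above can also be scripted. This is a minimal sketch that appends the two export lines to a file only when they are missing, so it is safe to run twice. The helper name add_sge_env is made up for illustration, and the demo writes to a scratch file instead of the real /etc/profile (which needs root).

```shell
#!/bin/bash
# add_sge_env is a hypothetical helper, not part of SGE.
# It appends the two export lines from this walkthrough to the given file,
# skipping any line that is already present.
add_sge_env() {
    local target="$1"
    local line
    for line in 'export SGE_ROOT=/var/lib/gridengine' 'export SGE_CELL=default'; do
        # -x whole-line match, -F fixed string, -q quiet
        grep -qxF -- "$line" "$target" 2>/dev/null || printf '%s\n' "$line" >> "$target"
    done
}

# Demonstration against a scratch file rather than /etc/profile.
demo_profile="$(mktemp)"
add_sge_env "$demo_profile"
add_sge_env "$demo_profile"   # second call adds nothing
cat "$demo_profile"
```

On a real node the same function would be called as root with /etc/profile and /etc/bash.bashrc as targets.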
Configure SGE with qmon
(This section is modified from a note by Junjun Mao.)
Invoke qmon as superuser:
mpiuser@ub0:~$ sudo qmon
# On our machine, qmon failed to start due to missing fonts '-adobe-helvetica-…'
# To solve the fonts problem:
mpiuser@ub0:~$ sudo apt-get install xfs xfstt
mpiuser@ub0:~$ sudo apt-get install t1-xfree86-nonfree ttf-xfree86-nonfree ttf-xfree86-nonfree-syriac xfonts-75dpi xfonts-100dpi
mpiuser@ub0:~$ sudo reboot
# After the reboot, the problem is gone.
Configure hosts
"Host Configuration" -> "Administration Host" -> Add master node and other administrative nodes
"Host Configuration" -> "Submit Host" -> Add master node and other submit nodes
"Host Configuration" -> "Execution Host" -> Add slave nodes -> Click on "Done" to finish
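For nodes without an X display, the administration and submit host steps can probably be done with qconf instead of qmon. The sketch below only prints the commands by default. The run wrapper and the function name are made up for illustration, and ub0 is the master node name used elsewhere in these notes. Execution hosts are left to qmon as described above.

```shell
#!/bin/bash
# DRY_RUN=1 (the default) prints each command; set DRY_RUN=0 on the master
# node to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# run is a hypothetical wrapper used only in this sketch.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Register one host as both administration host and submit host.
add_admin_and_submit_host() {
    run qconf -ah "$1"   # add administration host
    run qconf -as "$1"   # add submit host
}

add_admin_and_submit_host ub0
run qconf -sh    # show administration hosts
run qconf -ss    # show submit hosts
```

Keeping the dry run as the default makes it easy to review the commands before anything on the cluster changes.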
Configure the user
Add or delete users that are allowed to access SGE here. In this example, a user is added to an existing group, and this group is later allowed to submit jobs. Everything else is left at its default value.
"User Configuration" -> "Userset" -> Highlight userset "arusers" and click on "Modify" -> Enter the user name in the "User/Group" field -> Click "Done" to finish
Configure the queue
While Host Configuration deals with what computing resources are available and User Configuration defines who has access to those resources, Queue Control defines how hosts and users are connected.
Queue Control
"Queue Control" -> "Hosts" -> Confirm that the execution hosts show up there.
"Queue Control" -> "Cluster Queues" -> Click on "Add" -> Name the queue and add execution nodes to the Hostlist; "User Access" -> allow access to user group arusers; "General Configuration" -> Field "Slots" -> Raise the number to the total number of CPU cores on the slave nodes (it is OK to use a number bigger than the actual number of CPU cores).
"Queue Control" -> "Queue Instances" -> This is the place to manually assign hosts to queues and to control the state (active, suspend ...) of hosts.
Configure parallel environment
"Queue Control" -> "Cluster Queues" -> Select a queue that will run parallel jobs -> Click on "Modify" -> "Parallel Environment" -> Click on the "PE" icon below the right and left arrows -> Click on "Add" -> Name the PE and set:
slots = 999
start_proc_args = $SGE_ROOT/mpi/startmpi.sh $pe_hostfile
stop_proc_args = $SGE_ROOT/mpi/stopmpi.sh
allocation_rule = $fill_up
Tick the "Control slaves" checkbox.
Make sure the configured PE is moved from "Available PE" to "Referenced PE".
Confirm and close all config windows, then open "Queue Control" -> "Cluster Queues" -> "Parallel Environment" again; the named PE should show up.
Once created and linked to a queue, the PE can also be edited from "Queue Control" -> "PE".
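The same PE settings can be kept in a file, which is handy for rebuilding the cluster later. This is a sketch under several assumptions: the PE name mpi_pe is made up, the queue name is a placeholder, and the fields not named in these notes (user_lists, xuser_lists, job_is_first_task, urgency_slots, accounting_summary) are set to what I believe are the usual template values. Compare against the output of qconf -sp on a working PE before relying on it.

```shell
#!/bin/bash
# Path from this walkthrough; overridden by the environment if already set.
SGE_ROOT="${SGE_ROOT:-/var/lib/gridengine}"
pe_name="mpi_pe"            # hypothetical PE name
pe_file="$(mktemp)"

# \$pe_hostfile and \$fill_up are escaped so the shell leaves them for SGE.
cat > "$pe_file" <<EOF
pe_name            $pe_name
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    $SGE_ROOT/mpi/startmpi.sh \$pe_hostfile
stop_proc_args     $SGE_ROOT/mpi/stopmpi.sh
allocation_rule    \$fill_up
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
EOF

cat "$pe_file"

# On the master node (not executed by this sketch):
#   qconf -Ap "$pe_file"                               # add the PE from the file
#   qconf -aattr queue pe_list "$pe_name" QUEUE_NAME   # reference it from a queue
#   qconf -sp "$pe_name"                               # show the stored PE
```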
Check whether sge hosts are running properly
mpiuser@ub0:~$ qhost   # it should list the system info from all nodes
mpiuser@ub0:~$ qconf -sel   # it should list the hostnames of nodes
mpiuser@ub0:~$ qconf -sql   # it should list the queues
mpiuser@ub0:~$ ps aux | grep sge_qmaster | grep -v grep   # check master daemon
mpiuser@ub0:~$ ps aux | grep sge_execd | grep -v grep   # check execute daemon
mpiuser@ub1:~$ ps aux | grep sge_execd | grep -v grep   # check execute daemon
# If the sge_qmaster or sge_execd daemon is not running, try starting it as a service:
# mpiuser@ub1:~$ sudo service gridengine-master start
# mpiuser@ub1:~$ sudo service gridengine-exec start
# …
# Reboot the node(s) if sge_qmaster or sge_execd fails to start.
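The ps | grep | grep -v grep checks can be shortened with pgrep, which matches on the process name and never reports itself. This is a small sketch; the function name daemon_running is made up, and it assumes pgrep (from procps) is installed, which is the case on stock Ubuntu.

```shell
#!/bin/bash
# Return 0 if a process with exactly this name is running, non-zero otherwise.
daemon_running() {
    pgrep -x "$1" > /dev/null 2>&1
}

# Report the state of both SGE daemons on the current node.
for daemon in sge_qmaster sge_execd; do
    if daemon_running "$daemon"; then
        echo "$daemon: running"
    else
        echo "$daemon: not running"
    fi
done
```

On a pure execution node, sge_qmaster is expected to show as not running.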
Run a test script
Make a script named 'test' with content:
#!/bin/bash
### Request Bourne shell as shell for job
#$ -S /bin/bash
### Use current directory as working directory
#$ -cwd
### Name the job:
#$ -N test
echo "Running environment:"
env
echo "============================="
### end of script
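It can save a failed submission to create the script from the command line and run it once locally first, since the '#$' lines are plain comments to bash. This is a sketch; the scratch directory is only there so the demo does not touch the current directory.

```shell
#!/bin/bash
# Write the job script into a scratch directory.
workdir="$(mktemp -d)"
script="$workdir/test"

# The quoted 'EOF' keeps the shell from expanding anything inside the script.
cat > "$script" <<'EOF'
#!/bin/bash
### Request Bourne shell as shell for job
#$ -S /bin/bash
### Use current directory as working directory (the option is lowercase)
#$ -cwd
### Name the job:
#$ -N test
echo "Running environment:"
env
echo "============================="
### end of script
EOF

# Syntax check, then a local dry run outside of SGE.
bash -n "$script" && echo "syntax OK"
bash "$script" | head -n 1
```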
Job Submission
To submit the job:
qsub test
# A job ID is returned if successful.
Query the job status:
qstat
If the job runs successfully, two output files are produced in the current working directory, named test.oXXX (the standard output) and test.eXXX (the standard error), where test is the job name and XXX is the job ID.
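To find the output files without reading the job ID off the screen, the ID can be pulled out of the qsub confirmation line. The sketch assumes the usual message form 'Your job N ("name") has been submitted'. Both function names are made up, and the qsub call is skipped on machines where SGE is not installed.

```shell
#!/bin/bash
# Read qsub's confirmation message on stdin and print only the job ID.
parse_job_id() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Print the stdout and stderr file names for a given job name and job ID.
output_files() {
    printf '%s.o%s\n%s.e%s\n' "$1" "$2" "$1" "$2"
}

# Only talk to the scheduler when qsub is actually available.
if command -v qsub > /dev/null 2>&1; then
    job_id="$(qsub test | parse_job_id)"
    qstat
    output_files test "$job_id"
fi
```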
Always check your logs
Check log messages if an error occurs:
mpiuser@ub0:~$ less /var/spool/gridengine/qmaster/messages   # master node
mpiuser@ub0:~$ less /var/spool/gridengine/execd/ub0/messages   # exec node
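When only the most recent entries matter, a small helper can print the tail of each log and say so when a log is not present on the node. This is a sketch; the function name tail_log is made up, and the paths and the host name ub0 are the ones from our machines.

```shell
#!/bin/bash
# Print the last N lines (default 20) of a log file, or report that it is missing.
tail_log() {
    local file="$1"
    local lines="${2:-20}"
    if [ -r "$file" ]; then
        echo "== $file (last $lines lines) =="
        tail -n "$lines" "$file"
    else
        echo "== $file is not readable on this host =="
        return 1
    fi
}

exec_host="${1:-ub0}"   # name of the execd spool directory; ub0 on our master
tail_log /var/spool/gridengine/qmaster/messages || true
tail_log "/var/spool/gridengine/execd/$exec_host/messages" || true
```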
Possible Errors
Question: My output file has a warning:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Answer: This warning appears if you are using tcsh or csh as the shell for submitting the job. It is safe to ignore. Alternatively, you can use qsub -S /bin/bash to run your program in a different shell, or add the line '#$ -S /bin/bash' to the job script.
Possible Errors
Question: Master host failed to respond properly. The error message is:
error: commlib error: access denied (client IP resolved to host name 'ub0…'. This is not identical to clients host name 'ub0')
error: unable to contact qmaster using port 6444 on host 'ub0'
Answer: Reboot the master node or install SGE from source code on the master node (solutions not confirmed yet). It could also be that the gethostname utility (full path '/usr/lib/gridengine/gethostname' on our machines) returns a different hostname from the one returned by the command 'hostname -f'. If this is the case (e.g., a host with multiple network interfaces), create a file named 'host_aliases' under '$SGE_ROOT/$SGE_CELL/common' and populate it as follows:
# cat host_aliases
ub0 ub0.my.com ub0-grid
ub1 ub1.my.com ub1-grid
ub2 ub2.my.com ub2-grid
ub3 ub3.my.com ub3-grid
Then restart the gridengine daemon (see the man page of sge_host_aliases for details). Check the aliases:
mpiuser@ub0:~$ /usr/lib/gridengine/gethostname -aname ub0-grid
mpiuser@ub0:~$ /usr/lib/gridengine/gethostname -aname ub0
# Both of them should return ub0.
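The host_aliases file maps every name on a line to the first name on that line. The sketch below imitates that lookup with awk so the file can be checked before the daemons are restarted. It is an illustration only, not how SGE resolves names internally: the function name canonical_host is made up, and the demo uses a scratch copy of the four lines shown above.

```shell
#!/bin/bash
# Print the first name on the line of the aliases file that contains the given
# name. If no line contains it, print the name unchanged.
canonical_host() {
    awk -v name="$2" '
        /^#/ { next }                                   # skip comment lines
        {
            for (i = 1; i <= NF; i++)
                if ($i == name) { print $1; found = 1; exit }
        }
        END { if (!found) print name }
    ' "$1"
}

# Scratch copy of the host_aliases content from this walkthrough.
aliases="$(mktemp)"
cat > "$aliases" <<'EOF'
ub0 ub0.my.com ub0-grid
ub1 ub1.my.com ub1-grid
ub2 ub2.my.com ub2-grid
ub3 ub3.my.com ub3-grid
EOF

canonical_host "$aliases" ub0-grid
canonical_host "$aliases" ub0
```

The real check is still the gethostname -aname command, since it exercises the file that the daemons actually read.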
Sources
http://manpages.ubuntu.com/manpages//jaunty/man5/sge_conf.5.html
http://webappl.blogspot.com/2011/05/install-sun-grid-engine-sge-on-ubuntu.html
http://pka.engr.ccny.cuny.edu/~jmao/node/49
http://webappl.blogspot.com/2011/05/setting-up-mpich2-cluster-with-ubuntu.html