Working with Oracle/Sun Grid Engine

Presentation for Highline CIS 210 working with local hadoop clusters using Oracle/Sun grid engine

Page 1: Working with Oracle/Sun Grid Engine

WORKING WITH SUN/ORACLE GRID ENGINE IN UBUNTU 12.10

CIS 210 February 2013

Page 2: Working with Oracle/Sun Grid Engine

What is “Grid Engine”?

Page 3: Working with Oracle/Sun Grid Engine

Sun/Oracle Grid Engine is:

A quick and easy way to set up a multi-cluster system using existing hardware.

Oracle Grid Engine is the most widely deployed workload management solution in the industry and offers unmatched scalability. On top of a rich set of advanced scheduling capabilities and the flexibility to adapt to any computing environment and application workload, Oracle Grid Engine offers comprehensive support for the cloud computing model.

Page 4: Working with Oracle/Sun Grid Engine

How to Install

Via webappl.blogspot.com: http://webappl.blogspot.com/2011/05/install-sun-grid-engine-sge-on-ubuntu.html

Page 5: Working with Oracle/Sun Grid Engine

Install SGE on master node:

mpiuser@ub0:~$ sudo apt-get install gridengine-client gridengine-common gridengine-master gridengine-qmon gridengine-exec
# Remove gridengine-exec from the list if the master node is not supposed to run jobs.
# During the installation, we need to set the cluster CELL name (such as 'default').

Page 6: Working with Oracle/Sun Grid Engine

Install SGE on other nodes:

mpiuser@ub1:~$ sudo apt-get install gridengine-client gridengine-exec

The CELL name must be set to the same value as on the master node.

Page 7: Working with Oracle/Sun Grid Engine

Set SGE_ROOT and SGE_CELL

Set the SGE_ROOT and SGE_CELL environment variables:

$SGE_ROOT refers to the installation path of SGE.
$SGE_CELL is the cell name, which is 'default' on our machines.

Edit /etc/profile and /etc/bash.bashrc and add the following two lines:

export SGE_ROOT=/var/lib/gridengine   # this is the path on our machines
export SGE_CELL=default

Source the script:

source /etc/profile
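The same two exports can also be added from a script rather than by hand. The following is only a sketch: add_line_once is a hypothetical helper (not part of Grid Engine) that appends a line to a file only when that exact line is not already there, so running it twice does not duplicate the entry. The path and cell name are the ones from our machines.

```shell
#!/bin/bash
# add_line_once is a hypothetical helper, not part of Grid Engine.
# It appends LINE to FILE only if FILE does not already contain that exact line.
add_line_once() {
    local line="$1" file="$2"
    touch "$file" || return 1
    # -x matches whole lines, -F treats the line as a fixed string, not a regex
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example (run as root; values are the ones used on our machines):
#   add_line_once 'export SGE_ROOT=/var/lib/gridengine' /etc/profile
#   add_line_once 'export SGE_CELL=default' /etc/profile
#   add_line_once 'export SGE_ROOT=/var/lib/gridengine' /etc/bash.bashrc
#   add_line_once 'export SGE_CELL=default' /etc/bash.bashrc
```

After running it, source /etc/profile (or log in again) so the variables take effect in the current shell.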

Page 8: Working with Oracle/Sun Grid Engine

Configure SGE with qmon

(This section is modified from a note by Junjun Mao.)

Invoke qmon as superuser:

mpiuser@ub0:~$ sudo qmon
# On our machine, qmon failed to start due to missing fonts '-adobe-helvetica-…'.
# To solve the fonts problem:
mpiuser@ub0:~$ sudo apt-get install xfs xfstt
mpiuser@ub0:~$ sudo apt-get install t1-xfree86-nonfree ttf-xfree86-nonfree ttf-xfree86-nonfree-syriac xfonts-75dpi xfonts-100dpi
mpiuser@ub0:~$ sudo reboot
# After the reboot, the problem is gone.

Page 9: Working with Oracle/Sun Grid Engine

Configure hosts

"Host Configuration" -> "Administration Host" -> Add master node and other administrative nodes
"Host Configuration" -> "Submit Host" -> Add master node and other submit nodes
"Host Configuration" -> "Execution Host" -> Add slave nodes
-> Click on "Done" to finish

Page 10: Working with Oracle/Sun Grid Engine

Configure the user

Add or delete users that are allowed to access SGE here. In this example, a user is added to an existing group, and later this group will be allowed to submit jobs. Everything else is left at its default value.

"User Configuration" -> "Userset" -> Highlight userset "arusers" and click on "Modify" -> Input user name in "User/Group" field
-> Click "Done" to finish

Page 11: Working with Oracle/Sun Grid Engine

Configure the queue

While Host Configuration deals with what computing resources are available and User Configuration defines who has access to those resources, Queue Control defines how hosts and users are connected.

Page 12: Working with Oracle/Sun Grid Engine

Queue Control

"Queue Control" -> "Hosts" -> Confirm the execution hosts show up there.

"Queue Control" -> "Cluster Queues" -> Click on "Add" -> Name the queue and add execution nodes to Hostlist; then:
"User Access" -> allow access to user group arusers;
"General Configuration" -> Field "Slots" -> Raise the number to the total number of CPU cores on the slave nodes (it is OK to use a bigger number than the actual number of CPU cores).

"Queue Control" -> "Queue Instances" -> This is the place to manually assign hosts to queues and to control the state (active, suspend ...) of hosts.

Page 13: Working with Oracle/Sun Grid Engine

Configure parallel environment

"Queue Control" -> "Cluster Queues" -> Select a queue that will run parallel jobs -> Click on "Modify" -> "Parallel Environment" -> Click on icon "PE" below the right and left arrows -> Click on "Add" -> Name the PE and set:

slots = 999
start_proc_args = $SGE_ROOT/mpi/startmpi.sh $pe_hostfile
stop_proc_args = $SGE_ROOT/mpi/stopmpi.sh
allocation_rule = $fill_up
"Control slaves" = checked

Make sure the configured PE is moved from "Available PE" to "Referenced PE".

Confirm and close all config windows, then open "Queue Control" -> "Cluster Queues" -> "Parallel Environment" again; the named PE should show up.

Once created and linked to a queue, the PE can also be edited from "Queue Control" -> "PE".
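To see what a job receives from a parallel environment, a small job script can print its allocation. This is a sketch only: the PE name "mpi" and the slot count 4 are placeholders (use the name you gave your PE in qmon), and show_allocation is a hypothetical helper. NSLOTS and PE_HOSTFILE are environment variables that SGE sets inside a parallel job; the script assumes the host file lists the host name first and the slot count second on each line.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -N pe_test
# "mpi" below is a placeholder PE name; replace it with your own PE name.
#$ -pe mpi 4

# show_allocation is a hypothetical helper, not part of Grid Engine.
# It prints the slot count and the per-host allocation granted to this job.
# Outside SGE, NSLOTS and PE_HOSTFILE are unset and a notice is printed instead.
show_allocation() {
    echo "Slots granted: ${NSLOTS:-unknown}"
    if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
        # Assumed host file layout: host name first, slot count second.
        awk '{ print $1, $2 }' "$PE_HOSTFILE"
    else
        echo "No PE host file available"
    fi
}

show_allocation
```

Submitted with qsub, the job's standard output file should then show how the requested slots were spread across the execution hosts under the $fill_up rule.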

Page 14: Working with Oracle/Sun Grid Engine

Check whether SGE hosts are running properly

mpiuser@ub0:~$ qhost   # it should list the system info from all nodes
mpiuser@ub0:~$ qconf -sel   # it should list the hostnames of nodes
mpiuser@ub0:~$ qconf -sql   # it should list the queues
mpiuser@ub0:~$ ps aux | grep sge_qmaster | grep -v grep   # check master daemon
mpiuser@ub0:~$ ps aux | grep sge_execd | grep -v grep   # check execute daemon
mpiuser@ub1:~$ ps aux | grep sge_execd | grep -v grep   # check execute daemon

# If the sge_qmaster or sge_execd daemon is not running, try starting it as a service:
# mpiuser@ub1:~$ sudo service gridengine-master start
# mpiuser@ub1:~$ sudo service gridengine-exec start
# …
# Reboot the node(s) if sge_qmaster or sge_execd fails to start.

Page 15: Working with Oracle/Sun Grid Engine

Run a test script

Make a script named 'test' with content:

#!/bin/bash
#
# Request bash as the shell for the job
#$ -S /bin/bash
#
# Use current directory as working directory
#$ -cwd
#
# Name the job:
#$ -N test
echo "Running environment:"
env
echo "============================="
# end of script

Page 16: Working with Oracle/Sun Grid Engine

Job Submission

To submit the job:

qsub test
# A job id is returned if successful.

Query the job status:

qstat
# If the job runs successfully, two output files are produced in the current working directory, named test.oXXX (the standard output) and test.eXXX (the standard error), where test is the job name and XXX is the job id.
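The job id and the output file names can be picked up in a script. The sketch below uses two hypothetical helpers (neither is part of Grid Engine) and assumes qsub confirms a submission with a line of the form: Your job 42 ("test") has been submitted.

```shell
#!/bin/bash
# job_id_from_qsub is a hypothetical helper, not part of Grid Engine.
# It reads qsub's confirmation message on stdin and prints the numeric job id.
# Assumed message format: Your job 42 ("test") has been submitted
job_id_from_qsub() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# output_files is a hypothetical helper that prints the two file names SGE
# uses for a job: <name>.o<id> (standard output) and <name>.e<id> (standard error).
output_files() {
    local name="$1" id="$2"
    printf '%s.o%s\n%s.e%s\n' "$name" "$id" "$name" "$id"
}

# Example on a machine with SGE installed:
#   id=$(qsub test | job_id_from_qsub)
#   qstat
#   output_files test "$id"
```

If the message format on your installation differs, the sed pattern needs to be adjusted accordingly.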

Page 17: Working with Oracle/Sun Grid Engine

Always check your logs

Check the log messages if an error occurs:

mpiuser@ub0:~$ less /var/spool/gridengine/qmaster/messages   # master node
mpiuser@ub0:~$ less /var/spool/gridengine/execd/ub0/messages   # exec node
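For long log files it can help to show only the error entries. The sketch below rests on an assumption about the log layout: each line of a messages file is taken to be pipe-separated, with a single-letter severity in the fourth field (for example I, W, E or C). sge_errors is a hypothetical helper, not an SGE command; check a few lines of your own log before relying on it.

```shell
#!/bin/bash
# sge_errors is a hypothetical helper, not part of Grid Engine.
# Assumption: each line of a messages file has pipe-separated fields
#   date time|thread|host|level|text
# where level is a single letter such as I, W, E or C.
# The function prints only the lines whose level is E or C.
sge_errors() {
    awk -F'|' '$4 == "E" || $4 == "C"' "$1"
}

# Example:
#   sge_errors /var/spool/gridengine/qmaster/messages
#   sge_errors /var/spool/gridengine/execd/ub0/messages
```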

Page 18: Working with Oracle/Sun Grid Engine

Possible Errors

Question: My output file has a warning:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.

Answer: This warning appears if you are using tcsh or csh as the shell for submitting the job. It is safe to ignore. Alternatively, you can use qsub -S /bin/bash to run your program in a different shell, or add the line '#$ -S /bin/bash' to the job script.

Page 19: Working with Oracle/Sun Grid Engine

Possible Errors

Question: Master host failed to respond properly. The error message is:

error: commlib error: access denied (client IP resolved to host name 'ub0…'. This is not identical to clients host name 'ub0')
error: unable to contact qmaster using port 6444 on host 'ub0'

Answer: Reboot the master node or install SGE from source code on the master node (solutions not confirmed yet). It could also be that the gethostname utility (full path '/usr/lib/gridengine/gethostname' on our machines) returns a different hostname from the one returned by 'hostname -f'. If this is the case (e.g., a host with multiple network interfaces), create a file named 'host_aliases' under '$SGE_ROOT/$SGE_CELL/common' and populate it as follows:

# cat host_aliases
ub0 ub0.my.com ub0-grid
ub1 ub1.my.com ub1-grid
ub2 ub2.my.com ub2-grid
ub3 ub3.my.com ub3-grid

Then restart the gridengine daemon (see the man page of sge_host_aliases for details). Check the aliases:

mpiuser@ub0:~$ /usr/lib/gridengine/gethostname -aname ub0-grid
mpiuser@ub0:~$ /usr/lib/gridengine/gethostname -aname ub0
# Both of them should return ub0.
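The host_aliases file can be sanity-checked before restarting the daemons. The sketch below uses a hypothetical helper, resolve_alias (not part of Grid Engine), which mimics the mapping described above: every name on a line resolves to the first name on that line. It only reads the file and does not contact SGE.

```shell
#!/bin/bash
# resolve_alias is a hypothetical helper, not part of Grid Engine.
# Given a host_aliases file and a host name, it prints the first name on the
# line that contains the given name. Names without an entry are printed unchanged.
resolve_alias() {
    local file="$1" name="$2"
    awk -v n="$name" '
        /^[ \t]*#/ { next }    # skip comment lines
        {
            for (i = 1; i <= NF; i++)
                if ($i == n) { print $1; found = 1; exit }
        }
        END { if (!found) print n }
    ' "$file"
}

# Example:
#   resolve_alias "$SGE_ROOT/$SGE_CELL/common/host_aliases" ub0-grid
```

With the file shown above, both ub0-grid and ub0.my.com should resolve to ub0, matching what the gethostname -aname check is expected to return.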