Introduction to Linux Clusters Clare Din SAS Computing University of Pennsylvania March 15, 2004.
Introduction to Linux Clusters

Clare Din
SAS Computing
University of Pennsylvania
March 15, 2004
Cluster Components

Hardware
- Nodes: compute nodes, admin node, I/O node, login node
- Disk array: RAID5, SCSI 320, 10k+ RPM, TB+ capacity, NFS-mounted from the I/O node
- Networking gear: Myrinet, gigE, or 10/100; switches, cables, networking cards
- Backup device: AIT3, DLT, or LTO; N-slot cartridge drive; SAN
- Admin front end: console (keyboard, monitor, mouse), KVM switches, KVM cables
- UPS: APC SmartUPS 3000, 3 per 42U rack
- Rack units: 42U, standard or deep

Software
- Operating system: Red Hat 9+ Linux, Debian, SUSE, Mandrake, FreeBSD, and others
- MPI: MPICH, LAM/MPI, MPI-GM, MPI Pro
- Compilers: gnu, Portland Group, Intel
- Scheduler: OpenPBS, PBS Pro, Maui
Filesystem Requirements
- Use a journalled filesystem: reboots happen more quickly after a crash
- There is a slight performance hit for this feature
- ext3 is a popular choice (the old ext2 was not journalled)
Space and Power Requirements

Space
- A standard 42U rack is about 24"W x 80"H x 40"D
- Blade units give you more than 1 node per 1U of space in a deeper rack
- Plan for cable management inside the rack
- Consider overhead or raised-floor cabling for the external cables

Power
- A 67-node Xeon cluster consumes 19,872 W, which takes 5.65 tons of A/C to keep cool
- Ideally, each UPS plug should connect to its own circuit
- Clusters (especially blades) run real hot; make sure there is adequate A/C and ventilation
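The cooling figure above follows from the standard conversions (1 W ≈ 3.412 BTU/hr; one "ton" of A/C removes 12,000 BTU/hr). A quick check:

```shell
# Sanity-check the slide's cooling estimate for a 19,872 W cluster:
# watts -> BTU/hr -> tons of A/C.
tons=$(awk 'BEGIN { printf "%.2f", 19872 * 3.412 / 12000 }')
echo "$tons tons of A/C"   # prints "5.65 tons of A/C"
```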
Network Requirements

External network
- One 10 Mbps network line is adequate (all computation and message passing stays within the cluster)

Internal network
- gigE, Myrinet, or some combination
- Base your network gear selection on whether most of your jobs are CPU-bound or I/O-bound
Network Choices Compared
- Fast Ethernet (100BT): 0.1 Gb/s (100 Mb/s) bandwidth; essentially free
- gigE: 0.4 Gb/s to 0.64 Gb/s bandwidth; ~$400 per node
- Myrinet: 1.2 Gb/s to 2.0 Gb/s bandwidth; ~$1000 per node; scales to thousands of nodes; buy fiber instead of copper cables
[Chart: Networking Gear Speeds — bandwidth in Mb/s for Fast Ethernet, gigE, and Myrinet]
I/O Node

Globally accessible filesystem (RAID5 disk array)
- NFS share it
- Put user home directories, apps, and scratch space directories on it so all compute nodes can access them
- Enforce quotas on home directories

Backup device
- Make sure your device and software are compatible with your operating system
- Plan a good backup strategy
- Test how long it takes to bring back a single file or a whole filesystem from backups
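The NFS arrangement described above might look like the following on the I/O node. The subnet, export options, and hostnames are illustrative assumptions, not taken from the slides:

```
# /etc/exports on the I/O node (hypothetical private subnet):
# export /data read-write to the cluster's internal network only.
/data 192.168.1.0/255.255.255.0(rw,sync,root_squash)

# Matching line in each compute node's /etc/fstab, assuming the
# I/O node is reachable as "ionode":
# ionode:/data  /data  nfs  rw,hard,intr  0 0
```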
Admin Node
- Only sysadmins log into this node
- Accessible only from within the cluster
- Runs cluster management software
  - User and quota management
  - Node management: rebuild dead nodes, monitor CPU utilization and network traffic
Compute Nodes

Buy the fastest CPUs and bus speed you can afford
- Don't forget that some software companies license their software per node, so factor in software costs
- Stick with a proven technology over future promise

Memory size of each node depends on the application mix
- 2 GB+ for large calculations
- Under 2 GB for financial databases

Lots of hard disk space is not much of a priority, since the nodes will primarily use shared space on the I/O node
- Disks are cheap nowadays; 40 GB EIDE is standard per node

Choose a CPU architecture you're comfortable with
- Intel: P4, Xeon, Itanium
- AMD: Opteron, Athlon
- Other: G4/G5

Consider that some algorithms require 2^n nodes

32-bit Linux is free or close to free; 64-bit Red Hat Linux costs $1600 per node
Login Node

Users log in here
- ssh or ssh -X
- Cluster designers recommend 1 login node per 64 compute nodes
- Update /etc/profile.d so all users get the same environment when they log in

Only way to get into the cluster
- Static IP address (vs. DHCP addresses on all other cluster nodes)
- Turn on the built-in firewall software

Compile code
- Compiler licenses should be purchased for this node only; don't pay for more than you need (2 licenses might be sufficient for a department's code compilation)

Job control (using a scheduler)
- Choice of queues to access subsets of resources
- Submit, delete, and terminate jobs; check on job status
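One way to give every user the same environment, as suggested above, is a site-wide file under /etc/profile.d. The filename and paths below are illustrative assumptions, not the cluster's real layout:

```shell
# Hypothetical /etc/profile.d/cluster.sh -- sourced by every login
# shell, so all users pick up the same MPI and compiler paths.
export PATH=/usr/local/mpich/bin:/usr/local/pgi/bin:$PATH
export MANPATH=/usr/local/mpich/man:$MANPATH
```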
Spare Nodes
- Offline nodes that are put into service when an existing node dies
- Use for spare parts
- Use as a testing environment
Cluster Install Software
- Designed to make cluster installation easier (the "cluster in a box" concept)
- Shortens the install process through automated steps
- Decreases the chance of user error
- Choices: OSCAR, Felix, IBM xCAT, IBM CSM
Cluster Management Software
- Run parallel commands via a GUI, or write Perl scripts for command-line control
- Install new nodes, rebuild corrupted nodes
- Check on the status of hardware (nodes, network connections): Ganglia, xpbsmon, Myrinet tests (gm_board_info)
- xpbsmon shows running jobs that were submitted via the scheduler
Cluster Consistency
- Rsync or rdist the /etc/passwd, shadow, gshadow, and group files from the login node to the compute nodes
- Also consider rsync'ing (automatically or manually) the /etc/profile.d files, PBS config files, /etc/fstab, etc.
Local and Remote Management

Local management
- GUI desktop from the console monitor
- KVM switches to access each node

Remote management
- Console switch: ssh in and see what's on the console monitor screen from your remote desktop
- Web-based tools: Ganglia (ganglia.sourceforge.net), Netsaint (www.netsaint.org), Big Brother (www.bb4.com)
Ganglia
- Tool for monitoring clusters of up to 2000 nodes
- Used on over 500 clusters worldwide
- Available for multiple OSes and CPU architectures

To view it from a desktop:

# ssh -X coffee.chem.upenn.edu
# ssh coffeeadmin
# mozilla &

Then open http://coffeeadmin/ganglia
- The web page periodically auto-refreshes
Scheduling Software (PBS)
- Set up queues for different groups of users based on resource needs (i.e., not everyone needs Myrinet; some users only need 1 node)
- The world does not end if one node goes down; the scheduler will run the job on another node
- Make sure pbs_server and pbs_sched are running on the login node
- Make sure pbs_mom is running on all compute nodes, but not on the login, admin, or I/O nodes
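A quick way to check the daemon placement above is to look for the standard PBS process names on each host; which ones you expect depends on the node's role (pbs_server and pbs_sched on the login node, pbs_mom on compute nodes). A minimal sketch:

```shell
# List any PBS daemons running on this host.
daemons=$(ps -e | grep -E 'pbs_(server|sched|mom)' || true)
if [ -n "$daemons" ]; then
  echo "$daemons"
else
  echo "no PBS daemons found"
fi
```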
Scheduling Software

OpenPBS
- Limit users by number of jobs
- Good support via message boards
- *** FREE ***

PBS Pro
- The "pro" version of OpenPBS
- Limit by nodes, not just jobs per user
- Must pay for support ($25 per CPU, or $3200 for a 128-CPU cluster)

Others
- Load Share Facility, Codine, Maui
MPI Software
- MPICH (Argonne National Labs)
- LAM/MPI (OSC/Univ. of Notre Dame)
- MPI-GM (Myricom)
- MPI Pro (MSTi Software)
  - Programmed by one of the original developers of MPICH
  - Claims to be 20% faster than MPICH
  - Costs $1200 plus support per year
Compilers and Libraries

Compilers
- gcc/g77 (www.gnu.org/software)
- Portland Group (www.pgroup.com)
- Intel (www.developer.intel.com)

Libraries
- BLAS
- ATLAS, a portable BLAS (www.math-atlas.sourceforge.net)
- LAPACK
- ScaLAPACK, an MPI-based LAPACK
- FFTW, Fast Fourier Transform (www.fftw.org)
- many, many more
Cluster Security
- Securing and patching your Linux cluster is much like securing and patching your Linux desktop
- Keep an eye out for the latest patches
- Install a patch only if necessary, and try it on a test machine first
- Make sure there's a way to back out of a patch before installing it
Cluster Security
- Get rid of unneeded software
- Limit who installs and what gets installed
- Close unused ports and services
- Limit login service between the login node and the outside world to ssh
- Use ssh to tunnel X connections safely
- Limit access using hosts.allow/hosts.deny
- Use scp and sftp for secure file transfer
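The hosts.allow/hosts.deny restriction mentioned above is tcp_wrappers configuration. A default-deny sketch — the address prefix is an illustrative assumption, not the cluster's real network:

```
# /etc/hosts.deny -- deny everything not explicitly allowed
ALL: ALL

# /etc/hosts.allow -- then permit ssh only from a trusted subnet
# (example prefix; a trailing dot matches the whole network)
sshd: 128.91.
```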
Cluster Security
- Carefully configure NFS
- Upgrade to the latest, safest Samba version, if used
- Disable Apache if not needed
- Turn on the built-in Linux firewall software
Troubleshooting
- Make sure the core cluster services are running: scheduler, MPI, NFS, cluster managers
- Make sure software licenses are up to date
- Scan logs for break-in attempts
- Keep a written journal of all patches, installs, and upgrades
Troubleshooting
- Sometimes a reboot will fix the problem
- If you reboot the login node where the scheduler is running, be sure the scheduler is started after the reboot
  - Any jobs in the queues will be flushed
- Hard-rebooting hardware, such as tape drives, usually fixes the problem
Troubleshooting
- Reboot order: I/O node, login node, admin node, compute nodes (i.e., master nodes first, then slave nodes)
- Rebuilding a node takes 30 minutes with the cluster manager; reconfiguring it may take an hour more
Vendor ChoicesVendor Choices
DellDell IBMIBM Western ScientificWestern Scientific Aspen SystemsAspen Systems RacksaverRacksaver eRackseRacks Penguin ComputingPenguin Computing Many, many othersMany, many others
Go with a proven vendorGo with a proven vendor Get every vendor to spec Get every vendor to spec
out the same hardware out the same hardware and software before you and software before you compare pricescompare prices
Compare service Compare service agreementsagreements
How fast can they deliver How fast can they deliver a working cluster?a working cluster?
Buying Commercial Software
- Is it worth the money?
- Is it proven software?
- Are all the bells and whistles really necessary?
- Paid software does not necessarily have the best support
Cluster Tips
- Keep all sysadmin scripts in an easily accessible place, e.g. /4sysadmin or /usr/local/4sysadmin
- Force everyone to use the scheduler to run their jobs (even uniprocessor jobs)
  - Police it; don't let users get away with things
  - Wrapping some applications into a scheduler script can be tricky
Cluster Upgrades
- Nodes become obsolete in 2 to 3 years
- Upgrade banks of nodes at a time
- If upgrading to a new CPU, check for compatibility problems and new A/C requirements
- Upgrading memory and disk space is easy but tedious
- Upgrading the OS can be a major task; even installing patches can be a major task
Common Sense Cluster Administration
- Plan a little before you do anything
- Keep a journal of everything you do
- Create procedures that are easy to follow in times of stress
- Document everything!
- Test software before announcing it
- Educate and "radiate" your cluster knowledge to your support team
coffee.chem
- Funded by 6 P.I.'s in Chemistry
- Located in FBA121 next to A/C3
- 69 dual-CPU node cluster: 64 compute nodes, 1 login node, 1 admin node, 1 I/O node, 1 backup node, 1 firewall node
- Myrinet on 32 compute nodes, gigE on the other 32
- 2 TB RAID5 array (1.7 TB formatted)
- 12-slot, 4.8 TB capacity LTO tape drive
- 2U fold-out console with LCD monitor, keyboard, and trackpad
- 5 daisy-chained KVM switches
- 9 APC 3000 UPS units, each connected to its own circuit
- 3 42U racks
- Red Hat 9
- Felix cluster install and management software
- PBS Pro
- MPICH, LAM/MPI, MPICH-GM
- gnu and Portland Group compilers
- BLAS, ScaLAPACK, ATLAS libraries
- Gaussian98 (Gaussian03 + Linda soon)
- /data on the I/O node (coffeecompute00) holds common apps and user home directories
- Admin node (coffeeadmin) runs the Felix cluster manager
- Compute nodes are coffeecompute01..64
- Every node in the cluster can access /data via NFS
- You can ssh into the compute nodes, admin node, and I/O node only via the login node
- Backup node (javabean) temporarily has our backup device attached (we use tar right now)
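Since the backups here are plain tar, restoring a single file is just a selective extract. A local sketch with throwaway paths (not the cluster's real layout):

```shell
# Create a tiny "home directory", back it up with tar, delete a file,
# then restore only that file from the archive.
mkdir -p /tmp/tar-demo/home/user1
echo "results" > /tmp/tar-demo/home/user1/data.txt
tar -czf /tmp/tar-demo/backup.tar.gz -C /tmp/tar-demo home
rm /tmp/tar-demo/home/user1/data.txt
# Selective restore: name just the member you want back.
tar -xzf /tmp/tar-demo/backup.tar.gz -C /tmp/tar-demo home/user1/data.txt
cat /tmp/tar-demo/home/user1/data.txt   # prints "results"
```

Timing this kind of restore on real data is a cheap way to test the "ETA of bringing back a single file" mentioned earlier.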
Logging Into coffee.chem
- Everyone in this room will have user accounts on coffee.chem and home directories in /data/staff
- Our existence on the system is for Chemistry's benefit
- Support scripts are found in /4sysadmin
- If a reboot is necessary, make sure that PBS is started (/etc/init.d/pbs start)
Compiling and Running Code

Compiling:

pgCC -Mmpi -o test hello.cpp

- MPICH includes mpicc and mpif77 to compile and link MPI programs; they are scripts that pass the MPI library arguments to cc and f77

Running:

mpirun -np 8 test
mpirun -np XXX -machinefile YYY -nolocal test

- -np = number of processors
- -machinefile = file listing the processors you want to run the job on
- -nolocal = don't run the job locally
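The -machinefile argument points at a plain list of hostnames, one per line. The contents below are an illustrative assumption (the names mirror the cluster's coffeecomputeNN naming, not its real file):

```
# e.g. machines_gige_32.LINUX -- one host per line; in MPICH's ch_p4
# device a host can be suffixed :N to use N processors on that host
coffeecompute33
coffeecompute34
coffeecompute35:2
```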
Submitting a Job

3 queues to choose from:
- coffeeq: general-purpose queue; 12 hours max run time; 16 processors max
- espressoq: higher priority than coffeeq; 3 weeks max run time
- Some may still use piq, but this will go away soon
Submitting a Job

Prepare a scheduler script:

#!/bin/tcsh
#PBS -l arch=linux       {define architecture}
#PBS -l cput=1:00:00     {define CPU time needed}
#PBS -l mem=400mb        {define memory space needed}
#PBS -l nodes=64:ppn=1   {define number of nodes needed}
#PBS -m e                {mail me the results}
#PBS -c c                {minimal checkpointing}
#PBS -k oe               {keep the output and errors}
#PBS -q coffeeq          {run the job on coffeeq}
mpirun -np 8 -machinefile machines_gige_32.LINUX /data/staff/din/newhello

Then qsub the scheduler script.
More PBS Commands
- Check on the status of all submitted jobs with: qstat
- Submit a job with: qsub
- Delete a job with: qdel
- Terminate the execution of a job with: qterm
- See all your available compute node resources with: pbsnodes -a
Node Terms
- Login node = service node = head node = the node users log into
- Master scheduler node = the node where the scheduler runs, usually the login node
- Admin node = the node the sysadmin logs into to gain access to cluster management apps
- Compute node = one of the nodes that perform pieces of a larger computation
- Storage node = the node that has the RAID array or SAN attached to it
- Backup node = the node that has the backup solution attached to it
- I/O node = a node that can combine the features of storage and backup nodes
- Visualization node = a node that contains a graphics card and graphics console; multiple visualization nodes can be combined in a matrix to form a video wall
- Spare node = a node that is not in service but can be rebuilt to take the place of a compute node or, in some cases, an admin or login node
References
- Bookman, Charles. Linux Clustering: Building and Maintaining Linux Clusters. New Riders, Indianapolis, Indiana, 2003.
- Howse, Martin. "Dropping the Bomb: AMD Opteron" in Linux User & Developer, Issue 33, pp. 33-36.
- Robertson, Alan. "Highly-Affordable High Availability" in Linux Magazine, November 2003, pp. 16-21.
- The Seventh LCI Workshop Systems Track Notes. Linux Clusters Institute, March 24-28, 2003.
- Sterling, Thomas et al. How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. The MIT Press, Cambridge, Massachusetts, 1999.
- Vrenios, Alex. Linux Cluster Architecture. Sams Publishing, Indianapolis, Indiana, 2002.
coffee.chem Contact List
- Dell hardware problems: 800-234-1490
- Myrinet problems: [email protected]
- "Very limited" software support: [email protected]
- PGI compiler issues: [email protected]