Introduction to Linux Clusters Clare Din SAS Computing University of Pennsylvania March 15, 2004.
Introduction to Linux Clusters

Clare Din
SAS Computing
University of Pennsylvania
March 15, 2004
Cluster Components

Hardware
- Nodes: compute nodes, admin node, I/O node, login node
- Disk array: RAID5, SCSI 320, 10k+ RPM, TB+ capacity, NFS-mounted from the I/O node
- Networking gear: Myrinet, gigE, or 10/100; switches, cables, networking cards
- Backup device: AIT3, DLT, or LTO; N-slot cartridge drive; SAN
- Admin front end: console (keyboard, monitor, mouse), KVM switches, KVM cables
- UPS: APC SmartUPS 3000, 3 per 42U rack
- Rack units: 42U, standard or deep

Software
- Operating system: Red Hat 9+ Linux, Debian, SUSE, Mandrake, FreeBSD, and others
- MPI: MPICH, LAM/MPI, MPI-GM, MPI Pro
- Compilers: gnu, Portland Group, Intel
- Scheduler: OpenPBS, PBS Pro, Maui
Filesystem Requirements
- Use a journalled filesystem: reboots happen more quickly after a crash
- There is a slight performance hit for this feature
- ext3 is a popular choice (the old ext2 was not journalled)
Space and Power Requirements

Space
- A standard 42U rack is about 24"W x 80"H x 40"D
- Blade units give you more than 1 node per 1U of space in a deeper rack
- Plan for cable management inside the rack
- Consider overhead or raised-floor cabling for the external cables

Power
- A 67-node Xeon cluster consumes 19,872 W, which takes 5.65 tons of A/C to keep cool
- Ideally, each UPS plug should connect to its own circuit
- Clusters (especially blades) run real hot; make sure there is adequate A/C and ventilation
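The cooling figure above follows from the standard conversions (1 W ≈ 3.412 BTU/hr; one "ton" of A/C removes 12,000 BTU/hr). A quick check:

```shell
# Sanity-check the slide's cooling estimate for a 19,872 W cluster:
# watts -> BTU/hr -> tons of A/C.
tons=$(awk 'BEGIN { printf "%.2f", 19872 * 3.412 / 12000 }')
echo "$tons tons of A/C"   # prints "5.65 tons of A/C"
```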
Network Requirements

External network
- One 10 Mbps network line is adequate (all computation and message passing stays within the cluster)

Internal network
- gigE, Myrinet, or some combination
- Base your network gear selection on whether most of your jobs are CPU-bound or I/O-bound
Network Choices Compared
- Fast Ethernet (100BT): 0.1 Gb/s (100 Mb/s) bandwidth; essentially free
- gigE: 0.4 Gb/s to 0.64 Gb/s bandwidth; ~$400 per node
- Myrinet: 1.2 Gb/s to 2.0 Gb/s bandwidth; ~$1000 per node; scales to thousands of nodes; buy fiber instead of copper cables
[Chart: Networking Gear Speeds — bandwidth in Mb/s for Fast Ethernet, gigE, and Myrinet]
I/O Node

Globally accessible filesystem (RAID5 disk array)
- NFS share it
- Put user home directories, apps, and scratch space directories on it so all compute nodes can access them
- Enforce quotas on home directories

Backup device
- Make sure your device and software are compatible with your operating system
- Plan a good backup strategy
- Test how long it takes to bring back a single file or a whole filesystem from backups
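The NFS arrangement described above might look like the following on the I/O node. The subnet, export options, and hostnames are illustrative assumptions, not taken from the slides:

```
# /etc/exports on the I/O node (hypothetical private subnet):
# export /data read-write to the cluster's internal network only.
/data 192.168.1.0/255.255.255.0(rw,sync,root_squash)

# Matching line in each compute node's /etc/fstab, assuming the
# I/O node is reachable as "ionode":
# ionode:/data  /data  nfs  rw,hard,intr  0 0
```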
Admin Node
- Only sysadmins log into this node
- Accessible only from within the cluster
- Runs cluster management software
  - User and quota management
  - Node management: rebuild dead nodes, monitor CPU utilization and network traffic
Compute Nodes

Buy the fastest CPUs and bus speed you can afford
- Don't forget that some software companies license their software per node, so factor in software costs
- Stick with a proven technology over future promise

Memory size of each node depends on the application mix
- 2 GB+ for large calculations
- Under 2 GB for financial databases

Lots of hard disk space is not much of a priority, since the nodes will primarily use shared space on the I/O node
- Disks are cheap nowadays; 40 GB EIDE is standard per node

Choose a CPU architecture you're comfortable with
- Intel: P4, Xeon, Itanium
- AMD: Opteron, Athlon
- Other: G4/G5

Consider that some algorithms require 2^n nodes

32-bit Linux is free or close to free; 64-bit Red Hat Linux costs $1600 per node
Login Node

Users log in here
- ssh or ssh -X
- Cluster designers recommend 1 login node per 64 compute nodes
- Update /etc/profile.d so all users get the same environment when they log in

Only way to get into the cluster
- Static IP address (vs. DHCP addresses on all other cluster nodes)
- Turn on the built-in firewall software

Compile code
- Compiler licenses should be purchased for this node only; don't pay for more than you need (2 licenses might be sufficient for a department's code compilation)

Job control (using a scheduler)
- Choice of queues to access subsets of resources
- Submit, delete, and terminate jobs; check on job status
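One way to give every user the same environment, as suggested above, is a site-wide file under /etc/profile.d. The filename and paths below are illustrative assumptions, not the cluster's real layout:

```shell
# Hypothetical /etc/profile.d/cluster.sh -- sourced by every login
# shell, so all users pick up the same MPI and compiler paths.
export PATH=/usr/local/mpich/bin:/usr/local/pgi/bin:$PATH
export MANPATH=/usr/local/mpich/man:$MANPATH
```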
Spare Nodes
- Offline nodes that are put into service when an existing node dies
- Use for spare parts
- Use as a testing environment
Cluster Install Software
- Designed to make cluster installation easier (the "cluster in a box" concept)
- Shortens the install process through automated steps
- Decreases the chance of user error
- Choices: OSCAR, Felix, IBM xCAT, IBM CSM
Cluster Management Software
- Run parallel commands via a GUI, or write Perl scripts for command-line control
- Install new nodes, rebuild corrupted nodes
- Check on the status of hardware (nodes, network connections): Ganglia, xpbsmon, Myrinet tests (gm_board_info)
- xpbsmon shows running jobs that were submitted via the scheduler
Cluster Consistency
- Rsync or rdist the /etc/passwd, shadow, gshadow, and group files from the login node to the compute nodes
- Also consider rsync'ing (automatically or manually) the /etc/profile.d files, PBS config files, /etc/fstab, etc.
Local and Remote Management

Local management
- GUI desktop from the console monitor
- KVM switches to access each node

Remote management
- Console switch: ssh in and see what's on the console monitor screen from your remote desktop
- Web-based tools: Ganglia (ganglia.sourceforge.net), Netsaint (www.netsaint.org), Big Brother (www.bb4.com)
Ganglia
- Tool for monitoring clusters of up to 2000 nodes
- Used on over 500 clusters worldwide
- Available for multiple OSes and CPU architectures

To view it from a desktop:

# ssh -X coffee.chem.upenn.edu
# ssh coffeeadmin
# mozilla &

Then open http://coffeeadmin/ganglia
- The web page periodically auto-refreshes
Scheduling Software (PBS)
- Set up queues for different groups of users based on resource needs (i.e., not everyone needs Myrinet; some users only need 1 node)
- The world does not end if one node goes down; the scheduler will run the job on another node
- Make sure pbs_server and pbs_sched are running on the login node
- Make sure pbs_mom is running on all compute nodes, but not on the login, admin, or I/O nodes
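A quick way to check the daemon placement above is to look for the standard PBS process names on each host; which ones you expect depends on the node's role (pbs_server and pbs_sched on the login node, pbs_mom on compute nodes). A minimal sketch:

```shell
# List any PBS daemons running on this host.
daemons=$(ps -e | grep -E 'pbs_(server|sched|mom)' || true)
if [ -n "$daemons" ]; then
  echo "$daemons"
else
  echo "no PBS daemons found"
fi
```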
Scheduling Software

OpenPBS
- Limit users by number of jobs
- Good support via message boards
- *** FREE ***

PBS Pro
- The "pro" version of OpenPBS
- Limit by nodes, not just jobs per user
- Must pay for support ($25 per CPU, or $3200 for a 128-CPU cluster)

Others
- Load Share Facility, Codine, Maui
MPI Software
- MPICH (Argonne National Labs)
- LAM/MPI (OSC/Univ. of Notre Dame)
- MPI-GM (Myricom)
- MPI Pro (MSTi Software)
  - Programmed by one of the original developers of MPICH
  - Claims to be 20% faster than MPICH
  - Costs $1200 plus support per year
Compilers and Libraries

Compilers
- gcc/g77 (www.gnu.org/software)
- Portland Group (www.pgroup.com)
- Intel (www.developer.intel.com)

Libraries
- BLAS
- ATLAS, a portable BLAS (www.math-atlas.sourceforge.net)
- LAPACK
- ScaLAPACK, an MPI-based LAPACK
- FFTW, Fast Fourier Transform (www.fftw.org)
- many, many more
Cluster Security
- Securing and patching your Linux cluster is much like securing and patching your Linux desktop
- Keep an eye out for the latest patches
- Install a patch only if necessary, and try it on a test machine first
- Make sure there's a way to back out of a patch before installing it
Cluster Security
- Get rid of unneeded software
- Limit who installs and what gets installed
- Close unused ports and services
- Limit login service between the login node and the outside world to ssh
- Use ssh to tunnel X connections safely
- Limit access using hosts.allow/hosts.deny
- Use scp and sftp for secure file transfer
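The hosts.allow/hosts.deny restriction mentioned above is tcp_wrappers configuration. A default-deny sketch — the address prefix is an illustrative assumption, not the cluster's real network:

```
# /etc/hosts.deny -- deny everything not explicitly allowed
ALL: ALL

# /etc/hosts.allow -- then permit ssh only from a trusted subnet
# (example prefix; a trailing dot matches the whole network)
sshd: 128.91.
```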
Cluster Security
- Carefully configure NFS
- Upgrade to the latest, safest Samba version, if used
- Disable Apache if not needed
- Turn on the built-in Linux firewall software
Troubleshooting
- Make sure the core cluster services are running: scheduler, MPI, NFS, cluster managers
- Make sure software licenses are up to date
- Scan logs for break-in attempts
- Keep a written journal of all patches, installs, and upgrades
Troubleshooting
- Sometimes a reboot will fix the problem
- If you reboot the login node where the scheduler is running, be sure the scheduler is started after the reboot
  - Any jobs in the queues will be flushed
- Hard-rebooting hardware, such as tape drives, usually fixes the problem
Troubleshooting
- Reboot order: I/O node, login node, admin node, compute nodes (i.e., master nodes first, then slave nodes)
- Rebuilding a node takes 30 minutes with the cluster manager; reconfiguring it may take an hour more
Vendor ChoicesVendor Choices
DellDell IBMIBM Western ScientificWestern Scientific Aspen SystemsAspen Systems RacksaverRacksaver eRackseRacks Penguin ComputingPenguin Computing Many, many othersMany, many others
Go with a proven vendorGo with a proven vendor Get every vendor to spec Get every vendor to spec
out the same hardware out the same hardware and software before you and software before you compare pricescompare prices
Compare service Compare service agreementsagreements
How fast can they deliver How fast can they deliver a working cluster?a working cluster?
Buying Commercial Software
- Is it worth the money?
- Is it proven software?
- Are all the bells and whistles really necessary?
- Paid software does not necessarily have the best support
Cluster Tips
- Keep all sysadmin scripts in an easily accessible place, e.g. /4sysadmin or /usr/local/4sysadmin
- Force everyone to use the scheduler to run their jobs (even uniprocessor jobs)
  - Police it; don't let users get away with things
  - Wrapping some applications into a scheduler script can be tricky
Cluster Upgrades
- Nodes become obsolete in 2 to 3 years
- Upgrade banks of nodes at a time
- If upgrading to a new CPU, check for compatibility problems and new A/C requirements
- Upgrading memory and disk space is easy but tedious
- Upgrading the OS can be a major task; even installing patches can be a major task
Common Sense Cluster Administration
- Plan a little before you do anything
- Keep a journal of everything you do
- Create procedures that are easy to follow in times of stress
- Document everything!
- Test software before announcing it
- Educate and "radiate" your cluster knowledge to your support team
coffee.chem
- Funded by 6 P.I.'s in Chemistry
- Located in FBA121 next to A/C3
- 69 dual-CPU node cluster: 64 compute nodes, 1 login node, 1 admin node, 1 I/O node, 1 backup node, 1 firewall node
- Myrinet on 32 compute nodes, gigE on the other 32
- 2 TB RAID5 array (1.7 TB formatted)
- 12-slot, 4.8 TB capacity LTO tape drive
- 2U fold-out console with LCD monitor, keyboard, and trackpad
- 5 daisy-chained KVM switches
- 9 APC 3000 UPS units, each connected to its own circuit
- 3 42U racks
- Red Hat 9
- Felix cluster install and management software
- PBS Pro
- MPICH, LAM/MPI, MPICH-GM
- gnu and Portland Group compilers
- BLAS, ScaLAPACK, ATLAS libraries
- Gaussian98 (Gaussian03 + Linda soon)
- /data on the I/O node (coffeecompute00) holds common apps and user home directories
- Admin node (coffeeadmin) runs the Felix cluster manager
- Compute nodes are coffeecompute01..64
- Every node in the cluster can access /data via NFS
- You can ssh into the compute nodes, admin node, and I/O node only via the login node
- Backup node (javabean) temporarily has our backup device attached (we use tar right now)
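Since the backups here are plain tar, restoring a single file is just a selective extract. A local sketch with throwaway paths (not the cluster's real layout):

```shell
# Create a tiny "home directory", back it up with tar, delete a file,
# then restore only that file from the archive.
mkdir -p /tmp/tar-demo/home/user1
echo "results" > /tmp/tar-demo/home/user1/data.txt
tar -czf /tmp/tar-demo/backup.tar.gz -C /tmp/tar-demo home
rm /tmp/tar-demo/home/user1/data.txt
# Selective restore: name just the member you want back.
tar -xzf /tmp/tar-demo/backup.tar.gz -C /tmp/tar-demo home/user1/data.txt
cat /tmp/tar-demo/home/user1/data.txt   # prints "results"
```

Timing this kind of restore on real data is a cheap way to test the "ETA of bringing back a single file" mentioned earlier.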
Logging Into coffee.chem
- Everyone in this room will have user accounts on coffee.chem and home directories in /data/staff
- Our existence on the system is for Chemistry's benefit
- Support scripts are found in /4sysadmin
- If a reboot is necessary, make sure that PBS is started (/etc/init.d/pbs start)
Compiling and Running Code

Compiling:

pgCC -Mmpi -o test hello.cpp

- MPICH includes mpicc and mpif77 to compile and link MPI programs; they are scripts that pass the MPI library arguments to cc and f77

Running:

mpirun -np 8 test
mpirun -np XXX -machinefile YYY -nolocal test

- -np = number of processors
- -machinefile = file listing the processors you want to run the job on
- -nolocal = don't run the job locally
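The -machinefile argument points at a plain list of hostnames, one per line. The contents below are an illustrative assumption (the names mirror the cluster's coffeecomputeNN naming, not its real file):

```
# e.g. machines_gige_32.LINUX -- one host per line; in MPICH's ch_p4
# device a host can be suffixed :N to use N processors on that host
coffeecompute33
coffeecompute34
coffeecompute35:2
```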
Submitting a Job

3 queues to choose from:
- coffeeq: general-purpose queue; 12 hours max run time; 16 processors max
- espressoq: higher priority than coffeeq; 3 weeks max run time
- Some may still use piq, but this will go away soon
Submitting a Job

Prepare a scheduler script:

#!/bin/tcsh
#PBS -l arch=linux       {define architecture}
#PBS -l cput=1:00:00     {define CPU time needed}
#PBS -l mem=400mb        {define memory space needed}
#PBS -l nodes=64:ppn=1   {define number of nodes needed}
#PBS -m e                {mail me the results}
#PBS -c c                {minimal checkpointing}
#PBS -k oe               {keep the output and errors}
#PBS -q coffeeq          {run the job on coffeeq}
mpirun -np 8 -machinefile machines_gige_32.LINUX /data/staff/din/newhello

Then qsub the scheduler script.
More PBS Commands
- Check on the status of all submitted jobs with: qstat
- Submit a job with: qsub
- Delete a job with: qdel
- Terminate the execution of a job with: qterm
- See all your available compute node resources with: pbsnodes -a
Node Terms
- Login node = service node = head node = the node users log into
- Master scheduler node = the node where the scheduler runs, usually the login node
- Admin node = the node the sysadmin logs into to gain access to cluster management apps
- Compute node = one of the nodes that perform pieces of a larger computation
- Storage node = the node that has the RAID array or SAN attached to it
- Backup node = the node that has the backup solution attached to it
- I/O node = a node that can combine the features of storage and backup nodes
- Visualization node = a node that contains a graphics card and graphics console; multiple visualization nodes can be combined in a matrix to form a video wall
- Spare node = a node that is not in service but can be rebuilt to take the place of a compute node or, in some cases, an admin or login node
References
- Bookman, Charles. Linux Clustering: Building and Maintaining Linux Clusters. New Riders, Indianapolis, Indiana, 2003.
- Howse, Martin. "Dropping the Bomb: AMD Opteron" in Linux User & Developer, Issue 33, pp. 33-36.
- Robertson, Alan. "Highly-Affordable High Availability" in Linux Magazine, November 2003, pp. 16-21.
- The Seventh LCI Workshop Systems Track Notes. Linux Clusters Institute, March 24-28, 2003.
- Sterling, Thomas et al. How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. The MIT Press, Cambridge, Massachusetts, 1999.
- Vrenios, Alex. Linux Cluster Architecture. Sams Publishing, Indianapolis, Indiana, 2002.
coffee.chem Contact List
- Dell hardware problems: 800-234-1490
- Myrinet problems: [email protected]
- "Very limited" software support: [email protected]
- PGI compiler issues: [email protected]