Transcript of Discover Cluster Upgrades: Hello Haswells and SLES11 SP3, Goodbye Westmeres, February 3, 2015 NCCS Brown Bag

Page 1

Discover Cluster Upgrades:

Hello Haswells and SLES11 SP3, Goodbye Westmeres

February 3, 2015, NCCS Brown Bag

Page 2

NASA Center for Climate Simulation

Agenda

• Discover Cluster Hardware Changes & Schedule – Brief Update

• Using Discover SCU10 Haswell / SLES11 SP3

• Q & A

Page 3

Discover Hardware Changes & Schedule Update

Page 4

Discover’s New Intel Xeon “Haswell” Nodes

• Discover’s Intel Xeon “Haswell” nodes:
  – 28 cores per node, 2.6 GHz
  – Usable memory: 120 GB per node, ~4.25 GB per core (128 GB total)
  – FDR InfiniBand (56 Gbps), 1:1 blocking
  – SLES11 SP3
  – NO SWAP space, but DO have lscratch and shmem disk space

• SCU10:
  – 720* Haswell nodes for general use (1,080 nodes total), 30,240 cores total, 1,229 TFLOPS peak total
  – *Up to 360 of the 720 nodes may be episodically allocated for priority work

• SCU11:
  – ~600 Haswell nodes, 16,800 cores total, 683 TFLOPS peak

Page 5

Discover Hardware Changes in a Nutshell

• January 30, 2015 (-70 TFLOPS):

– Removed: 516 Westmere (12-core) nodes (SCU3, SCU4)

• February 2, 2015 (+806 TFLOPS for general work):

– Added: ~720* Haswell (28-core) nodes (2/3 of SCU10)

• *Up to 360 of the 720 nodes may be episodically allocated to a priority project

• Week of February 9, 2015 (-70 TFLOPS):

– Removed: 516 Westmere (12-core) nodes (SCU1, SCU2)

– Removed: 7 oldest (‘Dunnington’) Dalis (dali02-dali08)

• Late February/early March 2015 (+713 TFLOPS for general work):

– Added: 600 Haswell (28-core) nodes (SCU11)

[Chart: TFLOPS for General User Work]

Page 6

Discover Node Count for General Work – Fall/Winter Evolution

Page 7

Discover Processor Cores for General Work – Fall/Winter Evolution

Page 8

Oldest Dali Nodes to Be Decommissioned

• The oldest Dali nodes (dali02 – dali08) will be decommissioned starting February 9 (plenty of newer Dali nodes remain).

• You should see no impact from the decommissioning of old Dali nodes, provided you have not been explicitly specifying one of the dali02 – dali08 node names when logging in.

Page 9

Using Discover SCU10 and Haswell / SLES11 SP3

Page 10

How to use SCU10

• 720 Haswell nodes on SCU10 available in sp3 partition

• To be placed on a login node with the SP3 development environment, after providing your NCCS LDAP password, specify “discover-sp3” at the “Host” prompt:

Host: discover-sp3

• However, you may submit to the sp3 partition from any login node.

Page 11

How to use SCU10

• To submit a job to the sp3 partition, use either:
  – Command line:
    sbatch --partition=sp3 --constraint=hasw myjob.sh
  – Or inline directives:
    #SBATCH --partition=sp3
    #SBATCH --constraint=hasw
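Putting these together, here is a minimal sketch of a complete sp3 job script; the job name, task count, time limit, and executable are illustrative placeholders, not NCCS-prescribed values:

#!/bin/bash
#SBATCH --job-name=hasw_test      # illustrative job name
#SBATCH --partition=sp3           # SLES11 SP3 / SCU10 partition
#SBATCH --constraint=hasw         # request Haswell nodes
#SBATCH --ntasks=56               # total tasks (example value)
#SBATCH --time=01:00:00           # walltime limit (example value)

# Launch the application; "./my_model" is a hypothetical executable.
srun ./my_model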

Page 12

Porting your work: the fine print…

• There is a small (but non-zero) chance your scripts and binaries will run with no changes at all.

• Nearly all scripts and binaries will require changes to make best use of SCU10.

Page 13

Porting your work: the fine print…

• There is a small (but non-zero) chance your scripts and binaries will run with no changes at all.

• Nearly all scripts and binaries will require changes to make best use of SCU10, sooo…

With great power comes great responsibility.

- Ben Parker (2002)

Page 14

Adjust for new core count

• Haswell nodes have 28 cores, 128 GB
  – More than 2x the memory per core of Sandy Bridge

• Specify total cores/tasks needed, not nodes.
  – Example, for Sandy Bridge nodes:
    #SBATCH --ntasks=800
    Not
    #SBATCH --nodes=50

• This allows SLURM to allocate whatever resources are available.
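As a hedged illustration of why this helps: a script keyed to --ntasks does not need to know how many nodes SLURM actually chooses (the executable name is hypothetical):

#!/bin/bash
#SBATCH --ntasks=800        # ask for 800 tasks; SLURM picks the nodes
#SBATCH --constraint=hasw

# SLURM sets SLURM_NTASKS and SLURM_JOB_NUM_NODES at run time, so the
# launch line below works whether the 800 tasks land on 29 Haswell
# nodes or on some other mix of available nodes.
echo "Running $SLURM_NTASKS tasks on $SLURM_JOB_NUM_NODES nodes"
srun ./my_model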

Page 15

If you must control the details…

• … still don’t use --nodes.

• If you need more than ~4 GB/core, use fewer cores/node.
  #SBATCH --ntasks-per-node=N…
  – Assumes 1 task/core (the usual case).

• Or specify required memory:
  #SBATCH --mem-per-cpu=N_MB…

• SLURM will figure out how many nodes are needed to meet this specification.
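A hedged sketch of both approaches for a memory-hungry job; the values are examples only, and the two directive sets are alternatives, not meant to be combined:

# Alternative A: fewer tasks per node (1 task per core assumed),
# so each task sees roughly twice the usual memory.
#SBATCH --ntasks=280
#SBATCH --ntasks-per-node=14     # half-populate the 28-core Haswell nodes

# Alternative B: state the memory requirement per task instead and
# let SLURM work out how many nodes that implies.
#SBATCH --ntasks=280
#SBATCH --mem-per-cpu=8192       # ~8 GB per task, in MB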

Page 16

Script changes summary

• Avoid specifying --partition unless absolutely necessary.
  – And sometimes not even then…

• Avoid specifying --nodes.
  – Ditto.

• Let SLURM do the work for you.
  – That’s what it’s there for, and it allows for better resource utilization.

Page 17

Source code changes

• You might not need to recompile…
  – … but the SP3 upgrade may require it.

• SCU10 hardware is brand new, possibly needing a recompile.
  – New features, e.g. AVX2 vector registers
  – SGI nodes, not IBM
  – FDR vs. QDR InfiniBand
  – NO SWAP SPACE!
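As a hedged sketch of a recompile that picks up the new AVX2 hardware (the module names, source file, and output name are placeholders; the Intel and GNU flags shown are the standard AVX2 options for those compilers):

# On a discover-sp3 login node, load a tested compiler/MPI pair.
# The module names below are placeholders; run "module avail" for the real ones.
module load intel-compiler intel-mpi

# Intel C (listed as working on SP3): target Haswell's AVX2 instructions.
mpiicc -O2 -xCORE-AVX2 -o my_model my_model.c

# GNU alternative with equivalent AVX2 targeting:
# mpicc -O2 -march=core-avx2 -o my_model my_model.c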

Page 18

And did I mention…

• … NO SWAP SPACE!

• This is critical.
  – When you run out of memory now, you won’t start to swap; your code will throw an exception.

• Ameliorated by the higher GB/core ratio…
  – … but we still expect some problems from this.

• Use policeme to monitor the memory requirements of your code.
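policeme is the NCCS-provided monitor; as a generic stand-in (not policeme itself), GNU time can report a process's peak resident memory, which gives a rough per-task requirement. The executable here is hypothetical:

# /usr/bin/time is GNU time, not the shell built-in "time"; -v prints
# verbose statistics, including the peak resident set size in kilobytes.
/usr/bin/time -v ./my_model 2>&1 | grep "Maximum resident set size"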

Page 19

If you do recompile…

• Current working compiler modules:
  – All Intel C compilers (Intel Fortran not yet tested)
  – gcc 4.5, 4.8.1, 4.9.1
  – g95 0.93

• Current working MPI modules:
  – SGI MPT
  – Intel MPI 4.1.1.036 and later
  – MVAPICH2 1.8.1, 1.9, 1.9a, 2.0, 2.0a, 2.1a
  – OpenMPI 1.8.1, 1.8.2, 1.8.3

Page 20

MPI “gotchas”

• Programs using old Intel MPI must be upgraded.

• MVAPICH2 and OpenMPI have only been tested on single-node jobs.

• All MPI modules (except SGI MPT) may experience stability issues when node counts are >~300.
  – Symptom: abnormally long MPI teardown times.

Page 21

cron jobs

• discover-cron is still at SP1.
  – When running SP3-specific code, you may need to ssh to an SP3 node for proper execution.
  – Not extensively tested yet.
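A hedged sketch of that workaround: have the discover-cron entry ssh to an SP3 login node and run the command there. The script path is a placeholder, and whether "discover-sp3" is directly reachable from the cron host should be verified with NCCS:

# crontab entry on discover-cron: at 03:00 daily, run the SP3-specific
# script on an SP3 node instead of locally.
0 3 * * * ssh discover-sp3 /path/to/sp3_only_job.sh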

Page 22

Sequential job execution

• Jobs may not execute in submission order.
  – Small and interactive jobs favored during the day.
  – Large jobs favored at night.

• If execution order is important, the dependencies must be specified to SLURM.

• Multiple dependencies can be specified with the --dependency option.
  – Can depend on start, end, failure, error, etc.

Page 23

Dependency example

# String to hold the job IDs.
job_ids=''

# Submit the first parallel processing job, save the job ID.
job_id=`sbatch job1.sh | cut -d ' ' -f 4`
job_ids="$job_ids:$job_id"

# Submit the second parallel processing job, save the job ID.
job_id=`sbatch job2.sh | cut -d ' ' -f 4`
job_ids="$job_ids:$job_id"

# Submit the third parallel processing job, save the job ID.
job_id=`sbatch job3.sh | cut -d ' ' -f 4`
job_ids="$job_ids:$job_id"

# Wait for the processing jobs to finish successfully, then
# run the post-processing job.
sbatch --dependency=afterok$job_ids postjob.sh

Page 24

Coming attraction: shared nodes

• SCU10 nodes will initially be exclusive: 1 job/node

• This is how we roll on discover now.

• May leave a lot of unused cores and/or memory.

• Eventually, SCU10 nodes (and maybe others) will be shared among jobs.
  – Same or different users.

• What does this mean?

Page 25

Shared nodes (future)

• You will no longer be able to assume that all of the node resources are for you.

• Specifying task and memory requirements will ensure SLURM gets you what you need.

• Your jobs must learn to “work and play well with others”.
  – Unexpected job interactions, esp. with I/O, may cause unusual behavior when nodes are shared.

Page 26

Shared nodes (future, continued)

• If you absolutely must have a minimum number of CPUs in a node, the --mincpus=N option to sbatch will ensure you get it.
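For example (illustrative values; myjob.sh is the same placeholder script name used earlier):

# Require at least 16 CPUs on any node this job is given,
# even when nodes are shared with other jobs.
sbatch --mincpus=16 --ntasks=64 myjob.sh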

Page 27

Questions & Answers

NCCS User Services: [email protected]

301-286-9120

https://www.nccs.nasa.gov

Thank you

Page 28

Supplemental Slides

Page 29

Discover Compute Nodes, February 3, 2015 (Peak ~1,629 TFLOPS)

• “Haswell” nodes, 28 cores per node, 2.6 GHz (new)

– SLES11 SP3

– SCU10, 4.5 GB memory per core (new)

• 720* nodes general use (1,080 nodes total), 30,240 cores total, 1,229 TFLOPS peak total (*360 nodes episodically allocated for priority work)

• “Sandy Bridge” nodes, 16 cores per node, 2.6 GHz (no change)
  – SLES11 SP1

– SCU8, 2 GB memory per core

• 480 nodes, 7,680 cores, 160 TFLOPS peak

– SCU9, 4 GB memory per core

• 480 nodes, 7,680 cores, 160 TFLOPS peak

• “Westmere” nodes, 12 cores per node, 2 GB memory per core, 2.6 GHz
  – SLES11 SP1
  – SCU1, SCU2 (SCUs 3, 4, and 7 already removed)

• 516 nodes, 6,192 cores total, 70 TFLOPS peak

Page 30

Discover Compute Nodes, March 2015 (Peak ~2,200 TFLOPS)

• “Haswell” nodes, 28 cores per node

– SLES11 SP3

– SCU10, 4.5 GB memory per core

• 720* nodes general use (1,080 nodes total), 30,240 cores total, 1,229 TFLOPS peak total (*360 nodes episodically allocated for priority work)

– SCU11, 4.5 GB memory per core (new)

• ~600 nodes, 16,800 cores total, 683 TFLOPS peak

• “Sandy Bridge” nodes, 16 cores per node (no change)
  – SLES11 SP1

– SCU8, 2 GB memory per core

• 480 nodes, 7,680 cores, 160 TFLOPS peak

– SCU9, 4 GB memory per core

• 480 nodes, 7,680 cores, 160 TFLOPS peak

• No remaining “Westmere” nodes

Page 31

Discover Compute transition timeline, Jan. 26 - Mar. 27, 2015 (timeline columns: Jan. 26-30, Feb. 2-6, Feb. 9-13, Feb. 17-20, Feb. 23-27, Mar. 2-27):

• SCU10 (SLES11 SP3, 1,080 nodes, 30,240 cores, Intel Haswell, 1,229 TF peak): SCU10 Integration, then SCU10 General Access: +720* Nodes.

SCU10 arrived in mid-November 2014. Following installation and resolution of initial power issues, the NCCS provisioned SCU10 with Discover images and integrated it with GPFS storage. NCCS stress testing and targeted high-priority use occurred in January 2015. (*360 nodes episodically allocated for priority work.)

• SCU 8 and 9 (SLES11 SP1, 960 nodes, 15,360 cores, Intel Sandy Bridge, 320 TF peak):

No changes during this period (January - March 2015). In November 2014, 480 nodes previously allocated for a high-priority project were made available for all user processing.

• SCU11 (SLES11 SP3, 600 nodes, 16,800 cores, Intel Haswell, 683 TF peak): SCU11 Integration (Physical Installation, Configuration, Stress Testing), then SCU11 General Access: +600 Nodes.

SCU11 (600 Haswell nodes) has been delivered and will be installed starting Feb. 9. The NCCS will then provision the system with Discover images and integrate it with GPFS storage. Power and I/O connections from Westmere SCUs 1, 2, 3, and 4 are needed for SCU11; thus, SCUs 1, 2, 3, and 4 must be removed prior to SCU11 integration.

• SCU 1, 2, 3, 4 Decommissioning (SLES11 SP1, 1,032 nodes, 12,384 cores, Intel Westmere, 139 TF peak): Drain 516 nodes, Remove 516 nodes; Drain 516 nodes, Remove 516 nodes.

To make room for the new SCU11 compute nodes, the nodes of Scalable Units 1, 2, 3, and 4 (12-core Westmeres installed in 2011) are being removed from operations during February. Removal of half of these nodes will coincide with general access to SCU10, and the remaining half during installation of SCU11.

Page 32

Discover “SBU” Computational Capacity for General Work – Fall/Winter Evolution

Page 33

Total Discover Peak Computing Capability as a Function of Time (Intel Xeon Processors Only)

Page 34

Total Number of Discover Intel Xeon Processor Cores as a Function of Time

Page 35

Storage Augmentations

• Dirac (Mass Storage) Disk Augmentation
  – 4 Petabytes usable (5 Petabytes “raw”), installed
  – Gradual data move starts the week of February 9 (many files and “inodes” to move)

• Discover Storage Expansion
  – 8 Petabytes usable (10 Petabytes “raw”), installed
  – For both general use and the targeted “Climate Downscaling” project
  – Phased deployment, including optimizing the arrangement of existing project and user nobackup space
