Scaling Up User Codes on the SP
David Skinner, NERSC Division, Berkeley Lab


Page 1: Scaling Up User Codes on the SP

David Skinner, NERSC Division, Berkeley Lab

Page 2: Motivation

• NERSC’s focus is on capability computation
  – Capability == jobs that use ¼ or more of the machine’s resources

• Scientists whose work involves large-scale computation or HPC should stay ahead of workstation-sized problems

• “Big Science” problems are more interesting!

Page 3: Challenges

• CPUs are outpacing memory bandwidth and switches, leaving FLOPs increasingly isolated.

• Vendors often have machines less than half the size of NERSC’s, so system software may be operating in uncharted regimes:
  – MPI implementation
  – Filesystem metadata systems
  – Batch queue system

Users need information on how to mitigate the impact of these issues for large concurrency applications.

Page 4: Seaborg.nersc.gov

MP_EUIDEVICE (switch fabric)    MPI Bandwidth (MB/sec)       MPI Latency (usec)
css0                            500 / 350                    8 / 16
css1
csss                            500 / 350 (single task)      8 / 16

Page 5: Switch Adapter Performance

[Figure: measured performance of the csss and css0 switch adapters]

Page 6: Switch considerations

• For data-decomposed applications with some locality, partition the problem along SMP boundaries to minimize the surface-to-volume ratio (a sketch follows at the end of this slide)

• Use MP_SHAREDMEMORY to minimize switch traffic

• csss is most often the best route to the switch
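
The idea of partitioning along SMP boundaries can be made concrete with MPI_Comm_split. The following is a minimal sketch, not from the slides, assuming block task placement with one task per CPU on a 16-way node; TASKS_PER_NODE and the printed diagnostic are illustrative only.

/* Sketch: group MPI tasks by SMP node so most communication stays
 * inside a node (shared memory) instead of crossing the switch.
 * Assumes block placement of TASKS_PER_NODE tasks per node. */
#include <mpi.h>
#include <stdio.h>

#define TASKS_PER_NODE 16   /* illustrative: one task per CPU on a 16-way node */

int main(int argc, char **argv)
{
    int world_rank, node_rank, color;
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Tasks with the same color land in the same sub-communicator. */
    color = world_rank / TASKS_PER_NODE;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Prefer node_comm for exchanges with on-node neighbors; reserve
     * MPI_COMM_WORLD traffic for the inter-node surface of the domain. */
    printf("world rank %d = node-local rank %d on node block %d\n",
           world_rank, node_rank, color);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}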

Page 7: Synchronization

• On the SP each SMP image is scheduled independently; while user code is waiting, the OS will schedule other tasks

• A fully synchronizing MPI call requires everyone’s attention

• By analogy, imagine trying to go to lunch with 1024 people

• Probability that everyone is ready at any given time scales poorly
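
As a rough illustration (not from the slide): if each task is independently ready to enter a fully synchronizing collective with probability p at any instant, then all T tasks are ready simultaneously with probability p^T. For p = 0.99 and T = 1024 that is 0.99^1024 ≈ 3×10^-5, which is why collective entry times degrade so sharply at high concurrency.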

Page 8: Synchronization (continued)

• MPI_Alltoall and MPI_Allreduce can be particularly bad in the range of 512 tasks and above

• Use MPI_Bcast if possible
  – Not fully synchronizing (see the sketch below)

• Remove unneeded MPI_Barrier calls

• Use asynchronous I/O when possible
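
A minimal sketch of the broadcast point, not from the slides: rank 0 produces a control value, a hypothetical timestep here, and MPI_Bcast distributes it, so only the root has to be ready with the data rather than all tasks meeting in an all-to-all style collective. choose_dt() is a stand-in application routine.

/* Sketch: prefer a root-driven MPI_Bcast over fully synchronizing
 * collectives when the algorithm allows it. */
#include <mpi.h>
#include <stdio.h>

/* Stand-in for an application routine that only rank 0 needs to run. */
static double choose_dt(void)
{
    return 1.0e-3;
}

int main(int argc, char **argv)
{
    int rank;
    double dt = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        dt = choose_dt();

    /* All tasks receive dt; a task that arrives late typically delays only
     * the tasks downstream of it in the broadcast tree, not everyone, as an
     * MPI_Allreduce or MPI_Alltoall would. No MPI_Barrier is needed before
     * or after the broadcast. */
    MPI_Bcast(&dt, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("task %d using dt = %g\n", rank, dt);

    MPI_Finalize();
    return 0;
}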

Page 9: Load Balance

• If one task lags the others in time to completion, synchronization suffers; e.g., a 3% slowdown in one task can mean a 50% slowdown for the code overall

• Seek out and eliminate sources of variation
• Distribute the problem uniformly among nodes/CPUs (see the sketch after the figure)

[Figure: per-task timeline (tasks 0–3, time 0–100) broken down into FLOP, I/O, and SYNC phases]
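
A minimal sketch, not from the slides, of distributing work uniformly: N items over P tasks, with the first N mod P tasks taking one extra item so no task carries more than one item above any other. N here is an arbitrary example value.

/* Sketch: uniform block distribution of N work items over P tasks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const long N = 1000000;   /* total work items (example value) */
    int rank, ntasks;
    long base, extra, my_count, my_first;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    base  = N / ntasks;
    extra = N % ntasks;
    /* The first 'extra' tasks each take one additional item. */
    my_count = base + (rank < extra ? 1 : 0);
    my_first = rank * base + (rank < extra ? rank : extra);

    printf("task %d: items [%ld, %ld)\n", rank, my_first, my_first + my_count);

    MPI_Finalize();
    return 0;
}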

Page 10: Alternatives to MPI

• CHARM++ and NAMD
  – Spatially decomposed molecular dynamics with periodic load balancing; data decomposition is adaptive

• AMPI (http://charm.cs.uiuc.edu/)
  – An automatic approach to load balancing

• BlueGene/L-type machines with > 10K CPUs will require re-examining these issues altogether

Page 11: Improving MPI Scaling on Seaborg

Page 12: The SP switch

• Use MP_SHAREDMEMORY=yes (default)

• Use MP_EUIDEVICE=csss for 32 bit applications (default)

• Run /usr/common/usg/bin/phost prior to your parallel program to map machine names to POE tasks
  – MPI and LAPI versions available
  – Host lists are useful in general

Page 13: 64 bit MPI

• 32 bit MPI has inconvenient memory limits
  – 256MB per task default and 2GB maximum
  – 1.7GB can be used in practice, but this depends on MPI usage
  – The scaling of this internal usage is complicated, but larger-concurrency jobs have more of their memory “stolen” by MPI’s internal buffers and pipes

• 64 bit MPI removes these limits
  – But it must run on css0 only, with less switch bandwidth

• Seaborg has 16, 32, and 64 GB per node available

Page 14: 64 bit MPI Howto

At compile time:

* module load mpi64
* compile with the "-q64" option using mpcc_r, mpxlf_r, or mpxlf90_r

At run time:

* module load mpi64
* use "#@ network.MPI = css0,us,shared" in your job scripts; the multilink adapter "csss" is not currently supported
* run your POE code as you normally would

Page 15: MP_LABELIO, phost

• Labeled I/O will let you know which task generated the message “segmentation fault”, gave the wrong answer, etc.

export MP_LABELIO=yes

• Run /usr/common/usg/bin/phost prior to your parallel program to map machine names to POE tasks
  – MPI and LAPI versions available
  – Host lists are useful in general

Page 16: Core files

• Core dumps don’t scale (no parallel work)

• MP_COREDIR=/dev/null: no corefile I/O
• MP_COREFILE_FORMAT=light_core: less I/O
• LL script to save just one full-fledged core file and throw away the others:

  …
  if [ "$MP_CHILD" != "0" ]; then
      export MP_COREDIR=/dev/null
  fi
  …

Page 17: Debugging

• In general, debugging at 512 tasks and above is error-prone and cumbersome.

• Debug at a smaller scale when possible.

• Use the shared-memory device of MPICH on a workstation with lots of memory to simulate 1024 CPUs.

• For crashed jobs, examine the LL logs for memory usage history.

Page 18: Parallel I/O

• Can be a significant source of variation in task completion time prior to synchronization

• Limit the number of readers or writers when appropriate (see the sketch below). Pay attention to file creation rates.

• Output reduced quantities when possible
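
A minimal sketch, not Seaborg-specific, of limiting the number of writers: tasks are grouped, each group gathers its output to a single writer, and only that writer touches the filesystem. WRITE_GROUP, the data values, and the file names are assumptions made for the example.

/* Sketch: funnel output through one writer task per group so that only
 * a few tasks create and write files. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WRITE_GROUP 64   /* illustrative: one writer per 64 tasks */

int main(int argc, char **argv)
{
    int rank, io_rank, io_size, i;
    double my_result;
    double *gathered = NULL;
    MPI_Comm io_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One sub-communicator per write group; local rank 0 is the writer. */
    MPI_Comm_split(MPI_COMM_WORLD, rank / WRITE_GROUP, rank, &io_comm);
    MPI_Comm_rank(io_comm, &io_rank);
    MPI_Comm_size(io_comm, &io_size);

    my_result = (double)rank;   /* stand-in for real output data */
    if (io_rank == 0)
        gathered = malloc(io_size * sizeof(double));

    MPI_Gather(&my_result, 1, MPI_DOUBLE, gathered, 1, MPI_DOUBLE, 0, io_comm);

    if (io_rank == 0) {
        char fname[64];
        FILE *fp;
        sprintf(fname, "out.%04d", rank / WRITE_GROUP);
        fp = fopen(fname, "w");
        if (fp != NULL) {
            for (i = 0; i < io_size; i++)
                fprintf(fp, "%f\n", gathered[i]);
            fclose(fp);
        }
        free(gathered);
    }

    MPI_Comm_free(&io_comm);
    MPI_Finalize();
    return 0;
}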

Page 19: OpenMP

• Using a mixed model, even when no underlying fine-grained parallelism is present, can take strain off of the MPI implementation; e.g., on Seaborg a 2048-way job can run with only 128 MPI tasks of 16 OpenMP threads each

• Having hybrid code whose concurrencies can be tuned between MPI tasks and OpenMP threads has portability advantages (a minimal hybrid sketch follows)
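
A minimal hybrid sketch, not from the slides: each MPI task does its local loop with OpenMP threads and uses MPI only for the coarse-grained reduction, so the MPI layer sees far fewer tasks. N and the loop body are placeholders; the thread-safe compilers (mpcc_r etc.) with OpenMP enabled are assumed.

/* Sketch: hybrid MPI + OpenMP. MPI calls are made only outside the
 * OpenMP parallel region, so a plain MPI_Init suffices here. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000   /* local work items per MPI task (example value) */

int main(int argc, char **argv)
{
    int rank, ntasks, i;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* Fine-grained work is spread over OpenMP threads within the node. */
    #pragma omp parallel for reduction(+:local)
    for (i = 0; i < N; i++)
        local += 1.0 / (double)(i + 1 + rank);

    /* Coarse-grained reduction across the (relatively few) MPI tasks. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f over %d tasks x %d threads\n",
               global, ntasks, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}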

Page 20: Summary

• Resources are available to face the challenges posed by scaling up MPI applications on Seaborg.

• Scientists should expand their problem scopes to tackle increasingly challenging computational problems.

• NERSC consultants can provide help in achieving scaling goals.
