
Scaling Up MPI and MPI-I/O on seaborg.nersc.gov

David Skinner, NERSC Division, Berkeley Lab


Scaling: Motivation

• NERSC’s focus is on capability computation
  – Capability == jobs that use ¼ or more of the machine’s resources

• Parallelism can deliver scientific results unattainable on workstations.

• “Big Science” problems are more interesting!


Scaling: Challenges

• CPUs are outpacing memory bandwidth and switches, leaving FLOPs increasingly isolated.

• Vendors often have in-house machines less than half the size of NERSC’s, so system software may be operating in uncharted regimes:
  – MPI implementation

– Filesystem metadata systems

– Batch queue system

• NERSC consultants can help

Users need information on how to mitigate the impact of these issues for large concurrency applications.


Seaborg.nersc.gov

MP_EUIDEVICE (switch fabric)   MPI Bandwidth (MB/sec)      MPI Latency (usec)
css0, css1                     500 / 350                   8 / 16
csss                           500 / 350 (single task)     8 / 16


Switch Adapter Bandwidth: csss


Switch Adapter Comparison

[Figure: bandwidth vs. message size for csss and css0]

Tune message size to optimize throughput.


Switch Adapter Considerations

• For data-decomposed applications with some locality, partition the problem along SMP boundaries to minimize the surface-to-volume ratio (a per-node communicator sketch follows this list)

• Use MP_SHAREDMEMORY to minimize switch traffic

• csss is most often the best route to the switch
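As an illustration of partitioning along SMP boundaries, here is a minimal sketch (not from the slides) of building one communicator per SMP node by hashing the hostname, which works with MPI-1 era libraries. The name node_comm and the hash scheme are assumptions; a production code would exchange full hostnames to rule out hash collisions.

#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int world_rank, node_rank, node_size, i;
    unsigned int color = 0;
    char host[256];
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    gethostname(host, sizeof(host));              /* name of this SMP node */
    for (i = 0; i < (int)strlen(host); i++)       /* crude hostname hash -> color */
        color = color * 31u + (unsigned char)host[i];
    color &= 0x7fffffffu;                         /* MPI_Comm_split needs color >= 0 */

    /* tasks on the same node get the same color, hence the same communicator */
    MPI_Comm_split(MPI_COMM_WORLD, (int)color, world_rank, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    printf("task %d: host %s, on-node rank %d of %d\n",
           world_rank, host, node_rank, node_size);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}

With such a per-node communicator, nearest-neighbor data can be placed so that most traffic stays inside the SMP and goes through shared memory rather than the switch.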


Job Start Up times


Synchronization

• On the SP each SMP image is scheduled independently; while user code is waiting, the OS will schedule other tasks

• A fully synchronizing MPI call requires everyone’s attention

• By analogy, imagine trying to go to lunch with 1024 people

• Probability that everyone is ready at any given time scales poorly


Scaling of MPI_Barrier()


Load Balance

• If one task lags the others, time is lost waiting at synchronization points: e.g., a 3% slowdown in one task can mean a 50% slowdown for the code overall

• Seek out and eliminate sources of variation (a simple way to measure per-task imbalance is sketched after the figure)
• Distribute the problem uniformly among nodes/CPUs

[Figure: timelines for tasks 0–3 over 0–100 time units, each broken into FLOP, I/O, and SYNC phases]
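One simple way to quantify this kind of imbalance, a sketch assuming the work phase has already been timed with MPI_Wtime (the function name report_imbalance is illustrative, not from the slides):

#include <mpi.h>
#include <stdio.h>

/* Report the max/avg ratio of time spent in a work phase.  A ratio near 1.0
   means well balanced; 1.5 means the slowest task took 50% longer than the
   average task.  Call after MPI_Init, once per phase of interest. */
void report_imbalance(double work_seconds)
{
    int rank, ntasks;
    double tmax, tsum;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Reduce(&work_seconds, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&work_seconds, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("load imbalance: max/avg = %.2f\n", tmax / (tsum / ntasks));
}

Bracket the compute (or I/O) phase of each iteration with MPI_Wtime() calls and pass the difference to this routine.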


Synchronization: MPI_Bcast 2048 tasks


Synchronization: MPI_Alltoall 2048 tasks


Synchronization (continued)

• MPI_Alltoall and MPI_Allreduce can be particularly bad in the range of 512 tasks and above

• Use MPI_Bcast if possible; it is not fully synchronizing

• Remove unneeded MPI_Barrier calls

• Use immediate (nonblocking) sends and asynchronous I/O when possible (see the sketch below)
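For example, a hedged sketch of replacing a blocking halo exchange with immediate sends/receives so that local work overlaps communication; halo_out, halo_in, nhalo, partner, and compute_interior() are illustrative names, not from the slides.

#include <mpi.h>

extern void compute_interior(void);   /* placeholder for work that needs no halo data */

void exchange_and_compute(double *halo_out, double *halo_in, int nhalo, int partner)
{
    MPI_Request req[2];
    MPI_Status  stat[2];

    /* post the receive first, then the send; neither call blocks */
    MPI_Irecv(halo_in,  nhalo, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(halo_out, nhalo, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[1]);

    compute_interior();                /* overlap: do useful work while messages move */

    MPI_Waitall(2, req, stat);         /* synchronize only when the halo data is needed */
}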


Improving MPI Scaling on Seaborg


The SP switch

• Use MP_SHAREDMEMORY=yes (default)

• Use MP_EUIDEVICE=csss (default)

• Tune message sizes

• Reduce synchronizing MPI calls


64 bit MPI

• 32-bit MPI has inconvenient memory limits
  – 256 MB per task default and 2 GB maximum
  – About 1.7 GB can be used in practice, but it depends on MPI usage
  – The scaling of this internal usage is complicated, but larger-concurrency jobs have more of their memory “stolen” by MPI’s internal buffers and pipes

• 64-bit MPI removes these barriers
  – 64-bit MPI is fully supported
  – Just remember to use the “_r” compilers and “-q64”

• Seaborg has nodes with 16, 32, and 64 GB of memory available


How to measure MPI memory usage?

[Figure: MPI memory usage at 2048 tasks]


MP_PIPE_SIZE: 2 * PIPE_SIZE * (ntasks-1)

For illustration, with 64 KB pipes a 2048-task job would devote roughly 2 × 64 KB × 2047 ≈ 256 MB per task to MPI pipe buffers alone.


OpenMP

• Using a mixed model, even when no underlying fine-grained parallelism is present, can take strain off of the MPI implementation: e.g., on seaborg a 2048-way job can run with only 128 MPI tasks and 16 OpenMP threads per task

• Having hybrid code whose concurrency can be tuned between MPI tasks and OpenMP threads has portability advantages (a minimal hybrid sketch follows)
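A minimal hybrid sketch, assuming the OpenMP threads only do computation and all MPI calls stay outside the parallel region; the array size and names are illustrative. On seaborg such a code would be built with something like mpcc_r -qsmp=omp -q64 and run with OMP_NUM_THREADS set to the desired threads per task.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 100000

static double local[N];               /* this MPI task's block of the problem */

int main(int argc, char **argv)
{
    int rank, ntasks, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* OpenMP threads share the loop over this task's block; no MPI calls inside */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        local[i] = rank + 1.0e-6 * i;

    if (rank == 0)
        printf("%d MPI tasks x up to %d OpenMP threads each\n",
               ntasks, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}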


Beware Hidden Multithreading

• ESSL and IBM Fortran have autotasking-like “features” which function via the creation of unspecified numbers of threads.

• The Fortran RANDOM_NUMBER intrinsic has some well-known scaling problems:

http://www.nersc.gov/projects/scaling/random_number.html

• XLF can use threads to auto-parallelize code via “-qsmp=auto”

• ESSL’s libesslsmp.a has an autotasking feature

• Synchronization problems are unpredictable when these features are used, and performance suffers when too many threads are created.


MP_LABELIO, phost

• Labeled I/O will let you know which task generated the message “segmentation fault”, gave the wrong answer, etc.:

export MP_LABELIO=yes

• Run /usr/common/usg/bin/phost prior to your parallel program to map machine names to POE tasks
  – MPI and LAPI versions are available

  – Host lists are useful in general


Core files

• Core dumps don’t scale (no parallel work)

• MP_COREDIR=none : no corefile I/O
• MP_COREFILE_FORMAT=light_core : less I/O
• Use an LL script to save just one full-fledged core file and throw away the others:

…
if [ "$MP_CHILD" != "0" ]; then
  export MP_COREDIR=/dev/null   # keep a full core file only from task 0
fi
…


Debugging

• In general, debugging at 512 tasks and above is error-prone and cumbersome.

• Debug at a smaller scale when possible.

• Use the shared-memory device of MPICH on a workstation with lots of memory as a mock-up of a high-concurrency environment.

• For crashed jobs, examine the LL logs for memory usage history (ask a NERSC consultant for help with this).


Parallel I/O

• Parallel I/O can be a significant source of variation in task completion times prior to synchronization

• Limit the number of readers or writers when appropriate. Pay attention to file creation rates.

• Output reduced quantities when possible


Summary

• Resources are present to face the challenges posed by scaling up MPI applications on seaborg.

• Hopefully, scientists will expand their problem scopes to tackle increasingly challenging computational problems.

• NERSC consultants can provide help in achieving scaling goals.


Scaling of Parallel I/O on GPFS


Motivation

• NERSC uses GPFS for $HOME and $SCRATCH

• Local disk filesystems on seaborg (/tmp) are tiny

• Growing data sizes and concurrencies often outpace I/O methodologies



Each compute node relies on the GPFS nodes as gateways to storage.

16 nodes are dedicated to serving GPFS filesystems.


Common Problems when Implementing Parallel IO

• CPU utilization suffers as time is lost to I/O

• Variation in write times can be severe, leading to batch job failure

[Figure: time (s) to write 100 GB over four iterations; y-axis 0–500 s]


Finding solutions

• Focus on the checkpoint (saving state) I/O pattern

• Survey strategies to determine the write rate and its variation (a simple timing sketch follows)
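A hedged sketch of such a survey measurement, assuming each task times its own checkpoint write with MPI_Wtime; report_write_rate and its arguments are illustrative names, not from the slides.

#include <mpi.h>
#include <stdio.h>

/* Report the aggregate write rate (based on the slowest task) and the spread
   between the fastest and slowest per-task write times.  nbyte is the number
   of bytes written by each task. */
void report_write_rate(double t_start, double t_end, long nbyte)
{
    int rank, ntasks;
    double t = t_end - t_start, tmin, tmax;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Reduce(&t, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("write: %.1f MB/s aggregate, per-task time %.1f-%.1f s\n",
               (double)nbyte * ntasks / tmax / 1.0e6, tmin, tmax);
}

Running this over several iterations of each candidate strategy gives both the rate and the variation in rate that the slides compare.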


Parallel I/O Strategies


Multiple File I/O

if (private_dir) rank_dir(1, rank);   /* enter a private per-rank directory (helper defined elsewhere) */
fp = fopen(fname_r, "w");             /* each task opens its own file */
fwrite(data, nbyte, 1, fp);           /* write this task's data */
fclose(fp);
if (private_dir) rank_dir(0, rank);   /* leave the per-rank directory */
MPI_Barrier(MPI_COMM_WORLD);          /* wait until all tasks have written */


Single File I/O

/* All tasks write disjoint, contiguous chunks of one shared file. */
fd = open(fname, O_CREAT | O_RDWR, S_IRUSR);

lseek(fd, (off_t)rank * (off_t)nbyte, SEEK_SET);   /* seek to this task's offset */

write(fd, data, nbyte);                            /* write this task's chunk */

close(fd);


MPI-I/O

MPI_Info_set(mpiio_file_hints, MPIIO_FILE_HINT0);         /* shorthand for a key/value hint pair */
MPI_File_open(MPI_COMM_WORLD, fname,
              MPI_MODE_CREATE | MPI_MODE_RDWR,
              mpiio_file_hints, &fh);
MPI_File_set_view(fh, (off_t)rank*(off_t)nbyte,
                  MPI_DOUBLE, MPI_DOUBLE, "native",
                  mpiio_file_hints);                       /* each task views its own region of the file */
MPI_File_write_all(fh, data, ndata, MPI_DOUBLE, &status);  /* collective write */
MPI_File_close(&fh);


Results


Scaling of single file I/O


Scaling of multiple file and MPI I/O


Large block I/O

• MPI I/O on the SP includes the file hint IBM_largeblock_io

• IBM_largeblock_io=true was used throughout these tests; the default setting shows large variation (how the hint is set is sketched below)

• IBM_largeblock_io=true also turns off data shipping
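For reference, a sketch of how such a hint is passed through an MPI_Info object; the hint name comes from the slide, while the function wrapper and filename handling are illustrative.

#include <mpi.h>

/* Open a shared file with the IBM_largeblock_io hint enabled.  Call after
   MPI_Init; the caller is responsible for closing the returned file handle. */
MPI_File open_with_largeblock_hint(const char *fname)
{
    MPI_Info hints;
    MPI_File fh;

    MPI_Info_create(&hints);
    MPI_Info_set(hints, "IBM_largeblock_io", "true");   /* hint named on the slide */
    MPI_File_open(MPI_COMM_WORLD, (char *)fname,
                  MPI_MODE_CREATE | MPI_MODE_RDWR, hints, &fh);
    MPI_Info_free(&hints);                               /* hints are applied at open time */
    return fh;
}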


Large block I/O = false

• MPI-I/O on the SP includes the file hint IBM_largeblock_io

• Except in the figure above, IBM_largeblock_io=true was used throughout

• IBM_largeblock_io=true also turns off data shipping


Bottlenecks to scaling

• Single file I/O has a tendency to serialize

• Scaling up with multiple files creates filesystem problems

• Akin to data shipping, consider the intermediate case: aggregate within each SMP node before touching the filesystem (a sketch follows)
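A hedged sketch of that intermediate case, assuming a per-node communicator is already available (for example from the MPI_Comm_split sketch earlier); the buffer handling, function name, and file naming are illustrative, not from the slides.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Gather nbyte bytes from every task on an SMP node onto the node's rank-0
   task, which then writes one file per node; node_id distinguishes nodes in
   the file name. */
void write_aggregated(MPI_Comm node_comm, const char *data, int nbyte, int node_id)
{
    int node_rank, node_size;
    char *buf = NULL;
    char fname[64];

    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    if (node_rank == 0)
        buf = malloc((size_t)nbyte * node_size);    /* aggregator's gather buffer */

    /* every task on the node sends its chunk to the node's aggregator (rank 0) */
    MPI_Gather((void *)data, nbyte, MPI_BYTE, buf, nbyte, MPI_BYTE, 0, node_comm);

    if (node_rank == 0) {                           /* only aggregators touch GPFS */
        FILE *fp;
        sprintf(fname, "ckpt.node%04d", node_id);   /* one file per node */
        fp = fopen(fname, "w");
        fwrite(buf, (size_t)nbyte, (size_t)node_size, fp);
        fclose(fp);
        free(buf);
    }
}

This keeps the number of writers equal to the number of nodes rather than the number of tasks, which limits both file-creation rates and contention on the GPFS gateway nodes.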


Parallel IO with SMP aggregation (32 tasks)


Parallel IO with SMP aggregation (512 tasks)


Summary

[Figure: recommended I/O strategy as a function of concurrency (16–2048 tasks) and aggregate data size (1 MB–100 GB); strategies shown are Serial, Multiple File, Multiple File (mod n), MPI IO, and MPI IO collective]
