Parallel I/O Course
Transcript of Parallel I/O Course (8/2/2019)
-
Parallel I/O
SciNet
www.scinet.utoronto.ca
University of Toronto
Toronto, Canada
October 19, 2010
-
Outline / Schedule
1 Introduction
2 File Systems and I/O
3 Data Management
4 Break
5 Parallel I/O
6 MPI-IO
7 HDF5/NETCDF
-
Disk I/O
Common Uses
Checkpoint/Restart Files
Data Analysis
Data Organization
Time accurate and/or Optimization Runs
Batch and Data processing
Database
-
Disk I/O
Common Bottlenecks
Mechanical disks are slow!
System call overhead (open, close, read, write)
Shared file system (NFS, Lustre, GPFS, etc.)
HPC systems typically designed for high bandwidth (GB/s), not IOPs
Uncoordinated independent accesses
-
Disk Access Rates over Time
Figure by R. Ross, Argonne National Laboratory, CScADS09
-
Memory/Storage Latency
Figure by R. Freitas and L Chiu, IBM Almaden Labs, FAST10
-
Definitions
IOPs: Input/Output Operations Per Second (read, write, open, close, seek)
I/O Bandwidth
Quantity you read/write (think network bandwidth)
Comparisons

Device     Bandwidth (MB/s)   per-node   IOPs    per-node
SATA HDD   100                100        100     100
SSD        250                250        4000    4000
SciNet     5000               1.25       30000   7.5
-
SciNet Filesystem
File System
1,790 1TB SATA disk drives, for a total of 1.4 PB
Two DCS9900 couplets, each delivering:
4-5 GB/s read/write access (bandwidth)
30,000 IOPs max (open, close, seek, . . . )
Single GPFS file system on TCS and GPC
I/O goes over the Gb ethernet network on GPC (InfiniBand on TCS)
File system is parallel!
-
I/O Software Stack
-
Parallel File System
-
Parallel File System
File Locks
Most parallel file systems use locks to manage concurrent file access
Files are broken up into lock units
Clients obtain locks on units that they will access before I/O occurs
Enables caching on clients as well (as long as a client holds a lock, it knows its cached data is valid)
Locks are reclaimed from clients when others desire access
-
Parallel File System
Optimal for large shared files.
Behaves poorly under many small reads and writes, high IOPs
Your use of it affects everybody!
(Different from the case with CPU and RAM, which are not shared.)
How you read and write, your file format, the number of files in a directory, and how often you ls, affects every user!
The file system is shared over the ethernet network on GPC:
hammering the file system can hurt process communications.
File systems are not infinite!
Bandwidth, metadata, IOPs, number of files, space, . . .
-
Parallel File System
2 jobs doing simultaneous I/O can take much longer than twice a single job duration due to disk contention and directory locking.
SciNet: 500+ users doing I/O from 4000 nodes.
That's a lot of sharing and contention!
-
I/O Best Practices
Make a plan
Make a plan for your data needs:
How much will you generate,
How much do you need to save,
And where will you keep it?
Note that /scratch is temporary storage for 3 months or less.
Options?
1 Save on your departmental/local server/workstation
(it is possible to transfer TBs per day on a gigabit link);
2 Apply for a project space allocation at next RAC call
(but space is very limited);
3 Buy tapes through us ($100/TB) and we can archive your data to tape; HSM possibility within next 6 months;
4 Change storage format.
-
I/O Best Practices
Monitor and control usage
Minimize use of filesystem commands like ls and du.
Regularly check your disk usage using /scinet/gpc/bin/diskUsage.
Warning signs which should prompt careful consideration:
More than 100,000 files in your space
Average file size less than 100 MB
Monitor disk actions with top and strace
RAM is always faster than disk; think about using ramdisk.
Use gzip and tar to compress files and to bundle many files into one
Try gzipping your data files. 30% not atypical!
Delete files that are no longer needed
Do housekeeping (gzip, tar, delete) regularly.
-
I/O Best Practices
Dos
Write binary format files
Faster I/O and less space than ASCII files.
Use parallel I/O if writing from many nodes
Maximize size of files. Large block I/O optimal!
Minimize number of files. Makes filesystem more responsive!
Don'ts
Don't write lots of ASCII files. Lazy, slow, and wastes space!
Don't write many hundreds of files in a single directory. (File locks!)
Don't write many small files (< 10 MB).
The system is optimized for large-block I/O.
-
1 Introduction
2 File Systems and I/O
3 Data Management
4 Break
5 Parallel I/O
6 MPI-IO
7 HDF5/NETCDF
-
Data Management
Formats
ASCII
Binary
Metadata (XML)
Databases
Standard Libraries (HDF5, NetCDF)
-
ASCII
American Standard Code for Information Interchange
Pros
Human Readable
Portable (architecture independent)
Cons
Inefficient Storage
Expensive for Read/Write (conversions)
-
Native Binary
100100100
Pros
Efficient Storage (256 floats @ 4 bytes takes 1024 bytes)
Efficient Read/Write (native)
Cons
Have to know the format to read
Portability (Endianness)
-
ASCII vs. binary
Writing 128M doubles
Format   /scratch (GPFS)   /dev/shm (RAM)   /tmp (disk)
ASCII    173s              174s             260s
Binary   6s                1s               20s

Syntax

Format   C           FORTRAN
ASCII    fprintf()   open(6,file='test',form='formatted'); write(6,*)
Binary   fwrite()    open(6,file='test',form='unformatted'); write(6)
-
What is Metadata?
Data about Data
File System: size, location, date, owner, etc.
App Data: File format, version, iteration, etc.
Example: XML
(markup lost in transcription; sample field values: UTF, 1000, 6.8, January 15th, 2010, 47 23.516 -122 02.625)
-
Databases
Beyond flat files
Very powerful and flexible storage approach
Data organization and analysis can be greatly simplified
Enhanced performance over seek/sort depending on usage
Open Source Software
SQLite (serverless)
PostgreSQL
MySQL
-
Standard Formats
CGNS (CFD General Notation System)
IGES/STEP (CAD Geometry)
HDF5 (Hierarchical Data Format)
NetCDF (Network Common Data Format)
-
HDF5/NetCDF - Serial
-
1 Introduction
2 File Systems and I/O
3 Data Management
4 Break
5 Parallel I/O
6 MPI-IO
7 HDF5/NETCDF
-
1 Introduction
2 File Systems and I/O
3 Data Management
4 Break
5 Parallel I/O
6 MPI-IO
7 HDF5/NETCDF
-
Common Ways of Doing Parallel I/O
Sequential I/O (only proc 0 writes/reads)
Pro
Trivially simple for small I/O
Some I/O libraries not parallel
Con
Bandwidth limited by rate one client can sustain
May not have enough memory on node to hold all data
Won't scale (built-in bottleneck)
-
Common Ways of Doing Parallel I/O
N files for N Processes
Pro
No interprocess communication or coordination necessary
Possibly better scaling than single sequential I/O
Con
As process counts increase, lots of (small) files; won't scale
Data often must be post-processed into one file
Uncoordinated I/O may swamp the file system (file locks!)
-
Common Ways of Doing Parallel I/O
All Processes Access One File
Pro
Only one file
Data can be stored canonically, avoiding post-processing
Will scale if done correctly
Con
Uncoordinated I/O WILL swamp the file system (file locks!)
Requires more design and thought
-
Parallel I/O
What is Parallel I/O?
Multiple processes of a parallel program accessing data (reading or writing) from a common file.
-
Parallel I/O
Why Parallel I/O?
Non-parallel I/O is simple but:
Poor performance (single process writes to one file)
Awkward and not interoperable with other tools (each process writes a separate file)
Parallel I/O
Higher performance through collective and contiguous I/O
Single file (visualization, data management, storage, etc.)
Works with the file system, not against it
-
Contiguous and Noncontiguous I/O
Contiguous I/O moves data from a single memory block into a single file block.
Noncontiguous I/O has three forms:
Noncontiguous in memory, in file, or in both
Structured data leads naturally to noncontiguous I/O (e.g. block decomposition)
Describing noncontiguous accesses with a single operation passes more knowledge to the I/O system
-
Independent and Collective I/O
Independent I/O operations specify only what a single process will do
Such calls obscure relationships between I/O on different processes
Many applications have phases of computation and I/O
During I/O phases, all processes read/write data
We can say they are collectively accessing storage
Collective I/O is coordinated access to storage by a group of processes
Collective functions are called by all processes participating in I/O
Allows the file system to know more about the access as a whole: more optimization in lower software layers, better performance
-
Parallel I/O
Available Approaches
MPI-IO: MPI-2 Language Standard
HDF (Hierarchical Data Format)
NetCDF (Network Common Data Format)
Adaptable IO System (ADIOS)
Actively developed (OLCF, Sandia NL, Georgia Tech) and used on the largest HPC systems (Jaguar, Blue Gene/P)
External to the code: an XML file describing the various elements
Uses MPI-IO, can work with HDF/NetCDF
-
MPI-IO
-
MPI: Message Passing Interface
Language-independent communications protocol used to program parallel computers.
MPI-IO: Parallel file access protocol
MPI-IO: The parallel I/O part of the MPI-2 standard (1996).
Many other parallel I/O solutions are built upon it.
Versatile, and better performance than standard Unix I/O.
Usually collective I/O is the most efficient.
-
MPI-IO
Advantages of MPI-IO
noncontiguous access of files and memory
collective I/O
individual and shared file pointers
explicit offsets
portable data representation
can give hints to implementation/file system
no text/formatted output!
-
MPI-IO
MPI concepts
Process: An instance of your program, often 1 per core.
Communicator: Groups of processes and their topology.
Standard communicators:
MPI_COMM_WORLD: all processes launched by mpirun.
MPI_COMM_SELF: just this process.
Size: the number of processes in the communicator.
Rank: a unique number assigned to each process in the communicator group.
When using MPI, each process calls MPI_Init at the beginning and MPI_Finalize at the end of your program.
-
MPI-IO
Basic MPI code example
in C:
#include <mpi.h>
int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    ...
    MPI_Finalize();
    return 0;
}
in Fortran:
program main
include 'mpif.h'
integer rank, nprocs
integer ierr
call MPI_INIT(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
...
call MPI_FINALIZE(ierr)
return
end
-
MPI-IO
MPI-IO exploits analogies with MPI
Writing is like sending a message
Reading is like receiving a message
File access grouped via communicator: collective operations
User-defined MPI datatypes for e.g. noncontiguous data layout
I/O latency hiding much like communication latency hiding (I/O may even share the network with communication)
All functionality through function calls.
-
MPI-IO
Basic I/O Operations - C
int MPI_File_open(MPI_Comm comm, char *filename, int amode,
                  MPI_Info info, MPI_File *fh)
int MPI_File_seek(MPI_File fh, MPI_Offset offset, int whence)
int MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype,
                      MPI_Datatype filetype, char *datarep, MPI_Info info)
int MPI_File_read(MPI_File fh, void *buf, int count,
                  MPI_Datatype datatype, MPI_Status *status)
int MPI_File_write(MPI_File fh, void *buf, int count,
                   MPI_Datatype datatype, MPI_Status *status)
int MPI_File_close(MPI_File *fh)
-
MPI-IO
Basic I/O Operations - Fortran
MPI_FILE_OPEN(comm, filename, amode, info, fh, ierr)
  character*(*) filename
  integer comm, amode, info, fh, ierr
MPI_FILE_SEEK(fh, offset, whence, ierr)
  integer(kind=MPI_OFFSET_KIND) offset
  integer fh, whence, ierr
MPI_FILE_SET_VIEW(fh, disp, etype, filetype, datarep, info, ierr)
  integer(kind=MPI_OFFSET_KIND) disp
  integer fh, etype, filetype, info, ierr
  character*(*) datarep
MPI_FILE_READ(fh, buf, count, datatype, status, ierr)
  <type> buf(*)
  integer fh, count, datatype, status(MPI_STATUS_SIZE), ierr
MPI_FILE_WRITE(fh, buf, count, datatype, status, ierr)
  <type> buf(*)
  integer fh, count, datatype, status(MPI_STATUS_SIZE), ierr
MPI_FILE_CLOSE(fh, ierr)
  integer fh, ierr
-
MPI-IO
Opening and closing a file
Files are maintained via file handles. Open files with MPI_File_open.
The following codes open a file for reading, and close it right away:
in C:
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "test.dat", MPI_MODE_RDONLY,
              MPI_INFO_NULL, &fh);
MPI_File_close(&fh);
in Fortran:
integer fh, ierr
call MPI_FILE_OPEN(MPI_COMM_WORLD, "test.dat", &
                   MPI_MODE_RDONLY, MPI_INFO_NULL, fh, ierr)
call MPI_FILE_CLOSE(fh, ierr)
-
MPI-IO
Opening a file requires...
communicator,
file name,
file handle, for all future reference to the file,
file mode, made up of combinations of:
MPI_MODE_RDONLY            read only
MPI_MODE_RDWR              reading and writing
MPI_MODE_WRONLY            write only
MPI_MODE_CREATE            create file if it does not exist
MPI_MODE_EXCL              error if creating file that exists
MPI_MODE_DELETE_ON_CLOSE   delete file on close
MPI_MODE_UNIQUE_OPEN       file not to be opened elsewhere
MPI_MODE_SEQUENTIAL        file to be accessed sequentially
MPI_MODE_APPEND            position all file pointers to end
info structure, or MPI_INFO_NULL.
In Fortran, the error code is the function's last argument.
In C, the function returns the error code.
-
MPI-IO
etypes, filetypes, file views
To make binary access a bit more natural for many applications, MPI-IO defines file access through the following concepts:
1 etype: Allows accessing the file in units other than bytes. (Some parameters still have to be given in bytes.)
2 filetype: Each process defines what part of a shared file it uses.
Filetypes specify a pattern which gets repeated in the file.
Useful for noncontiguous access.
For contiguous access, often etype=filetype.
3 displacement: Where to start in the file, in bytes.
Together, these specify the file view, set by MPI_File_set_view.
The default view has etype=filetype=MPI_BYTE and displacement 0.
-
MPI-IO
Contiguous Data
Processes P(0), P(1), P(2), P(3), ... write their views view(0), view(1), view(2), view(3), ... contiguously into one file:

int buf[...];
MPI_Offset bufsize = ...;
MPI_File_open(MPI_COMM_WORLD, "file", MPI_MODE_WRONLY,
              MPI_INFO_NULL, &fh);
MPI_Offset disp = rank*bufsize*sizeof(int);
MPI_File_set_view(fh, disp, MPI_INT, MPI_INT, "native",
                  MPI_INFO_NULL);
MPI_File_write(fh, buf, bufsize, MPI_INT, MPI_STATUS_IGNORE);
MPI_File_close(&fh);
-
MPI-IO
Overview of all read functions
                         Single task              Collective
Individual file pointer
  blocking               MPI_File_read            MPI_File_read_all
  nonblocking            MPI_File_iread           MPI_File_read_all_begin
                         + (MPI_Wait)             MPI_File_read_all_end
Explicit offset
  blocking               MPI_File_read_at         MPI_File_read_at_all
  nonblocking            MPI_File_iread_at        MPI_File_read_at_all_begin
                         + (MPI_Wait)             MPI_File_read_at_all_end
Shared file pointer
  blocking               MPI_File_read_shared     MPI_File_read_ordered
  nonblocking            MPI_File_iread_shared    MPI_File_read_ordered_begin
                         + (MPI_Wait)             MPI_File_read_ordered_end
-
MPI-IO
Overview of all write functions
                         Single task               Collective
Individual file pointer
  blocking               MPI_File_write            MPI_File_write_all
  nonblocking            MPI_File_iwrite           MPI_File_write_all_begin
                         + (MPI_Wait)              MPI_File_write_all_end
Explicit offset
  blocking               MPI_File_write_at         MPI_File_write_at_all
  nonblocking            MPI_File_iwrite_at        MPI_File_write_at_all_begin
                         + (MPI_Wait)              MPI_File_write_at_all_end
Shared file pointer
  blocking               MPI_File_write_shared     MPI_File_write_ordered
  nonblocking            MPI_File_iwrite_shared    MPI_File_write_ordered_begin
                         + (MPI_Wait)              MPI_File_write_ordered_end
-
MPI-IO
Collective vs. single task
After a file has been opened and a fileview is defined, processes can independently read and write to their part of the file.
If the I/O occurs at regular spots in the program, which different processes reach at the same time, it will be better to use collective I/O: these are the _all versions of the MPI-IO routines.
Two file pointers
An MPI-IO file has two different file pointers:
1 individual file pointer: one per process.
2 shared file pointer: one per file (the shared/ordered routines).
Shared doesn't mean collective, but it does imply synchronization!
-
MPI-IO
Strategic considerations
Pros for single task I/O
One can virtually always use only individual file pointers.
If timings are variable, there is no need to wait for other processes.
Cons
If there are interdependences between how processes write, collective I/O operations may be faster.
Collective I/O can collect data before doing the write or read.
True speed depends on the file system, the size of the data to write, and the implementation.
-
MPI-IO
Noncontiguous Data
Processes P(0), P(1), P(2), P(3), ... write interleaved blocks into one file.
Filetypes to the rescue!
Define a 2-etype basic MPI Datatype.
Increase its size to 8 etypes.
Shift according to rank to pick out the right 2 etypes.
Use the result as the filetype in the file view.
Then gaps are automatically skipped.
-
MPI-IO
Overview of data/filetype constructors
Function                         Creates a. . .
MPI_Type_contiguous              contiguous datatype
MPI_Type_vector                  vector (strided) datatype
MPI_Type_indexed                 indexed datatype
MPI_Type_create_indexed_block    indexed datatype w/uniform block length
MPI_Type_create_struct           structured datatype
MPI_Type_create_resized          type with new extent and bounds
MPI_Type_create_darray           distributed array datatype
MPI_Type_create_subarray         n-dim subarray of an n-dim array
. . .
Before using the created type, you have to call MPI_Type_commit.
-
MPI-IO
Accessing a noncontiguous file type
in C:
MPI_Datatype contig, ftype;
MPI_Datatype etype = MPI_INT;
MPI_Aint extent = sizeof(int)*8;    /* in bytes! */
MPI_Offset d = 2*sizeof(int)*rank;  /* in bytes! */
MPI_Type_contiguous(2, etype, &contig);
MPI_Type_create_resized(contig, 0, extent, &ftype);
MPI_Type_commit(&ftype);
MPI_File_set_view(fh, d, etype, ftype, "native",
                  MPI_INFO_NULL);
-
MPI-IO
Accessing a noncontiguous file type
in Fortran:
integer :: etype, extent, contig, ftype, ierr
integer(kind=MPI_OFFSET_KIND) :: d
etype = MPI_INT
extent = 4*8
d = 2*4*rank
call MPI_TYPE_CONTIGUOUS(2, etype, contig, ierr)
call MPI_TYPE_CREATE_RESIZED(contig, 0, extent, ftype, ierr)
call MPI_TYPE_COMMIT(ftype, ierr)
call MPI_FILE_SET_VIEW(fh, d, etype, ftype, "native", &
                       MPI_INFO_NULL, ierr)
-
MPI-IO
File data representation
native: Data is stored in the file as it is in memory: no conversion is performed. No loss in performance, but not portable.
internal: Implementation-dependent conversion. Portable across machines with the same MPI implementation, but not across different implementations.
external32: Specific data representation, basically 32-bit big-endian IEEE format. See the MPI Standard for more info. Completely portable, but not the best performance.
These have to be given to MPI_File_set_view as strings.
-
MPI-IO
More noncontiguous data: subarrays
What if there's a 2D matrix that is distributed across processes?
(figure: a 2D array split over procs 0-3 in a 2x2 grid)
Common cases of noncontiguous access have specialized functions:
MPI_Type_create_subarray & MPI_Type_create_darray.
-
MPI-IO
More noncontiguous data: subarrays
int gsizes[2] = {16, 6};
int lsizes[2] = {8, 3};
int psizes[2] = {2, 2};
int coords[2] = {rank % psizes[0], rank / psizes[0]};
int starts[2] = {coords[0]*lsizes[0], coords[1]*lsizes[1]};
MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                         MPI_ORDER_C, MPI_INT, &filetype);
MPI_Type_commit(&filetype);
MPI_File_set_view(fh, 0, MPI_INT, filetype, "native",
                  MPI_INFO_NULL);
MPI_File_write_all(fh, local_array, local_array_size, MPI_INT,
                   MPI_STATUS_IGNORE);
Tip
MPI_Cart_create can be useful to compute coords for a proc.
-
MPI-IO
Example: N-body MD checkpointing
Challenge
Simulate n-body molecular dynamics, Lennard-Jones potential.
Parallel MPI run: atoms distributed.
At intervals, checkpoint the state of system to 1 file.
Restart should be allowed to use more or fewer MPI processes.
Restart should be efficient.
State of the system: total of tot atoms with properties:
struct Atom {
    double q[3];
    double p[3];
    long long tag;
    long long id;
};

type atom
    double precision :: q(3)
    double precision :: p(3)
    integer(kind=8) :: tag
    integer(kind=8) :: id
end type atom
-
MPI-IO
Example: N-body MD checkpointing
Issues
Atom data more than array of doubles: indices etc.
Writing problem: processes have different numbers of atoms; how to define views, subarrays, ... ?
Reading problem: not known which process gets which atoms.
Even worse if number of processes is changed in restart.
Approach
Abstract the atom datatype.
Compute where in the file each proc should write + how much.
Store that info in a header.
Restart with same nprocs is then straightforward.
Different nprocs: MPI exercise outside scope of 1-day class.
-
MPI-IO
Example: N-body MD checkpointing
Defining the Atom etype
struct Atom atom;
int i, len[4] = {3, 3, 1, 1};
MPI_Aint addr[5];
MPI_Aint disp[4];
MPI_Get_address(&atom, &(addr[0]));
MPI_Get_address(&(atom.q[0]), &(addr[1]));
MPI_Get_address(&(atom.p[0]), &(addr[2]));
MPI_Get_address(&atom.tag, &(addr[3]));
MPI_Get_address(&atom.id, &(addr[4]));
for (i = 0; i < 4; i++)
    disp[i] = addr[i+1] - addr[0];
MPI_Datatype t[4] = {MPI_DOUBLE, MPI_DOUBLE,
                     MPI_LONG_LONG, MPI_LONG_LONG};
MPI_Type_create_struct(4, len, disp, t, &MPI_ATOM);
MPI_Type_commit(&MPI_ATOM);
-
MPI-IO
Example: N-body MD checkpointing
File structure
tot | p | n[0] | n[1] | .. | n[p-1] | n[0] atoms | n[1] atoms | .. | n[p-1] atoms
header: integers
body: atoms
Functions
void cpwrite(struct Atom *a, int n, MPI_Datatype t, char *f, MPI_Comm c)
void cpread(struct Atom *a, int *n, int tot, MPI_Datatype t, char *f, MPI_Comm c)
void redistribute() (consider done)
-
MPI-IO: Checkpoint writing

void cpwrite(struct Atom *a, int n, MPI_Datatype t, char *f, MPI_Comm c)
{
    int p, r;
    MPI_Comm_size(c, &p);
    MPI_Comm_rank(c, &r);
    int header[p+2];
    MPI_Allgather(&n, 1, MPI_INT, &(header[2]), 1, MPI_INT, c);
    int i, nbelow = 0;
    for (i = 0; i < r; i++)
        nbelow += header[i+2];
    MPI_File h;
    MPI_File_open(c, f, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &h);
    if (r == p-1) {
        header[0] = nbelow + n;
        header[1] = p;
        MPI_File_write(h, header, p+2, MPI_INT, MPI_STATUS_IGNORE);
    }
    MPI_File_set_view(h, (p+2)*sizeof(int), t, t, "native", MPI_INFO_NULL);
    MPI_File_write_at(h, nbelow, a, n, t, MPI_STATUS_IGNORE);
    MPI_File_close(&h);
}
-
MPI-IO
Example: N-body MD checkpointing
Reading and redistributing atoms
Copy the example lj code to your own directory.
$ cp -r ~rzon/Courses/lj .
$ cd lj
Get your environment set up with
$ source ./envsetup
Type make to build.
$ make
Run:
$ mpirun -np 8 lj run.ini
Creates 70778.cp as a checkpoint (try ls -l).
Rerun to see that it successfully reads the checkpoint.
Run again with different # of processors. What happens?
-
MPI-IO
Example: N-body MD checkpointing
Reading and redistributing atoms
The lj program has an option to produce an xsf file with the trajectory, which can be displayed in vmd.
Edit run.ini, change the line 0 to 5
Careful: xsf is an ASCII format, and the file quickly gets huge!
Do:
$ mpirun -np 8 lj run.ini
Note the difference in speed!
Open in vmd:
$ vmd pos.xsf
-
MPI-IO
Example: N-body MD checkpointing
Checking binary files
Interactive binary file viewer in ~rzon/Courses/lj: cbin
Useful for quickly checking your binary format, without having to write a test program.
Start the program with:
$ cbin 70778.cp
Gets you a prompt, with the file loaded.
Commands at the prompt are a letter plus optional number.
E.g. i2 reads 2 integers, d6 reads 6 doubles.
Has support to reverse the endian-ness, switching betweenloaded files, and moving the file pointer.
Type ? for a list of commands.
Check the format of the file.
-
MPI-IO
Good References on MPI-IO
W. Gropp, E. Lusk, and R. Thakur, Using MPI-2: Advanced Features of the Message-Passing Interface (MIT Press, 1999).
J. M. May, Parallel I/O for High Performance Computing (Morgan Kaufmann, 2000).
W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg, W. Saphir, and M. Snir, MPI-2: The Complete Reference: Volume 2, The MPI-2 Extensions (MIT Press, 1998).
HDF5/NETCDF
-