High Performance Computing Course Notes 2007-2008 Parallel I/O.
Computer Science, University of Warwick
Aims
To learn how to achieve higher I/O performance
To use a concrete implementation (MPI-IO):
Some concepts, including: etypes, displacement and views
Collective vs. non-collective I/O
Contiguous vs. non-contiguous I/O
High Performance Parallel I/O
Why are we looking at parallel I/O?
I/O is a major bottleneck in many parallel applications
I/O subsystems for parallel machines may be designed for high performance, yet many applications achieve less than a tenth of the peak I/O bandwidth
Parallel I/O systems are designed for large data transfers (megabytes of data)
However, many parallel applications make many smaller I/O requests (less than a kilobyte each)
Parallel I/O – version 1.0
Early solutions:
All processes send data to process 0, which then writes to the file
Phase 1: processes 0, 1, 2 and 3, holding d0, d1, d2 and d3, all send their data to process 0
Phase 2: process 0 writes d0 d1 d2 d3 to the file
[Figure: four processes funnelling their data through process 0 into a single file]
Bad things about version 1.0
1. Single node bottleneck
2. Poor performance
3. Poor scalability
4. Single point of failure
Good things about version 1.0
The parallel machine need only support I/O from one process
No specialized I/O library is needed
If you are converting from sequential code, the parallel version of the program stays close to the original
Results in a single file which is easy to manage
Parallel I/O – version 1.0
Parallel I/O – version 2.0
Each process writes to a separate file; all processes can now write in one phase
[Figure: processes 0–3 write d0–d3 to File 1–File 4 respectively]
Good things about version 2.0
1. Now we are doing things in parallel
2. High performance
Bad things about version 2.0
1. We now have lots of small files to manage
2. How do we read the data back when the number of processes changes?
3. Does not interoperate well with other applications
Parallel I/O – version 3.0
Multiple processes of the parallel program access (read/write) data from a common file
All processes can now write in one phase, to one common file
[Figure: processes 0–3 write d0–d3 into a single shared file]
Good things about version 3.0
Simultaneous I/O from any number of processes
Maps well onto collective operations
Excellent performance and scalability
Results in a single file which is easy to manage and interoperates well with other applications
Bad things about version 3.0
Requires more complex I/O library support
What is Parallel I/O?
Multiple processes of a parallel program accessing data (reading or writing) from a common file
[Figure: processes P0, P1, P2, …, P(n-1) all accessing a single FILE]
Non-parallel I/O
Simple
Poor performance – if a single process is writing to one file
Hard to interoperate with other applications – if writing to more than one file
Parallel I/O
Provides high performance
Provides a single file with which it is easy to interoperate with other tools (e.g. visualization systems)
If you design it right, you can use existing features of parallel libraries such as collectives and derived datatypes
Why Parallel I/O?
We are going to be looking at parallel I/O in the context of MPI, why?
Because writing is like sending a message, reading is like receiving
Because collective-like operations are important in parallel I/O
Because non-contiguous data layout is important (if we are using a single file), supported by MPI datatypes
Parallel I/O is now an integral part of MPI-2
Why Parallel I/O?
Parallel I/O example
Consider an example of a 2D array distributed among 16 processors
Array stored in row-major order
[Figure: the array is decomposed into a 4×4 grid of blocks owned by P0–P15; in the corresponding row-major file, each band of rows interleaves four processes' pieces: P0 P1 P2 P3 repeated for the first four rows, then P4 P5 P6 P7, and so on]
Access pattern 1: MPI_File_seek
Updates the individual file pointer
int MPI_File_seek( MPI_File mpi_fh, MPI_Offset offset, int whence );
Parameters
mpi_fh : [in] file handle (handle)
offset : [in] file offset (integer)
whence : [in] update mode (state)
MPI_FILE_SEEK updates the individual file pointer according to whence, which has the following possible values:
MPI_SEEK_SET: the pointer is set to offset
MPI_SEEK_CUR: the pointer is set to the current pointer position plus offset
MPI_SEEK_END: the pointer is set to the end of file plus offset
Access pattern 1: MPI_File_read
Read using individual file pointer
int MPI_File_read( MPI_File mpi_fh, void *buf, int count, MPI_Datatype datatype, MPI_Status *status );
Parameters
mpi_fh: [in] file handle (handle)
buf: [out] initial address of buffer
count: [in] number of elements in buffer (nonnegative integer)
datatype: [in] datatype of each buffer element (handle)
status: [out] status object (Status)
Access pattern 1
We could do a UNIX-style access pattern in MPI-IO
One independent read request is done for each row in the local array
Many independent, contiguous requests
MPI_File_open(… , “filename”, … , &fh)
for(i=0; i < n_local_rows; i++) {
MPI_File_seek (fh, offset, …)
MPI_File_read (fh, row[i], …)
}
MPI_File_close (&fh)
Individual file pointers per process per file handle
Each process sets the file pointer with some suitable offset
The data is then read into the local array
This is not a collective operation; each process issues independent (blocking) reads
Access pattern 2: MPI_File_read_all
Collective read using individual file pointer
int MPI_File_read_all( MPI_File mpi_fh, void *buf, int count, MPI_Datatype datatype, MPI_Status *status );
Parameters
fh : [in] file handle (handle)
buf : [out] initial address of buffer (choice)
count : [in] number of elements in buffer (nonnegative integer)
datatype : [in] datatype of each buffer element (handle)
status : [out] status object (Status)
MPI_FILE_READ_ALL is a collective version of the blocking MPI_FILE_READ interface.
Access pattern 2
Similar to access pattern 1 but using collectives
All processes that opened the file read data together (each with its own access information)
Many collective, contiguous requests
MPI_File_open(… , “filename”, … , &fh)
for(i=0; i < n_local_rows; i++) {
MPI_File_seek (fh, offset, …)
MPI_File_read_all (fh, row[i], …)
}
MPI_File_close (&fh)
read_all is a collective version of the read operation
This is blocking
Each process accesses the file at the same time
This may be useful: independent I/O operations do not convey what other processes are doing at the same time, so the library cannot coordinate or optimize across them
File
Ordered collection of typed data items
MPI supports random or sequential access
Opened collectively by a group of processes
All collective I/O calls on file are done over this group
Displacement
Absolute byte position relative to the beginning of a file
Defines the location where a view begins
etype (elementary datatype)
Unit of data access and positioning
Can be a predefined or derived datatype
Offsets are expressed as multiples of etypes
Access pattern 3: Definitions
Filetype
Basis for partitioning the file among processes and defines a template for accessing the file (based on etype)
View
Current set of data visible and accessible from an open file (as an ordered set of etypes)
Each process has its own view, based on a displacement, an etype and a filetype
Pattern defined by filetype is repeated (in units of etypes) beginning at the displacement
Access pattern 3: Definitions
Access pattern 3: File Views
Specified by a triplet (displacement, etype, and filetype) passed to MPI_File_set_view
displacement = number of bytes to be skipped from the start of the file
etype = basic unit of data access (can be any basic or derived datatype)
filetype = specifies which portion of the file is visible to the process
Access pattern 3: A Simple Noncontiguous File View Example
etype = MPI_INT
filetype = two MPI_INTs followed by a gap of four MPI_INTs
[Figure: starting at the displacement from the head of the FILE, the filetype pattern repeats: two ints visible, a four-int gap, and so on]
Access pattern 3: How do views relate to multiple processes?
[Figure: after the displacement, the filetypes of proc. 0, proc. 1 and proc. 2 interleave to tile the file]
A group of processes uses complementary views to achieve a global data distribution
Partitioning a file among parallel processes
MPI_File_set_view
Describes that part of the file accessed by a single MPI process.
int MPI_File_set_view( MPI_File mpi_fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, char *datarep, MPI_Info info );
Parameters
mpi_fh :[in] file handle (handle)
disp :[in] displacement (nonnegative integer)
etype :[in] elementary datatype (handle)
filetype :[in] filetype (handle)
datarep :[in] data representation (string)
info :[in] info object (handle)
Access pattern 3: File View Example
MPI_File thefile;
for (i = 0; i < BUFSIZE; i++)
    buf[i] = myrank * BUFSIZE + i;
MPI_File_open(MPI_COMM_WORLD, "testfile",
              MPI_MODE_CREATE | MPI_MODE_WRONLY,
              MPI_INFO_NULL, &thefile);
MPI_File_set_view(thefile, myrank * BUFSIZE * sizeof(int),
                  MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
MPI_File_write(thefile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
MPI_File_close(&thefile);
MPI_Type_create_subarray
Create a datatype for a subarray of a regular, multidimensional array
int MPI_Type_create_subarray( int ndims, int array_of_sizes[], int array_of_subsizes[], int array_of_starts[], int order, MPI_Datatype oldtype, MPI_Datatype *newtype );
Parameters
ndims :[in] number of array dimensions (positive integer)
array_of_sizes :[in] number of elements of type oldtype in each dimension of the full array (array of positive integers)
array_of_subsizes :[in] number of elements of type oldtype in each dimension of the subarray (array of positive integers)
array_of_starts :[in] starting coordinates of the subarray in each dimension (array of nonnegative integers)
order :[in] array storage order flag (state)
oldtype :[in] array element datatype (handle)
newtype :[out] new datatype (handle)
Using the Subarray Datatype
gsizes[0] = 16;  /* no. of rows in global array */
gsizes[1] = 16;  /* no. of columns in global array */
psizes[0] = 4;   /* no. of procs. in vertical dimension */
psizes[1] = 4;   /* no. of procs. in horizontal dimension */
lsizes[0] = 16 / psizes[0];  /* no. of rows in local array */
lsizes[1] = 16 / psizes[1];  /* no. of columns in local array */
dims[0] = 4; dims[1] = 4;
periods[0] = periods[1] = 1;
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm);
MPI_Comm_rank(comm, &rank);
MPI_Cart_coords(comm, rank, 2, coords);
Subarray Datatype contd.
/* global indices of first element of local array */
start_indices[0] = coords[0] * lsizes[0];
start_indices[1] = coords[1] * lsizes[1];
MPI_Type_create_subarray(2, gsizes, lsizes, start_indices,
                         MPI_ORDER_C, MPI_FLOAT, &filetype);
MPI_Type_commit(&filetype);
Access pattern 3
Each process creates a derived datatype to describe the non-contiguous access pattern
We thus have a file view and independent access
Single independent, non-contiguous request
MPI_Type_create_subarray (… , &subarray, …)
MPI_Type_commit (&subarray)
MPI_File_open(… , “filename”, … , &fh)
MPI_File_set_view (fh, … , subarray, …)
MPI_File_read (fh, local_array, …)
MPI_File_close (&fh)
Creates a datatype describing a subarray of a multi-dimensional array
Commits the datatype (must be done before the datatype is used)
The system may compile an internal representation for the datatype at commit time
Access pattern 3
Each process creates a derived datatype to describe the non-contiguous access pattern
We thus have a file view and independent access
Single independent, non-contiguous request
MPI_Type_create_subarray (… , &subarray, …)
MPI_Type_commit (&subarray)
MPI_File_open(… , “filename”, … , &fh)
MPI_File_set_view (fh, … , subarray, …)
MPI_File_read (fh, local_array, …)
MPI_File_close (&fh)
Opens the file as before
Now changes the process's view of the data in the file using set_view
set_view is collective
Although the reads are still independent
Note here that we are reading the whole sub-array despite the non-contiguous storage
[Figure: as before, the array decomposed into a 4×4 grid of blocks P0–P15, and the corresponding row-major file interleaving P0 P1 P2 P3, then P4 P5 P6 P7, and so on]
Processes {4,5,6,7}, {8,9,10,11}, {12,13,14,15} will have file views
based on the same filetypes but with different displacements
Access pattern 3
[Figure: the filetypes of proc. 0, proc. 1, proc. 2 and proc. 3 interleave to tile the file]
Access pattern 4
Each process creates a derived datatype to describe the non-contiguous access pattern
We thus have a file view and collective access
Single collective, non-contiguous request
MPI_Type_create_subarray (… , &subarray, …)
MPI_Type_commit (&subarray)
MPI_File_open(… , “filename”, … , &fh)
MPI_File_set_view (fh, … , subarray, …)
MPI_File_read_all (fh, local_array, …)
MPI_File_close (&fh)
Creates and commits datatype as before
Now changes the process's view of the data in the file using set_view
set_view is collective
Reads are now collective
These access patterns express four different styles of parallel I/O
Choose your access pattern depending on the application
The larger the I/O request, the better the performance
Collective operations are going to do better than individual reads
Pattern 4 therefore offers (potentially) the best performance
Access patterns
I/O optimization: Data Sieving
Data sieving is used to combine lots of small accesses into a single larger one
Remote file systems (parallel or not) tend to have high latencies
Reducing the number of operations is therefore important
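The idea can be illustrated without MPI at all. A minimal sketch in plain stdio, with names of my own choosing: rather than issuing n small reads of blk bytes each at a fixed stride, read the whole covering region once and copy ("sieve") the wanted pieces out of the buffer.

```c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* Data-sieving read sketch: fetch n blocks of blk bytes, spaced
 * stride bytes apart starting at offset, into out (packed), using a
 * single large read instead of n small ones. Returns 0 on success. */
int sieve_read(FILE *f, long offset, size_t blk, size_t stride,
               size_t n, unsigned char *out)
{
    size_t span = (n - 1) * stride + blk;   /* whole region covered */
    unsigned char *tmp = malloc(span);
    if (!tmp) return -1;
    /* One large request replaces n small ones. */
    if (fseek(f, offset, SEEK_SET) != 0 ||
        fread(tmp, 1, span, f) != span) { free(tmp); return -1; }
    for (size_t i = 0; i < n; i++)          /* sieve out wanted bytes */
        memcpy(out + i * blk, tmp + i * stride, blk);
    free(tmp);
    return 0;
}
```

The cost is reading the unwanted "gap" bytes too; data sieving wins when the latency of many small requests outweighs the extra bandwidth.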
I/O optimization: Data Sieving Writes
Using data sieving for writes is more complicated
Must read the entire region first
Then make our changes
Then write the block back
Requires locking in the file system
Can result in false sharing
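The read-modify-write cycle described above can be sketched in the same plain-stdio style (names my own; a real file system would also need the region locked between the read and the write).

```c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* Data-sieving write sketch: overlay n blocks of blk bytes, spaced
 * stride bytes apart starting at offset, from `in` (packed), onto a
 * file opened in "r+b" mode. Returns 0 on success. */
int sieve_write(FILE *f, long offset, size_t blk, size_t stride,
                size_t n, const unsigned char *in)
{
    size_t span = (n - 1) * stride + blk;
    unsigned char *tmp = malloc(span);
    if (!tmp) return -1;
    /* Read-modify-write: fetch the entire region first ... */
    if (fseek(f, offset, SEEK_SET) != 0 ||
        fread(tmp, 1, span, f) != span) { free(tmp); return -1; }
    for (size_t i = 0; i < n; i++)      /* ... overlay our pieces ... */
        memcpy(tmp + i * stride, in + i * blk, blk);
    /* ... then write the whole region back in one operation. */
    if (fseek(f, offset, SEEK_SET) != 0 ||
        fwrite(tmp, 1, span, f) != span) { free(tmp); return -1; }
    free(tmp);
    return 0;
}
```

The gap bytes pass through unchanged, which is exactly why concurrent writers to the same region cause false sharing without locking.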
I/O optimization: Two-Phase Collective I/O
Problems with independent, noncontiguous access:
Lots of small accesses
Independent data sieving reads lots of extra data
Idea: reorganize access to match the layout on disk
In the first phase, single processes use data sieving to get data for many
This often reduces total I/O through sharing of common blocks
The second phase moves data to its final destinations
I/O optimization: Collective I/O
Collective I/O is coordinated access to storage by a group of processes
Collective I/O functions must be called by all processes participating in I/O
Allows I/O layers to know more about access as a whole