Collective Buffering: Improving Parallel I/O Performance By Bill Nitzberg and Virginia Lo.
Outline
• Introduction
• Concepts
• Collective parallel I/O algorithms
• Collective buffering experiments
• Conclusion
• Question
Introduction
• Existing parallel I/O systems evolved directly from I/O systems for serial machines
• Serial I/O systems are heavily tuned for:
– Sequential, large accesses with limited file sharing between processes
– A high degree of both spatial and temporal locality
Introduction (cont.)
• This paper presents a set of algorithms known as Collective Buffering algorithms
• These algorithms seek to improve I/O performance on distributed-memory machines by using global knowledge of the I/O operations
Concepts
• Global data structure
– The global data structure is the logical view of the data from the application's point of view
– Scientific applications generally use global data structures consisting of arrays distributed in one, two, or three dimensions
Concepts (cont.)
• Data distribution
– The global data structure is distributed among node memories by cutting it into data chunks
– The HPF BLOCK distribution partitions the global data structure into P equally sized pieces
– The HPF CYCLIC distribution divides the global data structure into small pieces (sized by the distribution size, or block size) and deals these pieces out to the P nodes in round-robin fashion
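The two distributions can be sketched as owner functions on a one-dimensional index space. This is a minimal illustration, not code from the paper: the function names `block_owner` and `cyclic_owner`, the parameter names, and the sizes used are all hypothetical, and the BLOCK case assumes pieces of size ceil(n / p).

```python
def block_owner(i, n, p):
    """Node owning global index i when n elements are split BLOCK-wise over p nodes."""
    chunk = -(-n // p)  # ceiling division: size of each contiguous piece (assumption)
    return i // chunk


def cyclic_owner(i, b, p):
    """Node owning global index i under CYCLIC with block size b over p nodes."""
    return (i // b) % p  # pieces of b elements are dealt out round-robin


n, p = 16, 4  # hypothetical sizes chosen for illustration
block_map = [block_owner(i, n, p) for i in range(n)]
cyclic_map = [cyclic_owner(i, 1, p) for i in range(n)]
cyclic2_map = [cyclic_owner(i, 2, p) for i in range(n)]

print(block_map)    # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
print(cyclic_map)   # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
print(cyclic2_map)  # [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3]
```

With BLOCK each node holds one contiguous piece, while with CYCLIC each node holds many small pieces scattered across the index space.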
Concepts (cont.)
• File layout
– File layout is another form of data distribution
– The file represents a linearization of the global data structure, such as the row-major ordering of a three-dimensional array
– This linearization is called the canonical file
– The file is distributed among the I/O nodes
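A row-major linearization can be sketched as follows. The helper names, the array shape, and the striping parameters are hypothetical; in particular, the round-robin striping in `io_node_for_offset` is only an assumed example of how a file might be spread over I/O nodes, not the layout used by any specific file system.

```python
def row_major_offset(index, shape):
    """Offset of a multi-dimensional index in the row-major (canonical) file."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i  # the last dimension varies fastest
    return offset


def io_node_for_offset(offset, stripe_elems, num_io_nodes):
    """Assumed round-robin striping: stripe k of the file lives on I/O node k mod N."""
    return (offset // stripe_elems) % num_io_nodes


shape = (2, 3, 4)  # hypothetical three-dimensional array
print(row_major_offset((0, 0, 1), shape))  # 1
print(row_major_offset((0, 1, 0), shape))  # 4
print(row_major_offset((1, 0, 0), shape))  # 12
print(row_major_offset((1, 2, 3), shape))  # 23

# With 4-element stripes over 3 I/O nodes, offset 12 falls in stripe 3, on node 0.
print(io_node_for_offset(12, stripe_elems=4, num_io_nodes=3))  # 0
```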
Collective parallel I/O algorithms
• Naïve algorithm
– The naïve algorithm treats parallel I/O the same as workstation I/O
– The order of writes depends on the data layout in each node's memory, which has no relation to the layout of data on disk
– The unit of data transferred in each I/O operation is the data block: the smallest unit of local data that is contiguous with respect to the canonical file
Collective parallel I/O algorithms (cont.)
• Naïve algorithm (cont.)
– The data block is very small, and its size is unrelated to the size of a file block, because of the disparity between the data distribution and the file layout parameters
– The overall effects are:
  • The network is flooded with many small messages
  • Messages arrive at the I/O nodes in an uncoordinated fashion, resulting in highly inefficient disk writes
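To see why data blocks end up small, the sketch below counts the contiguous runs each node owns in the canonical file. Everything here is an illustrative assumption: a small 4 x 8 array stored in row-major order and distributed BLOCK-wise along its columns over 4 nodes, with hypothetical helper names `data_blocks` and `owned_offsets`.

```python
def data_blocks(offsets):
    """Group file offsets into maximal contiguous runs, returned as (start, length)."""
    blocks = []
    for off in sorted(offsets):
        if blocks and blocks[-1][0] + blocks[-1][1] == off:
            start, length = blocks[-1]
            blocks[-1] = (start, length + 1)  # extend the current run
        else:
            blocks.append((off, 1))  # start a new run
    return blocks


rows, cols, p = 4, 8, 4  # hypothetical array shape and node count
width = cols // p        # columns held by each node


def owned_offsets(node):
    """Row-major file offsets of the elements held by one node."""
    first = node * width
    return [r * cols + c for r in range(rows) for c in range(first, first + width)]


node0_blocks = data_blocks(owned_offsets(0))
naive_ops = sum(len(data_blocks(owned_offsets(node))) for node in range(p))

print(node0_blocks)  # [(0, 2), (8, 2), (16, 2), (24, 2)]
print(naive_ops)     # 16
```

In this toy case the 32-element file is written with 16 operations of 2 elements each, because every row of every node's local data forms a separate data block.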
Collective parallel I/O algorithms (cont.)
• Collective buffering algorithm
– This method rearranges the data on the compute nodes before any I/O operations are issued, to minimize the number of disk operations
– The permutation can be performed "in place", where the compute nodes transpose the data among themselves
– It can also be performed "on auxiliary nodes", where the compute nodes transpose the data by sending it to a set of auxiliary buffering nodes
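The permutation idea can be sketched with a small in-memory simulation. This is an assumption-laden toy, not the paper's implementation: a 4 x 8 row-major array distributed BLOCK-wise by columns over 4 nodes, an intermediate distribution that gives each buffer one contiguous range of the file, and hypothetical helpers `runs` and `permute_to_block`. The buffers stand in for the compute nodes themselves in the "in place" variant; for the auxiliary-node variant only the location of the buffers would differ.

```python
def runs(offsets):
    """Group file offsets into maximal contiguous runs, returned as (start, length)."""
    out = []
    for off in sorted(offsets):
        if out and out[-1][0] + out[-1][1] == off:
            out[-1] = (out[-1][0], out[-1][1] + 1)
        else:
            out.append((off, 1))
    return out


def permute_to_block(local, total, p):
    """Exchange phase: route every element to the buffer owning its file range."""
    chunk = -(-total // p)  # contiguous file range per buffer (ceiling division)
    buffers = [dict() for _ in range(p)]
    for node_data in local:
        for off, val in node_data.items():
            buffers[off // chunk][off] = val
    return buffers


rows, cols, p = 4, 8, 4  # hypothetical sizes
total, width = rows * cols, cols // p

# Each node holds a column block; the stored value is just the file offset.
local = [
    {r * cols + c: r * cols + c
     for r in range(rows) for c in range(n * width, (n + 1) * width)}
    for n in range(p)
]

naive_writes = sum(len(runs(d)) for d in local)
buffers = permute_to_block(local, total, p)
collective_writes = sum(len(runs(b)) for b in buffers)

# Write phase: each buffer issues one write per contiguous run.
file_image = [None] * total
for buf in buffers:
    for start, length in runs(buf):
        file_image[start:start + length] = [buf[o] for o in range(start, start + length)]

print(naive_writes, collective_writes)  # 16 4
print(file_image == list(range(total)))  # True
```

After the exchange each buffer holds a single contiguous range, so the number of write operations drops from 16 to 4 in this example while the resulting file is unchanged.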
Collective parallel I/O algorithms (cont.)
• Four techniques are developed and evaluated:
1. All compute nodes are used to permute the data to a simple HPF BLOCK intermediate distribution in a single step
2. The second technique refines the first by realistically limiting the amount of buffer space and using an intermediate distribution that matches the file layout
Collective parallel I/O algorithms (cont.)
• Four techniques (cont.):
3. The third technique uses an HPF CYCLIC intermediate distribution
4. The fourth technique uses scatter/gather hardware to eliminate the latency-dominated overhead of the permutation phase
Collective buffering experiments
• Experiment systems:
– The Paragon consists of 224 processing nodes connected in a 16x32 mesh
– Applications space-share 208 compute nodes with 32 MB of memory each
– There are nine I/O nodes, each with one SCSI-1 RAID-3 disk array consisting of 5 disks of 2 gigabytes each
– The parallel file system, PFS, is configured to use 6 of the 9 I/O nodes
Collective buffering experiments (cont.)
• Experiment systems (cont.):
– The SP2 consists of 160 nodes. Each node is an IBM RS6000/590 with 128 MB of memory and a SCSI-1 attached 2 GB disk
– The parallel file system, the IBM AIX Parallel I/O File System (PIOFS), is configured with 8 I/O nodes (semi-dedicated servers) and 150 compute nodes
Conclusion
• Collective buffering improves on naïve parallel I/O performance by two orders of magnitude for small data block sizes
• Peak performance can be obtained with minimal buffer space (approximately 1 megabyte per I/O node)
• Performance depends on the intermediate distribution (by up to a factor of 2)
Conclusion (cont.)
• There is no single intermediate distribution which provides the best performance for all cases, but a few come close
• Collective buffering with scatter/gather can potentially deliver peak performance for all data block sizes.
Question
• What are the advantages and disadvantages of the naïve algorithm?
• What is collective buffering, and how can this technique improve parallel I/O performance?