Collective Buffering: Improving Parallel I/O Performance

By

Bill Nitzberg and Virginia Lo

Outline

• Introduction

• Concepts

• Collective parallel I/O algorithms

• Collective buffering experiments

• Conclusion

• Questions

Introduction

• Existing parallel I/O systems evolved directly from I/O systems for serial machines

• Serial I/O systems are heavily tuned for:
– Sequential, large accesses, and limited file sharing between processes
– A high degree of both spatial and temporal locality

Introduction (cont.)

• This paper presents a set of algorithms known as Collective Buffering algorithms

• These algorithms seek to improve I/O performance on distributed-memory machines by utilizing global knowledge of the I/O operations

Concepts

• Global data structure
– The global data structure is the logical view of the data from the application’s point of view
– Scientific applications generally use global data structures consisting of arrays distributed in one, two, or three dimensions

Concepts (cont.)

• Data distribution
– The global data structure is distributed among node memories by cutting it into data chunks
– The HPF BLOCK distribution partitions the global data structure into P equally sized pieces
– The HPF CYCLIC distribution divides the global data structure into small pieces (of a given distribution or block size) and deals these pieces out to the P nodes in a round-robin fashion (see the sketch below)
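As a concrete illustration (not taken from the paper), the following C sketch computes which of P nodes owns element i of a one-dimensional global array of N elements under the BLOCK and CYCLIC(b) distributions; all names and parameters are illustrative.

/* Sketch (illustrative): owner of element i of a 1-D global array of
   N elements distributed over P nodes. */
#include <stdio.h>

/* HPF BLOCK: P equally sized contiguous pieces (the last may be shorter) */
int block_owner(long i, long N, int P) {
    long piece = (N + P - 1) / P;           /* ceiling(N / P) elements per node */
    return (int)(i / piece);
}

/* HPF CYCLIC(b): deal pieces of b elements to the P nodes round-robin */
int cyclic_owner(long i, long b, int P) {
    return (int)((i / b) % P);
}

int main(void) {
    for (long i = 0; i < 16; i++)
        printf("i=%2ld  BLOCK owner=%d  CYCLIC(2) owner=%d\n",
               i, block_owner(i, 16, 4), cyclic_owner(i, 2, 4));
    return 0;
}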

Concepts (cont.)

• File layout
– File layout is another form of data distribution
– The file represents a linearization of the global data structure, such as the row-major ordering of a three-dimensional array
– This linearization is called the canonical file
– The file is distributed among the I/O nodes (see the sketch below)
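To make the canonical file concrete, here is a small C sketch (assumed for illustration, not from the paper) that computes the byte offset of element (i, j, k) of a row-major NX x NY x NZ array in the canonical file, and the I/O node that holds that byte when the file is striped round-robin across several I/O nodes; the stripe size and node count are made up.

/* Sketch: canonical (row-major) file offset of element (i, j, k) of an
   NX x NY x NZ array, and the I/O node holding that byte when the file
   is striped round-robin in stripe-byte units across ionodes servers. */
#include <stdio.h>

long canonical_offset(long i, long j, long k,
                      long NY, long NZ, long elem_bytes) {
    return ((i * NY + j) * NZ + k) * elem_bytes;
}

int io_node_of(long offset, long stripe, int ionodes) {
    return (int)((offset / stripe) % ionodes);
}

int main(void) {
    long off = canonical_offset(3, 2, 1, 64, 64, 8);    /* 8-byte elements */
    printf("offset = %ld bytes, I/O node = %d\n",
           off, io_node_of(off, 65536, 6));
    return 0;
}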

Collective parallel I/O algorithms

• Naïve algorithm
– The Naïve algorithm treats parallel I/O the same as workstation I/O
– The order of writes depends on the data layout in each node’s memory, which has no relation to the layout of data on the disks
– The unit of data transferred in each I/O operation is the data block: the smallest unit of local data that is contiguous with respect to the canonical file

Collective parallel I/O algorithms (cont.)

• Naïve algorithm (cont.)
– The size of the data block is very small and is unrelated to the size of a file block, because of the disparity between the data distribution and the file layout parameters
– The overall effects are:
• The network is flooded with many small messages
• Messages arrive at the I/O nodes in an uncoordinated fashion, resulting in highly inefficient disk writes (this write pattern is sketched below)
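The resulting access pattern can be sketched as follows (an illustration using POSIX pwrite, not the paper's code): each compute node issues one small, independent request per locally contiguous data block, with the request size fixed by the memory distribution rather than the file layout. The names are illustrative.

/* Sketch of the Naïve strategy: one small, uncoordinated write per data
   block.  file_offset[b] is the canonical-file offset of local block b. */
#include <sys/types.h>
#include <unistd.h>

void naive_write(int fd, const char *local, long nblocks,
                 size_t block_bytes, const off_t *file_offset) {
    for (long b = 0; b < nblocks; b++) {
        /* many small requests flood the network and arrive at the
           I/O nodes in an arbitrary order */
        pwrite(fd, local + b * block_bytes, block_bytes, file_offset[b]);
    }
}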

Collective parallel I/O algorithms (cont.)

• Collective buffering algorithm
– This method rearranges the data on the compute nodes prior to issuing I/O operations, in order to minimize the number of disk operations
– The permutation can be performed “in place”, where the compute nodes transpose the data among themselves (a sketch follows below)
– It can also be performed “on auxiliary nodes”, where the compute nodes transpose the data by sending it to a set of auxiliary buffering nodes
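A minimal sketch of the “in place” variant, assuming MPI message passing and exactly one file-contiguous chunk per node; the function name, parameters, and precomputed counts/displacements are illustrative, not the paper's interface.

/* Sketch of "in place" collective buffering: phase 1 permutes the data
   among the compute nodes so that each node's buffer holds a region that
   is contiguous in the canonical file; phase 2 issues one large write
   per node.  The counts/displacements describing the permutation are
   assumed to be precomputed from the data distribution and file layout. */
#include <mpi.h>
#include <sys/types.h>
#include <unistd.h>

void collective_write(int fd, const char *local, char *buffer,
                      size_t chunk_bytes, int rank,
                      const int *sendcounts, const int *sdispls,
                      const int *recvcounts, const int *rdispls) {
    /* Phase 1: permutation -- the many small messages stay on the
       compute-node interconnect instead of hitting the I/O nodes */
    MPI_Alltoallv(local, sendcounts, sdispls, MPI_BYTE,
                  buffer, recvcounts, rdispls, MPI_BYTE, MPI_COMM_WORLD);

    /* Phase 2: one large, file-aligned write per node */
    pwrite(fd, buffer, chunk_bytes, (off_t)rank * chunk_bytes);
}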

Collective parallel I/O algorithms (cont.)

• Four techniques are developed and evaluated:
1 – All compute nodes are used to permute the data to a simple HPF BLOCK intermediate distribution in a single step
2 – The second technique refines the first by realistically limiting the amount of buffer space and using a distribution that matches the file layout (a round-by-round version of this idea is sketched below)
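Technique 2's limited buffer space can be pictured as a round-by-round loop (an illustrative C sketch with assumed names; the exchange step is elided and would be an all-to-all permutation as in the earlier sketch).

/* Sketch of a buffer-limited collective write: with only buf_bytes of
   intermediate buffer per node, the permutation and the writes proceed
   in rounds.  pack_round() stands for the elided exchange that fills the
   buffer with this node's next file-aligned piece.  The final, possibly
   partial round is ignored for brevity. */
#include <sys/types.h>
#include <unistd.h>

void buffered_rounds(int fd, char *buffer, size_t buf_bytes,
                     long total_bytes, int rank, int nprocs,
                     void (*pack_round)(long round, char *buf)) {
    long per_round = (long)buf_bytes * nprocs;    /* bytes written per round, all nodes */
    long rounds = (total_bytes + per_round - 1) / per_round;
    for (long r = 0; r < rounds; r++) {
        pack_round(r, buffer);                    /* permute this round's data */
        off_t off = (off_t)r * per_round + (off_t)rank * buf_bytes;
        pwrite(fd, buffer, buf_bytes, off);       /* one large write per node per round */
    }
}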

Collective parallel I/O algorithms (cont.)

• Four techniques (cont.):
3 – The third technique uses an HPF CYCLIC intermediate distribution
4 – The fourth technique uses scatter/gather hardware to eliminate the latency-dominated overhead of the permutation phase

Collective buffering experiments

• Experiment systems:
– The Paragon consists of 224 processing nodes connected in a 16x32 mesh
– Applications space-share 208 compute nodes with 32 MB of memory each
– Nine I/O nodes each have one SCSI-1 RAID-3 disk array consisting of 5 disks of 2 gigabytes each
– The parallel file system, PFS, is configured to use 6 of the 9 I/O nodes

Collective buffering experiments (cont.)

• Experiment systems (cont.):
– The SP2 consists of 160 nodes; each node is an IBM RS6000/590 with 128 MB of memory and a SCSI-1-attached 2 GB disk
– The parallel file system, the IBM AIX Parallel I/O File System (PIOFS), is configured with 8 I/O nodes (semi-dedicated servers) and 150 compute nodes

Conclusion

• Collective buffering significantly improves Naïve parallel I/O performance by two orders of magnitude for small data block sizes

• Peak performance can be obtained with minimal buffer space (approximately 1 megabyte per I/O node)

• Performance is dependent on the intermediate distribution (by up to a factor of 2)

Conclusion (cont.)

• There is no single intermediate distribution which provides the best performance for all cases, but a few come close

• Collective buffering with scatter/gather can potentially deliver peak performance for all data block sizes.

Questions

• What are the advantages and disadvantages of the Naïve algorithm?

• What is Collective Buffering, and how may this technique improve parallel I/O performance?