Collective Buffering: Improving Parallel I/O Performance By Bill Nitzberg and Virginia Lo.

Collective Buffering: Improving Parallel I/O Performance

By

Bill Nitzberg and Virginia Lo

Outline

• Introduction

• Concepts

• Collective parallel I/O algorithms

• Collective buffering experiments

• Conclusion

• Questions

Introduction

• Existing parallel I/O systems evolved directly from I/O systems for serial machines

• Serial I/O systems are heavily tuned for:

– Sequential, large accesses with limited file sharing between processes

– A high degree of both spatial and temporal locality

Introduction (cont.)

• This paper presents a set of algorithms known as Collective Buffering algorithms

• These algorithms seek to improve I/O performance on distributed-memory machines by using global knowledge of the I/O operations

Concepts

• Global data structure

– The global data structure is the logical view of the data from the application’s point of view

– Scientific applications generally use global data structures consisting of arrays distributed in one, two, or three dimensions

Concepts (cont.)

• Data distribution

– The global data structure is distributed among node memories by cutting it into data chunks

– The HPF BLOCK distribution partitions the global data structure into P equally sized pieces

– The HPF CYCLIC distribution divides the global data structure into small pieces (set by the distribution size, or block size) and deals these pieces out to the P nodes in round-robin fashion
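
The two distributions can be illustrated with a minimal Python sketch for a one-dimensional array. The function names `block_owner` and `cyclic_owner` and the sizes used are illustrative assumptions, not taken from the slides.

```python
def block_owner(i, n, p):
    """Node owning global index i when n elements are split BLOCK-wise over p nodes."""
    chunk = -(-n // p)  # ceiling division: size of each node's contiguous piece
    return i // chunk


def cyclic_owner(i, p, block_size=1):
    """Node owning global index i under CYCLIC(block_size) over p nodes."""
    # Pieces of block_size elements are dealt out to the nodes round-robin.
    return (i // block_size) % p


n, p = 16, 4  # hypothetical array length and node count
block_map = [block_owner(i, n, p) for i in range(n)]
cyclic_map = [cyclic_owner(i, p) for i in range(n)]
cyclic2_map = [cyclic_owner(i, p, block_size=2) for i in range(n)]
```

With BLOCK each node holds one contiguous run of the array; with CYCLIC each node holds many small pieces scattered across it.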

Concepts (cont.)

• File layout

– File layout is another form of data distribution

– The file represents a linearization of the global data structures, such as the row-major ordering of a three-dimensional array

– This linearization is called the canonical file

– The file is distributed among the I/O nodes
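
As a minimal sketch of the row-major linearization, the following Python computes the canonical-file offset of an array element. The second function assumes round-robin striping with a fixed stripe unit; the slides say only that the file is distributed among I/O nodes, so `io_node_for` and its parameters are hypothetical.

```python
def canonical_offset(index, shape):
    """Row-major element offset of a multi-dimensional index in the canonical file."""
    offset = 0
    for i, extent in zip(index, shape):
        offset = offset * extent + i
    return offset


def io_node_for(offset, stripe_unit, num_io_nodes):
    """ASSUMPTION: round-robin striping; returns the I/O node storing this offset."""
    return (offset // stripe_unit) % num_io_nodes


shape = (4, 3, 2)  # hypothetical three-dimensional array
first = canonical_offset((0, 0, 0), shape)
last = canonical_offset((3, 2, 1), shape)
example = canonical_offset((1, 2, 1), shape)
example_node = io_node_for(example, stripe_unit=4, num_io_nodes=3)
```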

Collective parallel I/O algorithms

• Naïve algorithm

– The naïve algorithm treats parallel I/O the same as workstation I/O

– The order of writes depends on the data layout in each node’s memory, which has no relation to the layout of data on disk

– The unit of data transferred in each I/O operation is the data block: the smallest unit of local data that is contiguous with respect to the canonical file

Collective parallel I/O algorithms (cont.)

• Naïve algorithm (cont.)

– The data block is very small, and its size is unrelated to the size of a file block because of the disparity between data distribution and file layout parameters

– The overall effects are:

• The network is flooded with many small messages

• Messages arrive at I/O nodes in an uncoordinated fashion, resulting in highly inefficient disk writes
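
To see why the naïve algorithm produces many small operations, the sketch below enumerates each node's data blocks for a one-dimensional array and counts one write per data block. The helper `data_blocks` and the sizes are illustrative assumptions; a real run would involve far larger arrays.

```python
def data_blocks(owners, node):
    """Maximal runs of canonical-file offsets owned by `node`, as (start, length) pairs."""
    blocks = []
    start = None
    for offset, owner in enumerate(owners):
        if owner == node and start is None:
            start = offset  # a new run begins
        elif owner != node and start is not None:
            blocks.append((start, offset - start))  # the run just ended
            start = None
    if start is not None:
        blocks.append((start, len(owners) - start))
    return blocks


n, p = 16, 4  # hypothetical array length and node count
cyclic_owners = [i % p for i in range(n)]        # CYCLIC, block size 1
block_owners = [i // (n // p) for i in range(n)]  # BLOCK

# The naive algorithm issues one I/O operation per data block.
naive_writes_cyclic = sum(len(data_blocks(cyclic_owners, k)) for k in range(p))
naive_writes_block = sum(len(data_blocks(block_owners, k)) for k in range(p))
node1_blocks = data_blocks(cyclic_owners, 1)
```

Under CYCLIC every element is its own data block, so the number of writes equals the number of elements; under BLOCK each node needs only one write.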

Collective parallel I/O algorithms (cont.)

• Collective buffering algorithm

– This method rearranges the data on the compute nodes before I/O operations are issued, to minimize the number of disk operations

– The permutation can be performed “in place”, where the compute nodes transpose the data among themselves

– It can also be performed “on auxiliary nodes”, where the compute nodes transpose the data by sending it to a set of auxiliary buffering nodes
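
The two phases can be simulated in a few lines of Python. This is a sketch under stated assumptions, not the authors' implementation: the data starts in a CYCLIC distribution, the permutation targets contiguous per-node file ranges, and dictionaries stand in for message passing. The name `collective_write` is hypothetical.

```python
def collective_write(local_data, n, p):
    """Simulate an "in place" collective-buffered write.

    local_data[node] maps canonical offset -> value for the elements that node holds.
    Returns (file_contents, number_of_write_operations).
    """
    chunk = -(-n // p)  # contiguous file range handled by each node after permutation
    # Phase 1 (permutation): each element is forwarded to the node responsible
    # for the file range that contains its canonical offset.
    buffers = [dict() for _ in range(p)]
    for node_data in local_data:
        for offset, value in node_data.items():
            buffers[offset // chunk][offset] = value
    # Phase 2 (I/O): each node writes its buffer as a single contiguous request.
    file_contents = [None] * n
    writes = 0
    for node, buf in enumerate(buffers):
        if not buf:
            continue
        start = node * chunk
        for offset in range(start, start + len(buf)):
            file_contents[offset] = buf[offset]
        writes += 1
    return file_contents, writes


n, p = 16, 4  # hypothetical array length and node count
# CYCLIC starting distribution: node k holds offsets k, k+p, k+2p, ...
local_data = [{i: i * 10 for i in range(k, n, p)} for k in range(p)]
contents, writes = collective_write(local_data, n, p)
```

The same data written naïvely from the CYCLIC distribution would take one write per element; after the permutation it takes one write per node.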

Collective parallel I/O algorithms (cont.)

• Four techniques are developed and evaluated:

1 – All compute nodes are used to permute the data to a simple HPF BLOCK intermediate distribution in a single step

2 – The second technique refines the first by realistically limiting the amount of buffer space and using a distribution that matches the file layout

Collective parallel I/O algorithms (cont.)

• Four techniques (cont.):

3 – The third technique uses an HPF CYCLIC intermediate distribution

4 – The fourth technique uses scatter/gather hardware to eliminate the latency-dominated overhead of the permutation phase
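
One way to picture limited buffer space is to write the file in rounds, with each buffering node filling and flushing a fixed-size buffer per round. The sketch below is an assumption made for illustration: the round structure, the round-robin assignment of file ranges to nodes, and the name `buffered_rounds` are not taken from the slides.

```python
def buffered_rounds(file_size, buffer_per_node, num_nodes):
    """Yield (node, start, length) write requests, round by round, with bounded buffers."""
    window = buffer_per_node * num_nodes  # file range covered by one round
    for round_start in range(0, file_size, window):
        for node in range(num_nodes):
            start = round_start + node * buffer_per_node
            if start >= file_size:
                break  # the file ended partway through this round
            yield node, start, min(buffer_per_node, file_size - start)


# Hypothetical sizes, in arbitrary units.
requests = list(buffered_rounds(file_size=20, buffer_per_node=4, num_nodes=2))
```

Each request is contiguous in the file and no larger than the buffer, so the memory needed per node stays fixed regardless of file size.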

Collective buffering experiments

• Experimental systems:

– The Paragon consists of 224 processing nodes connected in a 16x32 mesh

– Applications space-share 208 compute nodes with 32 MB of memory each

– Nine I/O nodes, each with one SCSI-1 RAID-3 disk array consisting of 5 disks of 2 gigabytes each

– The parallel file system, PFS, is configured to use 6 of the 9 I/O nodes

Collective buffering experiments (cont.)

• Experimental systems (cont.):

– The SP2 consists of 160 nodes. Each node is an IBM RS6000/590 with 128 MB of memory and a SCSI-1 attached 2 GB disk

– The parallel file system, IBM AIX Parallel I/O File System (PIOFS), is configured with 8 I/O nodes (semi-dedicated servers) and 150 compute nodes

Conclusion

• Collective buffering improves on naïve parallel I/O performance by two orders of magnitude for small data block sizes

• Peak performance can be obtained with minimal buffer space (approximately 1 megabyte per I/O node)

• Performance depends on the intermediate distribution (by up to a factor of 2)

Conclusion (cont.)

• There is no single intermediate distribution which provides the best performance for all cases, but a few come close

• Collective buffering with scatter/gather can potentially deliver peak performance for all data block sizes.

Questions

• What are the advantages and disadvantages of the naïve algorithm?

• What is collective buffering, and how can this technique improve parallel I/O performance?