Design and Evaluation of Non-Blocking Collective I/O Operations
Vishwanath Venkatesan1, Edgar Gabriel1
1 Parallel Software Technologies Laboratory, Department of Computer Science, University of Houston
<venkates, gabriel>@cs.uh.edu
2/17/12
Outline
• I/O Challenge in HPC
• MPI File I/O
• Non-blocking Collective Operations
• Non-blocking Collective I/O Operations
• Experimental Results
• Conclusions
I/O Challenge in HPC
• A 2005 paper from LLNL [1] states:
– Applications on leadership-class machines require 1 GB/s of I/O bandwidth per teraflop of computing capability
• Jaguar at ORNL (fastest in 2008):
– Over 250 teraflops peak compute performance with a peak I/O performance of 72 GB/s [3]
• K computer, fastest supercomputer (2011):
– Nearly 10 petaflops peak compute performance with a realized I/O bandwidth of 96 GB/s [2]
[1] Richard Hedges, Bill Loewe, T. McLarty, and Chris Morrone. Parallel File System Testing for the Lunatic Fringe: the care and feeding of restless I/O Power Users. In Proceedings of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies, 2005.
[2] Shinji Sumimoto. An Overview of Fujitsu's Lustre Based File System. Technical report, Fujitsu, 2011.
[3] M. Fahey, J. Larkin, and J. Adams. I/O performance on a massively parallel Cray XT3/XT4. In Parallel and Distributed Processing.
MPI File I/O
• MPI has been the de-facto standard for parallel programming over the last decade
• MPI I/O:
– File view: the portion of a file visible to a process
– Individual and collective I/O operations
– Example to illustrate the advantage of collective I/O:
• 4 processes accessing a 2D matrix stored in row-major format
• MPI I/O can detect this access pattern and issue one large I/O request followed by a distribution step for the data among the processes
Non-blocking Collective Operations
• Non-blocking point-to-point operations:
– Asynchronous data transfer operations
– Hide communication latency by overlapping with computation
– Demonstrated benefits for a number of applications [1]
• Non-blocking collective communication operations were implemented using LibNBC [2]:
– Schedule-based design: a process-local schedule of point-to-point operations is created
– Schedule execution is represented as a state machine (with dependencies)
– State and schedule are attached to every request
• Non-blocking collective communication operations have been voted into the upcoming MPI-3 specification [2]
• Non-blocking collective I/O operations have not (yet) been added to the document
[1] D. Buettner, J. Kunkel, and T. Ludwig. Using Non-blocking I/O Operations in High Performance Computing to Reduce Execution Times. In Proceedings of the 16th European PVM/MPI Users' Group Meeting, 2009.
[2] T. Hoefler, A. Lumsdaine, and W. Rehm. Implementation and Performance Analysis of Non-Blocking Collective Operations for MPI. In Supercomputing 2007.
Non-blocking Collective I/O Operations
MPI_File_iwrite_all(MPI_File file, void *buf, int cnt, MPI_Datatype dt, MPI_Request *request)
• Different from non-blocking collective communication operations:
– Every process is allowed to provide a different amount of data per collective read/write operation
– No process has a 'global' view of how much data is read/written
• Create a schedule for a non-blocking allgather(v):
– Determine the overall amount of data written across all processes
– Determine the offsets for each data item within each group
• Upon completion:
– Create a new schedule for the shuffle and I/O steps
– A schedule can consist of multiple cycles
Experimental Evaluation
• Crill cluster at the University of Houston:
– Distributed PVFS2 file system with 16 I/O servers
– 4x SDR InfiniBand message-passing network (2 ports per node)
– Gigabit Ethernet I/O network
– 18 nodes, 864 compute cores
• LibNBC integrated with the Open MPI trunk, rev. 24640
• Focus on collective write operations
Latency I/O Overlap Tests
• Overlap a non-blocking collective I/O operation with an equally expensive compute operation
– Best case: overall time = max(I/O time, compute time)
• Strong dependence on the ability to make progress
– Best case: time between subsequent calls to NBC_Test = time to execute one cycle of the collective I/O
No. of processes | I/O time   | Time spent in computation | Overall time
64               | 85.69 sec  | 85.69 sec                 | 85.80 sec
128              | 205.39 sec | 205.39 sec                | 205.91 sec
Parallel Image Segmentation Application
• Used to assist in diagnosing thyroid cancer
• Based on microscopic images obtained through Fine Needle Aspiration (FNA) [1]
• Executes convolution operations for different filters and writes the resulting data
• Code modified to overlap the write of iteration i with the computations of iteration i+1
• Two code versions generated:
– NBC: additional calls to the progress engine added between different code blocks
– NBC w/FFTW: FFTW modified to insert further calls to the progress engine
[1] Edgar Gabriel, Vishwanath Venkatesan, and Shishir Shah. Towards High Performance Cell Segmentation in Multispectral Fine Needle Aspiration Cytology of Thyroid Lesions. Computer Methods and Programs in Biomedicine, 2009.
Application Results
• 8192 x 8192 pixels, 21 spectral channels
• 1.3 GB input data, ~3 GB output data
• 32 aggregators with a 4 MB cycle buffer size
Conclusions
• Specification of non-blocking collective I/O operations is straightforward
• Implementation is challenging, but doable
• Results show a strong dependence on the ability to make progress:
– (Nearly) perfect overlap for the micro-benchmark
– Mostly good results in the application scenario
• Non-blocking collective I/O is up for a first vote in the MPI Forum