Design and Evaluation of Non-Blocking Collective I/O Operations
Vishwanath Venkatesan1, Edgar Gabriel1
1 Parallel Software Technologies Laboratory, Department of Computer Science, University of Houston
<venkates, gabriel>@cs.uh.edu
2/17/12
Outline
• I/O Challenge in HPC
• MPI File I/O
• Non-blocking Collective Operations
• Non-blocking Collective I/O Operations
• Experimental Results
• Conclusions
I/O Challenge in HPC
• A 2005 paper from LLNL [1] states:
– Applications on leadership-class machines require 1 GB/s of I/O bandwidth per teraflop of computing capability
• Jaguar at ORNL (fastest in 2008):
– Over 250 teraflops peak compute performance with a peak I/O performance of 72 GB/s [3]
• K computer, fastest supercomputer (2011):
– Nearly 10 petaflops peak compute performance with a realized I/O bandwidth of 96 GB/s [2]
[1] Richard Hedges, Bill Loewe, T. McLarty, and Chris Morrone. Parallel File System Testing for the Lunatic Fringe: the care and feeding of restless I/O Power Users. In Proceedings of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies, 2005.
[2] Shinji Sumimoto. An Overview of Fujitsu's Lustre Based File System. Technical report, Fujitsu, 2011.
[3] M. Fahey, J. Larkin, and J. Adams. I/O performance on a massively parallel Cray XT3/XT4. In Parallel and Distributed Processing.
MPI File I/O
• MPI has been the de-facto standard for parallel programming over the last decade
• MPI I/O:
– File view: the portion of a file visible to a process
– Individual and collective I/O operations
– Example to illustrate the advantage of collective I/O:
• 4 processes accessing a 2D matrix stored in row-major format
• MPI I/O can detect this access pattern and issue one large I/O request followed by a distribution step for the data among the processes
Non-blocking Collective Operations
• Non-blocking point-to-point operations:
– Asynchronous data transfer operations
– Hide communication latency by overlapping with computation
– Demonstrated benefits for a number of applications [1]
• Non-blocking collective communication operations were implemented using LibNBC [2]:
– Schedule-based design: a process-local schedule of point-to-point operations is created
– Schedule execution is represented as a state machine (with dependencies)
– State and schedule are attached to every request
• Non-blocking collective communication operations have been voted into the upcoming MPI-3 specification [2]
• Non-blocking collective I/O operations have not (yet) been added to the document
[1] D. Buettner, J. Kunkel, and T. Ludwig. Using Non-blocking I/O Operations in High Performance Computing to Reduce Execution Times. In Proceedings of the 16th European PVM/MPI Users' Group Meeting, 2009.
[2] T. Hoefler, A. Lumsdaine, and W. Rehm. Implementation and Performance Analysis of Non-Blocking Collective Operations for MPI. In Supercomputing 2007.
Non-blocking Collective I/O Operations
MPI_File_iwrite_all(MPI_File file, void *buf, int cnt, MPI_Datatype dt, MPI_Request *request)
• Different from non-blocking collective communication operations:
– Every process is allowed to provide a different amount of data per collective read/write operation
– No process has a 'global' view of how much data is read/written
• Create a schedule for a non-blocking allgather(v):
– Determine the overall amount of data written across all processes
– Determine the offsets for each data item within each group
• Upon completion:
– Create a new schedule for the shuffle and I/O steps
– A schedule can consist of multiple cycles
Experimental Evaluation
• Crill cluster at the University of Houston:
– Distributed PVFS2 file system with 16 I/O servers
– 4x SDR InfiniBand message-passing network (2 ports per node)
– Gigabit Ethernet I/O network
– 18 nodes, 864 compute cores
• LibNBC integrated with the Open MPI trunk, rev. 24640
• Focus on collective write operations
Latency I/O Overlap Tests
• Overlap a non-blocking collective I/O operation with an equally expensive compute operation
– Best case: overall time = max(I/O time, compute time)
• Strong dependence on the ability to make progress
– Best case: time between subsequent calls to NBC_Test = time to execute one cycle of the collective I/O
No. of processes | I/O time   | Time spent in computation | Overall time
64               | 85.69 sec  | 85.69 sec                 | 85.80 sec
128              | 205.39 sec | 205.39 sec                | 205.91 sec
Parallel Image Segmentation Application
• Used to assist in diagnosing thyroid cancer
• Based on microscopic images obtained through Fine Needle Aspiration (FNA) [1]
• Executes convolution operations for different filters and writes the resulting data
• Code modified to overlap the write of iteration i with the computations of iteration i+1
• Two code versions generated:
– NBC: additional calls to the progress engine added between different code blocks
– NBC w/FFTW: FFTW modified to insert further calls to the progress engine
[1] Edgar Gabriel, Vishwanath Venkatesan, and Shishir Shah. Towards High Performance Cell Segmentation in Multispectral Fine Needle Aspiration Cytology of Thyroid Lesions. Computer Methods and Programs in Biomedicine, 2009.
Application Results
• 8192 x 8192 pixels, 21 spectral channels
• 1.3 GB input data, ~3 GB output data
• 32 aggregators with a 4 MB cycle buffer size
Conclusions
• Specification of non-blocking collective I/O operations is straightforward
• Implementation is challenging, but doable
• Results show a strong dependence on the ability to make progress:
– (Nearly) perfect overlap for the micro-benchmark
– Mostly good results in the application scenario
• Non-blocking collective I/O is up for a first vote in the MPI Forum