Damaris: How to Efficiently Leverage Multicore Parallelism ... · Damaris 0 2 4 6 8 10 0 10 20 30 )...
Transcript of Damaris: How to Efficiently Leverage Multicore Parallelism ... · Damaris 0 2 4 6 8 10 0 10 20 30 )...
Matthieu Dorier [email protected] KerData Team Inria Rennes, IRISA ENS Cachan
Matthieu Dorier, Gabriel Antoniu, Franck Cappello, Marc Snir, Leigh Orf
Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O
IEEE CLUSTER 2012 – Beijing September 24 - 28
September 25th 2012
Context: HPC simulation on Blue Waters
• INRIA – UIUC - ANL Joint Lab for Petascale Computing
• Targeting large-scale simulation of unprecedented accuracy
• One of our challenges: At very large scale, scalability of HPC simulations is mainly driven by I/O
September 25th 2012 - 2 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
Writing from HPC simulations: File per process vs. Collective I/O
- 3
• Two main approaches for I/O in HPC simulations • Implemented in MPI-I/O, available in HDF5, NetCDF,…
September 25th 2012
• Too many files • High metadata overhead • Hard to read back
• Requires coordination • Data communications
Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
Users attempt to overcome performance issues
- 4
Image credit: Phil Carns (ANL), “Understanding and Improving Computational Science Storage Access through Continuous Characterization”
September 25th 2012 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
File-per-process
Collective
“Hybrid”
Periodic synchronous snapshots lead to I/O bursts and high variability (jitter)
- 5 September 25th 2012
The “cardiogram” of a PVFS data server during a run of the CM1 simulation
Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
Input Output
Not only are performances bad, they’re also unpredictable!
- 6
I/O variability, or “jitter” • Write time unpredictability
- Between two processes - Between two I/O phases - Between simulations
• Due to - Cross-applications interferences - Many processes writing at the same time - Different amounts of data - File system’s configuration - …
September 25th 2012 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
Can we hide the I/O jitter?
Yes: Damaris
- 7 September 25th 2012 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
- 8
DAMARIS “Dedicated Adaptable Middleware for Application Resources Inline Steering”
September 25th 2012 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
- 9
• Next-generation supercomputers have multicore SMP nodes
• Network access contention at the level of a node • Possibility of efficient interactions through shared memory • One core dedicated for gathering data • This core only writes
The Damaris approach: dedicated I/O cores
Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
September 25th 2012
Moving jitter to a dedicated core
- 10 September 25th 2012
Leave a core, go faster!
Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
Damaris at a glance
- 11 September 25th 2012 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
Damaris at a glance
- 12 September 25th 2012 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
Damaris at a glance
- 13 September 25th 2012 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
Damaris at a glance
- 14 September 25th 2012 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
Damaris at a glance
- 15 September 25th 2012 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
Application Programming Interface
- 16 September 25th 2012
• Initializing, finalizing - DC_initialize(“config.xml”,rank), DC_finalize()- DC_mpi_init_and_start(“config.xml”, communicator)
• Writing data - DC_write(“group/variable”,iteration,data)- DC_chunk_write(chunk,“group/variable”,iteration,data)
• Sending events - DC_signal(“event name”,iteration)
• Direct access to shared memory - DC_alloc(“group/variable”,iteration) à void*
- DC_commit(“group/variable”,iteration)
Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
Damaris: current state of the software
- 17
• Version 0.6 available at http://damaris.gforge.inria.fr/ - Along with documentation, tutorials and examples
• Written in C++, uses - Boost for IPC, Xerces-C and XSD for XML parsing
• API for Fortran, C, C++
• Tested on - Debian Linux, Kraken (Cray XT5 - NICS), JaguarPF (Cray XK6 - ORNL), JYC (Blue Waters testing system: Cray XE6 - NCSA), Intrepid (BlueGene/P – ANL)
• Tested with - CM1 (climate), OLAM (climate), GTC (fusion), Nek5000 (CFD)
September 25th 2012 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
- 18
Experimental evaluation
September 25th 2012 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
Running the CM1 simulation on Kraken, G5K and BluePrint with Damaris
- 19
• The CM1 simulation - Atmospheric simulation - One of the Blue Waters target applications - Uses HDF5 (file-per-process) and pHDF5 (for collective I/O)
• Kraken - Cray XT5 at NICS - 12 cores/node - 16 GB/node - Luster file system
September 25th 2012 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
• Grid 5000 - 24 cores/node - 48 GB/node - PVFS file system
• BluePrint - Power5 - 16 cores/node - 64 GB/node - GPFS file system
Damaris achieves almost perfect scalability
- 20
0
2000
4000
6000
8000
10000
0 5000 10000 Sc
alab
ility
fact
or
Number of cores
Perfect scaling
Damaris
File-per-process
Collective-I/O
Weak scalability factor S = N TbaseT
N: number of cores Tbase: time of an iteration on one core w/o write T: time of an iteration + a write
0
200
400
600
800
1000
576 2304 9216
Run
tim
e (s
ec)
Number of cores
September 25th 2012
Kraken Cray XT5 Application run time
(50 iterations + 1 write phase)
Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
- 21
0 100 200 300 400 500 600 700 800 900
576 2304 9216
Tim
e to
writ
e (s
ec)
Number of cores
Collective-I/O File-per-process Damaris
0
2
4
6
8
10
0 10 20 30
Tim
e to
writ
e (s
ec)
Total amount of data (GB)
Kraken Cray XT5 Average and maximum write time
28MB per process
BluePrint Power5, 1024 cores Average, max and min write time
Damaris hides the I/O jitter
September 25th 2012 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
- 22
0.0625 0.125
0.25 0.5
1 2 4 8
16
0 5000 10000
Agg
rega
te th
roug
hput
(G
B/s
)
Number of cores
File-per-process Damaris Collective-I/O
Average aggregate throughput from the writer processes
Damaris increases effective throughput
September 25th 2012
Kraken Cray XT5 Average aggregate throughput
from writers
Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
- 23
0 50
100 150 200 250 300
576 2304 9216
Tim
e (s
ec)
Number of cores
0
50
100
150
200
250
0.05 5.8 15.1 24.7 Ti
me
(sec
) Total amount of data (GB)
Spare time Used time
BluePrint Power5 (1024 cores) Kraken Cray XT5
Time spent by Damaris writing data and time spent waiting
Damaris spares time? Let’s use it!
September 25th 2012
Damaris spares time for data management
Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
- 24 September 25th 2012
Use of the spare time: Adding compression and scheduling
0
5
10
15
20
25
30
35
Grid5000 (912 cores) Kraken (2304 cores)
Writ
e tim
e (s
ec)
No Compression With compression With Scheduling
• With compression enabled: up to 600 percent compression ratio! - (reduction to 16bits float values and lossless gzip level 4)
• Scheduling - each dedicated core waits for its time slot before writing
Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
Conclusion and future work
- 25
• The Damaris approach - Dedicated cores - Shared memory - Highly adaptable system thanks to plugins
• Results - Fully hides the I/O jitter and I/O-related costs - 15x sustained write throughput (compared to collective I/O) - Almost perfect application scalability - Execution time divided by 3.5 compared to collective I/O - Overhead-free 600% compression ratio
• Future work - Efficient coupling of simulation and analysis tools - Distributed I/O scheduling on dedicated cores
September 25th 2012 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O
Matthieu Dorier [email protected] KerData Team Inria Rennes, IRISA ENS Cachan
Matthieu Dorier, Gabriel Antoniu, Franck Cappello, Marc Snir, Leigh Orf
Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O
IEEE CLUSTER 2012 – Beijing September 24 - 28
September 25th 2012
Thank you 谢谢
Merci