Damaris: How to Efficiently Leverage Multicore Parallelism ... · Damaris 0 2 4 6 8 10 0 10 20 30 )...

Matthieu Dorier [email protected] KerData Team Inria Rennes, IRISA ENS Cachan

Matthieu Dorier, Gabriel Antoniu, Franck Cappello, Marc Snir, Leigh Orf

Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O

IEEE CLUSTER 2012 – Beijing September 24 - 28

September 25th 2012

Context: HPC simulation on Blue Waters

•  INRIA – UIUC - ANL Joint Lab for Petascale Computing

•  Targeting large-scale simulation of unprecedented accuracy

•  One of our challenges: At very large scale, scalability of HPC simulations is mainly driven by I/O

September 25th 2012 - 2 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O

Writing from HPC simulations: File per process vs. Collective I/O

- 3

•  Two main approaches for I/O in HPC simulations •  Implemented in MPI-I/O, available in HDF5, NetCDF,…

September 25th 2012

•  Too many files •  High metadata overhead •  Hard to read back

•  Requires coordination •  Data communications

Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O

Users attempt to overcome performance issues

- 4

Image credit: Phil Carns (ANL), “Understanding and Improving Computational Science Storage Access through Continuous Characterization”

September 25th 2012 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O

File-per-process

Collective

“Hybrid”

Periodic synchronous snapshots lead to I/O bursts and high variability (jitter)

- 5 September 25th 2012

The “cardiogram” of a PVFS data server during a run of the CM1 simulation


Input Output

Not only are performances bad, they’re also unpredictable!

- 6

I/O variability, or “jitter” •  Write time unpredictability

- Between two processes - Between two I/O phases - Between simulations

•  Due to - Cross-applications interferences - Many processes writing at the same time - Different amounts of data - File system’s configuration - …


Can we hide the I/O jitter?

Yes: Damaris

- 7 September 25th 2012 Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O

- 8

DAMARIS “Dedicated Adaptable Middleware for Application Resources Inline Steering”


- 9

•  Next-generation supercomputers have multicore SMP nodes

•  Network access contention at the level of a node •  Possibility of efficient interactions through shared memory •  One core dedicated for gathering data •  This core only writes

The Damaris approach: dedicated I/O cores


September 25th 2012

Moving jitter to a dedicated core


Leave a core, go faster!


Damaris at a glance


Application Programming Interface


•  Initializing, finalizing - DC_initialize(“config.xml”,rank), DC_finalize()- DC_mpi_init_and_start(“config.xml”, communicator)

•  Writing data - DC_write(“group/variable”,iteration,data)- DC_chunk_write(chunk,“group/variable”,iteration,data)

•  Sending events - DC_signal(“event name”,iteration)

•  Direct access to shared memory - DC_alloc(“group/variable”,iteration) à void*

- DC_commit(“group/variable”,iteration)


Damaris: current state of the software

- 17

•  Version 0.6 available at http://damaris.gforge.inria.fr/ - Along with documentation, tutorials and examples

•  Written in C++, uses - Boost for IPC, Xerces-C and XSD for XML parsing

•  API for Fortran, C, C++

•  Tested on - Debian Linux, Kraken (Cray XT5 - NICS), JaguarPF (Cray XK6 - ORNL), JYC (Blue Waters testing system: Cray XE6 - NCSA), Intrepid (BlueGene/P – ANL)

•  Tested with - CM1 (climate), OLAM (climate), GTC (fusion), Nek5000 (CFD)


- 18

Experimental evaluation


Running the CM1 simulation on Kraken, G5K and BluePrint with Damaris

- 19

•  The CM1 simulation - Atmospheric simulation - One of the Blue Waters target applications - Uses HDF5 (file-per-process) and pHDF5 (for collective I/O)

•  Kraken - Cray XT5 at NICS - 12 cores/node - 16 GB/node - Luster file system


•  Grid 5000 - 24 cores/node - 48 GB/node - PVFS file system

•  BluePrint - Power5 - 16 cores/node - 64 GB/node - GPFS file system

Damaris achieves almost perfect scalability

- 20

0

2000

4000

6000

8000

10000

0 5000 10000 Sc

alab

ility

fact

or

Number of cores

Perfect scaling

Damaris

File-per-process

Collective-I/O

Weak scalability factor S = N TbaseT

N: number of cores Tbase: time of an iteration on one core w/o write T: time of an iteration + a write

0

200

400

600

800

1000

576 2304 9216

Run

tim

e (s

ec)

Number of cores

September 25th 2012

Kraken Cray XT5 Application run time

(50 iterations + 1 write phase)


- 21

0 100 200 300 400 500 600 700 800 900

576 2304 9216

Tim

e to

writ

e (s

ec)

Number of cores

Collective-I/O File-per-process Damaris

0

2

4

6

8

10

0 10 20 30

Tim

e to

writ

e (s

ec)

Total amount of data (GB)

Kraken Cray XT5 Average and maximum write time

28MB per process

BluePrint Power5, 1024 cores Average, max and min write time

Damaris hides the I/O jitter


- 22

0.0625 0.125

0.25 0.5

1 2 4 8

16

0 5000 10000

Agg

rega

te th

roug

hput

(G

B/s

)

Number of cores

File-per-process Damaris Collective-I/O

Average aggregate throughput from the writer processes

Damaris increases effective throughput

September 25th 2012

Kraken Cray XT5 Average aggregate throughput

from writers


- 23

0 50

100 150 200 250 300

576 2304 9216

Tim

e (s

ec)

Number of cores

0

50

100

150

200

250

0.05 5.8 15.1 24.7 Ti

me

(sec

) Total amount of data (GB)

Spare time Used time

BluePrint Power5 (1024 cores) Kraken Cray XT5

Time spent by Damaris writing data and time spent waiting

Damaris spares time? Let’s use it!

September 25th 2012

Damaris spares time for data management



Use of the spare time: Adding compression and scheduling

0

5

10

15

20

25

30

35

Grid5000 (912 cores) Kraken (2304 cores)

Writ

e tim

e (s

ec)

No Compression With compression With Scheduling

•  With compression enabled: up to 600 percent compression ratio! - (reduction to 16bits float values and lossless gzip level 4)

•  Scheduling - each dedicated core waits for its time slot before writing


Conclusion and future work

- 25

•  The Damaris approach - Dedicated cores - Shared memory - Highly adaptable system thanks to plugins

•  Results - Fully hides the I/O jitter and I/O-related costs - 15x sustained write throughput (compared to collective I/O) - Almost perfect application scalability - Execution time divided by 3.5 compared to collective I/O - Overhead-free 600% compression ratio

•  Future work - Efficient coupling of simulation and analysis tools - Distributed I/O scheduling on dedicated cores


Matthieu Dorier [email protected] KerData Team Inria Rennes, IRISA ENS Cachan

Matthieu Dorier, Gabriel Antoniu, Franck Cappello, Marc Snir, Leigh Orf

Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O

IEEE CLUSTER 2012 – Beijing September 24 - 28

September 25th 2012

Thank you 谢谢

Merci

Damaris: How to Efficiently Leverage Multicore Parallelism ... · Damaris 0 2 4 6 8 10 0 10 20 30 )...

Documents

Transcript of Damaris: How to Efficiently Leverage Multicore Parallelism ... · Damaris 0 2 4 6 8 10 0 10 20 30 )...