Automatic Tuning of Parallel Read and Write Performance
Suren Byna

Motivation: Parallel I/O
• An essential component of modern HPC
  - Scientific simulations periodically write their state as checkpoint datasets
  - Input and output datasets have become larger and larger
• Optimal I/O performance is critical for utilizing the power of HPC machines

Performance Modeling: Nonlinear regression model
• We consider smooth, nonlinear models, which can be written as linear combinations of $n_b$ nonlinear basis functions $\varphi_k$:

$m(x;\beta) = \sum_{k=1}^{n_b} \beta_k \varphi_k(x)$  (Eq. 1)
• Once a basis $\varphi$ has been selected, the hyper-parameters $\beta$ can be determined by standard regression- or optimization-based approaches.
• There can be many choices of basis functions; for simplicity, we focus on terms that are low-degree polynomials in either the parameter, $x_i$, or the inverse of the parameter, $1/x_i$:

$\varphi(x) = \prod_{i=1}^{n_x} (x_i)^{p_i}, \quad p_i \in \{-1, 0, 1\},\ i = 1, \dots, n_x$  (Eq. 2)
• One of our goals in building a model is simplicity:
  - Incorporate only a handful of basis terms, $n_b$, from the set (Eq. 2).
• Starting from an initially empty set $P$, we follow a greedy procedure (a forward model selection approach) of adding to $P$ the term that most reduces the prediction error. Formally, with $p$ a candidate term, we determine the $p$ that solves

$\min_{p \in \{-1,0,1\}^{n_x}} \sum_{j=1}^{n_y} \left( \frac{\hat{m}(x_j;\, P \cup p) - y_j}{y_j} \right)^2$  (Eq. 3)
• We do this until we reach a desired limit on the number of terms to include.
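The greedy forward-selection loop above can be sketched in Python. This is a minimal illustration assuming NumPy; `forward_select`, `basis_term`, and the toy data are our own stand-ins, not the poster's implementation.

```python
# Minimal sketch of greedy forward model selection (Eqs. 1-3).
# All function names and the toy data are illustrative stand-ins.
import itertools

import numpy as np

def basis_term(X, p):
    """One basis function from Eq. 2: prod_i x_i**p_i with p_i in {-1, 0, 1}."""
    return np.prod(X ** np.asarray(p, dtype=float), axis=1)

def forward_select(X, y, n_terms):
    """Greedily pick n_terms exponent vectors that minimize the
    relative squared prediction error of Eq. 3."""
    nx = X.shape[1]
    candidates = list(itertools.product((-1, 0, 1), repeat=nx))
    chosen = []
    for _ in range(n_terms):
        best_err, best_p = None, None
        for p in candidates:
            if p in chosen:
                continue
            # Least-squares fit of beta over the chosen terms plus candidate p
            Phi = np.column_stack([basis_term(X, q) for q in chosen + [p]])
            beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
            err = np.sum(((Phi @ beta - y) / y) ** 2)
            if best_err is None or err < best_err:
                best_err, best_p = err, p
        chosen.append(best_p)
    return chosen

# Toy usage: samples from y = x0 + 1/x1, two tunable parameters
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 4.0, size=(200, 2))
y = X[:, 0] + 1.0 / X[:, 1]
terms = forward_select(X, y, n_terms=2)
print(terms)
```

With noiseless data the two selected terms may recover the exponent vectors for x0 and 1/x1, but a greedy search offers no such guarantee in general.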
Figure: Visualization of the one-trillion-electron dataset at time step 1905. All particles with energy > 1.3 are shown in grey, while particles with energy > 1.5 are shown in color. A total of ~165 million particles with energy > 1.3 and ~423 thousand particles with energy > 1.5 appear to be accelerated preferentially along the direction of the mean magnetic field, corresponding to the formation of four jets.
Complex Parallel I/O Software Stack
• Different optimization parameters at each layer of the I/O software stack
• Complex inter-dependencies between these layers
• Highly dependent on the application, HPC platform, and problem size/concurrency
• Application developers need good I/O performance without becoming experts on the I/O stack
Results

[Figure: 3D scatter plot of accelerated particles over the (Ux, Uy, Uz) velocity axes, each spanning -1.0 to 1.0, colored by particle energy from 1.300 to 1.880.]
Contributors: Prabhat, John Wu, Bin Dong (LBNL), Babak Behzad and Marc Snir (UIUC), Houjun Tang, Chris Zou, Nagiza Samatova (NCSU), Quincey Koziol (The HDF Group)
This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. The authors would like to thank NERSC and Cray staff for troubleshooting I/O issues on Hopper and Edison. We also thank TACC and Stampede admin staff. We would also like to thank members of the HDF Group and Burlen Loring and Oliver Rübel of the Visualization group at LBNL CRD.
Tuning Writing Data with Model-based Training

Overview of dynamic model-driven I/O tuning:
[Diagram: the tuning workflow. In the training phase, an I/O kernel runs over a training set on the HPC system and its storage system to develop an I/O model. Exploration and pruning of all possible parameter values against the model yield the top k configurations; their measured performance results select the best-performing configuration. The model can be refit (controlled by the user) as system characteristics change.]
• Three phases
  - Training phase: using a training set, develop an empirical write performance model
  - Pruning phase: using the model-predicted write times, select the top k configurations with the best predicted performance; store the tuned parameters of the best-performing configuration
  - Refitting phase (optional): refit the model to capture the current characteristics of the I/O system
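The pruning phase can be sketched as follows. This is a toy illustration: the configuration list, `model`, and the stand-in `benchmark` are hypothetical, not the trained model from the poster.

```python
# Minimal sketch of the pruning phase: rank all candidate configurations by
# the model's predicted write time and keep only the top k for real runs.

def prune(configs, model, k):
    """Return the k configurations with the smallest predicted write time."""
    return sorted(configs, key=model)[:k]

def select_best(top_k, benchmark):
    """Run the short-listed configurations for real and keep the fastest."""
    return min(top_k, key=benchmark)

# Toy usage with a hypothetical two-parameter cost model
configs = [{"stripe_count": c, "cb_nodes": n}
           for c in (4, 16, 64) for n in (4, 16)]
model = lambda cfg: 100.0 / cfg["stripe_count"] + 2.0 * cfg["cb_nodes"]
top3 = prune(configs, model, k=3)
best = select_best(top3, benchmark=model)  # a real run would time actual I/O
print(best)
```

In a real deployment `benchmark` would execute the I/O kernel and time it; here the model itself stands in so the sketch is self-contained.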
Results

[Charts: I/O bandwidth (GB/s) achieved by the tuned configurations for the VPIC-IO, VORPAL-IO, and GCRM-IO kernels on Edison, Hopper, and Stampede, compared against the default VPIC-IO, VORPAL I/O, and GCRM I/O settings; speedups of 55x, 31x, and 21x over the defaults. A further chart shows the impact of the aggregators-to-stripe-count ratio on performance improvement.]
Tuning Read Performance

[Chart: time to read data (sec.) for the queries [:, :, 100:600], [:, :, 4569:5789], and [:, :, 20000:30000], comparing the SDS reorganized file against the original file.]
• A strategy for collecting read traces and analyzing them for patterns
• Based on the observed patterns, create partial replicas of the data in a layout that similar future read patterns would benefit from
• Transparently redirect future reads to the replicas
  - The original dataset is 3D, in (x, y, z) dimensions
  - The replicated dataset is transposed to (z, x, y) dimensions
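The transposed-replica idea can be illustrated with NumPy arrays standing in for the two on-disk layouts; `read_z_range` and the array shapes are our invention.

```python
# Sketch of the replica-redirect idea: the original dataset is laid out in
# (x, y, z) order; the replica is transposed to (z, x, y) so that a query
# selecting a z range touches contiguous data instead of strided data.
# NumPy arrays stand in for the on-disk layouts; names are illustrative.
import numpy as np

original = np.arange(4 * 5 * 6).reshape(4, 5, 6)             # (x, y, z)
replica = np.ascontiguousarray(original.transpose(2, 0, 1))  # (z, x, y)

def read_z_range(z_lo, z_hi):
    """Serve a [:, :, z_lo:z_hi] query from the transposed replica.

    In the replica the requested z planes are the leading axis, so the
    read is one contiguous block; the result is transposed back to the
    (x, y, z) order the caller expects."""
    return replica[z_lo:z_hi].transpose(1, 2, 0)

# The redirected read returns exactly what the original layout would
assert (read_z_range(2, 5) == original[:, :, 2:5]).all()
```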
[Chart: performance improvement with the Electrocorticography (ECoG) I/O kernel.]