RAMSES: Robust Analytic Models for Science at Extreme Scales
Gagan Agrawal1* Prasanna Balaprakash2 Ian Foster2* Raj Kettimuthu2
Sven Leyffer2 Vitali Morozov2 Todd Munson2 Nagi Rao3*
Saday Sadayappan1 Brad Settlemyer3 Brian Tierney4* Don Towsley5*
Venkat Vishwanath2 Yao Zhang2
1 Ohio State University 2 Argonne National Laboratory 3 Oak Ridge National Laboratory 4 ESnet 5 UMass Amherst (* Co-PIs)
Advanced Scientific Computing Research
Program manager: Rich Carlson
[Diagram: source data store, wide-area network, destination data store]
Prediction, explanation, & optimization are
challenging for even “simple” E2E workflows
For example, file transfer, for which we want to:
• Predict achievable throughput for a specific configuration
• Explain factors influencing performance
• Optimize parameter values to achieve high speeds
[Diagram: end-to-end transfer path. Source data transfer node: application, OS, FS stack, TCP/IP, NIC, HBA/HCA; LAN switch and router; wide-area network; then router, LAN switch, and the destination data transfer node with the same stack, backed by a storage array and a Lustre file system (OSS and MDS servers, OST and MDT targets)]
Prediction, explanation, & optimization are
challenging for even “simple” E2E workflows
+ diverse environments + diverse workloads + contention
85 Gbps sustained disk-to-disk over 100
Gbps network, Ottawa—New Orleans
Raj Kettimuthu
and team,
Argonne
High-speed transfers to/from AWS cloud,
via Globus transfer service
• UChicago → AWS S3 (US region): Sustained 2 Gbps
– 2 GridFTP servers, GPFS file system at UChicago
– Multi-part upload via 16 concurrent HTTP connections
• AWS → AWS (same region): Sustained 5 Gbps
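The multi-part idea above can be sketched in a few lines; this is a hypothetical illustration using a thread pool, not the actual Globus or AWS client code (the chunk size and the `upload_part` stub are invented for the example).

```python
# Hypothetical sketch: split an object into parts and "upload" them over
# several concurrent connections, as in the 16-connection multi-part setup
# above. upload_part is a stub; a real client would issue an HTTP PUT per part.
from concurrent.futures import ThreadPoolExecutor

CHUNK = 8 * 1024 * 1024            # 8 MiB per part (assumed value)
CONNECTIONS = 16                   # concurrent connections, as in the slide

def upload_part(part):
    index, data = part
    # Stand-in for the network transfer of one part.
    return index, len(data)

data = bytes(20 * 1024 * 1024)     # a 20 MiB object to move
parts = [(i, data[off:off + CHUNK])
         for i, off in enumerate(range(0, len(data), CHUNK))]

with ThreadPoolExecutor(max_workers=CONNECTIONS) as pool:
    uploaded = dict(pool.map(upload_part, parts))

print(f"uploaded {len(uploaded)} parts, {sum(uploaded.values())} bytes")
```

Concurrency pays off here because each individual HTTP connection is window- or latency-limited; many parts in flight keep the aggregate pipe full.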
[Screenshot: Globus endpoint go#s3]
Endpoint aps#clutch has transfers to 125 other endpoints
[Chart: one Advanced Photon Source data node, aps#clutch, with transfers to 125 destinations; one point is a transfer back to the same node over its 1 Gbps link]
How to create more accurate, useful, and
portable models of such systems?
Simple analytical model:
  T = α + β·l   [startup cost + sustained-bandwidth term, for transfer size l]
Experiment + regression
to estimate α, β
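The experiment-plus-regression step can be sketched in a few lines; the timings below are invented for illustration, not measured values.

```python
# Least-squares fit of the simple model T = alpha + beta*l, where alpha is
# the startup cost and beta the reciprocal of the sustained bandwidth.
import numpy as np

l = np.array([1e8, 5e8, 1e9, 5e9, 1e10])       # transfer sizes (bytes)
T = np.array([2.1, 5.8, 10.4, 47.0, 93.5])     # observed times (s), illustrative

A = np.column_stack([np.ones_like(l), l])      # design matrix [1, l]
(alpha, beta), *_ = np.linalg.lstsq(A, T, rcond=None)

print(f"alpha = {alpha:.2f} s, sustained bandwidth = {1/beta/1e9:.2f} GB/s")
T_pred = alpha + beta * 2e9                    # predicted time for a 2 GB transfer
```

Once alpha and beta are fitted, the same two numbers serve all three goals on the previous slide: prediction (evaluate the formula), explanation (startup-dominated vs. bandwidth-dominated), and optimization (shrink alpha by batching small files, shrink beta by adding streams).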
[Diagram: first-principles modeling to better capture details of system & application components, and data-driven modeling to learn unknown details of system & application components, linked by model composition and model-data comparison]
The RAMSES vision
To develop a new science of end-to-end
analytical performance modeling that will
transform understanding of the behavior of
science workflows in extreme-scale science
environments.
Based on integration of first-principles and
data-driven modeling, and a structured
approach to model evaluation & composition
The RAMSES research agenda & platform
Modeling: Develop, evaluate, and refine component and end-to-end models
Estimation: Develop and apply data-driven estimation methods: differential regression, surrogate models, etc.
Tools: Develop easy-to-use tools to provide end-users with actionable advice
Experiments: Extensive, automated experiments to test models & build database
[Diagram: platform components: Advisor, Evaluators, Estimators, Tester, Database]
We are informed by five challenge workflows
Transfer: High-performance, end-to-end
file transfer
Scattering: Capture and analysis of
diffuse scattering experimental data
MapReduce: Data-intensive, distributed
data analytics
Exascale: Performance of exascale
application kernels on memory hierarchies
In-situ: Configuration and placement of in-
situ analysis computations
Predict: Throughput for configuration
Explain: Factors influencing performance
Optimize: Parameters for high speeds
[Diagram: the end-to-end transfer path again, from the source data transfer node across the wide-area network to the destination data transfer node, storage array, and Lustre file system]
Transfer: End-to-end file movement
Scattering: Linking simulation and
experiment to study disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
[Diagram: sample → experimental scattering; material composition (La 60%, Sr 40%) → simulated structure → simulated scattering. A knowledge base of past experiments, simulations, literature, and expert knowledge supports knowledge-driven decision making and evolutionary optimization: detect errors (secs-mins), select experiments (mins-hours), run simulations driven by experiments (mins-days), and contribute results back to the knowledge base]
Immediate assessment of alignment quality in
near-field high-energy diffraction microscopy
[Diagram: a single near-field HEDM workflow spanning a Blue Gene/Q and the Orthros cluster, all data in NFS. The detector produces a dataset of 360 files, 4 GB total. Step 1: median calculation (MedianImage.c, 75 s, 90% I/O; uses Swift/K). Step 2: peak search (ImageProcessing.c, 15 s per file; uses Swift/K), yielding a reduced dataset of 360 files, 5 MB total. Step 3: convert files to network-endian format (2 min for all files) and generate parameters (FOP.c, 50 tasks at 25 s/task, about 1/4 CPU hour; uses Swift/K). Step 4: analysis pass (FitOrientation.c, 60 s/task on both PC and BG/Q, 1667 CPU hours; uses Swift/T, with Globus transfer and ssh moving data), feeding results back to the experiment. The Globus Catalog tracks scientific metadata and workflow progress; workflow control is a manual Bash script. Up to 2.2 M CPU hours per week. Before/after alignment images shown]
Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer
MapReduce: Distributing data and
computation for data analytics
[Diagram: a local cluster and a cloud environment, each with a master node assigning jobs to slave nodes that hold data and perform local reduction; an index supports remote data analysis, job assignment, and global reduction across the two environments]
Exascale simulation
HACC Cosmology
• Compute-intensive phase with
regular stride-one access
• Tree-walk phase: irregular
memory access with high
branching and integer ops
• Communication-intensive 3D FFT
phase
• I/O phase
Images courtesy Joseph Insley (Argonne)
Nek5000 CFD
• Matrix-vector product phase
• Conjugate gradient iteration
• Communication phase
involving nearest-neighbor
exchange and vector
reductions
[Diagram: in situ analysis infrastructure: compute resource (multi-petaflop, with high-radix Dragonfly or 5D torus interconnect); I/O nodes; InfiniBand switch complex; file server nodes and storage system (1536 GB/s); analysis nodes/cluster; and DTN nodes]
In situ analysis on the DOE Leadership
Computing Infrastructure
We need to perform the right computation at
the right place and time, taking into account
details of the simulation, resources, and analysis
A diverse set of components
[Table: workflows vs. the components they exercise. Components: server; parallel computer; router; storage system; LAN; WAN; TCP/UDT; GridFTP; file systems; GridFTP server; the Nekbone and HACC kernels; checksum; encryption; MapReduce; other apps. Each workflow row marks the components it touches: Transfer (11 components), Scattering (8), Exascale (6), Distributed MapReduce (9), In-Situ (8)]
Modeling: Develop, evaluate, and refine component and end-to-end models
• Models from the literature
• Fluid models for network flows
• SKOPE modeling system
Estimation: Develop and apply data-driven estimation methods
• Differential regression
• Surrogate models
• Other methods from the literature
Tools: Develop easy-to-use tools to provide end-users with actionable advice
• Runtime advisor, integrated with Globus transfer system
Experiments: Automated experiments to test models and build database
• Experiment design
• Testbeds
SKOPE performance modeling framework
[Diagram: the SKOPE front end parses code skeletons, written in the SKOPE language with workload inputs, into a per-function intermediate representation (block skeleton trees). A behavior modeling engine produces an execution-based intermediate representation (a Bayesian execution tree), which a transformation engine rewrites into transformed Bayesian execution trees. The back end's characterization engine combines these with hardware models and system specifications to output performance projections, bottleneck analyses, schemas for suggested transformations, and synthesized characteristics. Deriving skeletons from source code requires user effort, semi-automated with a source-to-source translator; the rest is automatic]
Differential regression for combining
data from different sources
Example of use: predict performance on a connection length L
not realizable on physical infrastructure,
e.g., IB-RDMA or HTCP throughput on a 900-mile connection
1) Make multiple measurements of performance on path lengths d:
– M_S(d): OPNET simulation
– M_E(d): ANUE-emulated path
– M_U(d_i): real network (USN)
2) Compute measurement regressions on d: Ṁ_A(·), A ∈ {S, E, U}
3) Compute differential regressions: ∆Ṁ_A,B(·) = Ṁ_A(·) − Ṁ_B(·), A, B ∈ {S, E, U}
4) Apply differential regression to obtain estimates, C ∈ {S, E}:
  M̂_U(d) = M_C(d) − ∆Ṁ_C,U(d)
where M_C(d) is a simulated/emulated point measurement and M̂_U(d) the resulting regression estimate
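As a toy illustration of steps 1-4, with all numbers invented (the real measurements come from OPNET, ANUE, and USN):

```python
# Sketch of differential regression: combine dense simulated measurements
# M_S(d) with sparse real measurements M_U(d) to estimate real-network
# throughput at a distance never measured directly. All data are made up.
import numpy as np

def fit_poly(d, m, deg=2):
    """Regression M'(.) of measurements m over connection length d."""
    return np.poly1d(np.polyfit(d, m, deg))

# 1) Measurements vs. path length d (miles); illustrative values in Gbps.
d_sim = np.array([100, 300, 500, 700, 1000, 1500])
m_sim = 10.0 - 0.002 * d_sim                      # simulator output
d_real = np.array([100, 400, 800])                # only a few real paths
m_real = 9.2 - 0.0023 * d_real                    # real-network data

# 2) Regressions on d.
M_S = fit_poly(d_sim, m_sim)
M_U = fit_poly(d_real, m_real, deg=1)

# 3) Differential regression over lengths where both sources exist.
delta_SU = fit_poly(d_real, M_S(d_real) - M_U(d_real), deg=1)

# 4) Estimate real performance at a 900-mile path never measured directly.
d_star = 900
est = M_S(d_star) - delta_SU(d_star)
print(f"estimated real throughput at {d_star} miles: {est:.2f} Gbps")
```

The point of the differential term is that the simulator's systematic bias varies smoothly with d, so subtracting the fitted difference corrects the simulator's prediction at lengths where no real path exists.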
We will extend the differential regression
method in several areas
• To compare different component models
– E.g., different models of network elements, storage
systems, protocol implementations
• To compare different composite models
– E.g., different methods for combining memory and
CPU models
• To compare model outputs with measurements
[Diagram: each component i has a component model taking system parameters p_i and task-size parameters s_i and producing cost terms; a performance quality model Q_i(p_i, s_i) is a regression estimate of component performance, built from analytical and empirical models and refined by experiment design (active learning)]
End-to-end profile composition
[Diagram: a source LAN profile, a WAN profile, and a destination LAN profile are combined by composition operations; configurations for host and edge devices feed the LAN profiles, and configurations for WAN devices feed the WAN profile]
End-to-end model composition & analysis
• End-to-end model using composition
– It is an approximation, since component interactions are
not modeled by the composition operator
• Actual end-to-end performance model
– Component models are "corrected" to account for
unmodeled effects; this form is assumed to exist
Using end-to-end measurements and differential
regression to correct regression estimates
• Regression estimate Q̂(p, s) of the composed model:
– "Estimated", since component models are "incomplete"
as derived from first principles and/or measurements
• Error due to the regression estimate:
  e(p, s) = [ Q_p,s − Q̂(p, s) ]²
where Q_p,s is the actual end-to-end performance
• Error can be mitigated using measurements.
Corrected estimate of Q_p,s:
  Q̃(p, s) = Q̂(p, s) + ∆(p, s)
with Q̂ the analytical model and ∆(p, s) a correction obtained by
differential regression using measurements
Performance guarantees
• Vapnik-Chervonenkis theory: under finite VC-dim(F)
– Guarantees that the error of the regression estimate is close to
optimal with a certain probability
– Distribution-free: does not require detailed knowledge
of error distributions; uses end-to-end measurements
• Error of the corrected estimate:
  P{ I(∆̂, Q̂, p) − I(∆*, Q̂, p) > ε } < δ(F, l, ε)
where ∆̂ is the estimated correction, ∆* the optimal one, and
  I(∆, Q̂, p) = ∫ [ Q_p,s − Q̂(p, s) − ∆(p, s) ]² dP_Q_p,s
Surrogate modeling framework
to inform choice of experiments
[Diagram: machine learning & optimization propose informative configurations; first-principles models and evaluation return performance metrics]
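One way to read this loop, sketched below with invented data and a deliberately simple bootstrap-disagreement criterion (this is not the project's actual machinery; the transfer-parameter names and the stand-in `measure` function are assumptions):

```python
# Surrogate-guided experiment selection: fit a cheap ensemble of linear
# surrogates to configurations measured so far, then propose the untried
# configuration where the ensemble disagrees most (highest uncertainty).
import numpy as np

rng = np.random.default_rng(0)

def measure(conc, streams):
    """Stand-in for an expensive transfer experiment (synthetic response)."""
    return conc * 2.0 + streams * 0.5 + rng.normal(0, 0.1)

# Configurations measured so far: (concurrency, parallel streams).
tried = [(1, 1), (2, 4), (4, 2), (8, 8)]
perf = [measure(c, s) for c, s in tried]

candidates = [(c, s) for c in (1, 2, 4, 8, 16) for s in (1, 2, 4, 8)]

# Bootstrap ensemble of linear surrogates; prediction spread ~ uncertainty.
X = np.array(tried, float)
y = np.array(perf)
C = np.column_stack([np.ones(len(candidates)), np.array(candidates, float)])
preds = []
for _ in range(20):
    idx = rng.integers(0, len(X), len(X))       # resample measured configs
    A = np.column_stack([np.ones(len(idx)), X[idx]])
    coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
    preds.append(C @ coef)
spread = np.std(preds, axis=0)
next_config = candidates[int(np.argmax(spread))]
print("most informative next experiment:", next_config)
```

Each chosen experiment feeds back into `tried`, so the surrogate is refined exactly where the first-principles models are least trusted.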
Fluid models of network flows
GridFTP flow i: parallelism k_i, round-trip time R_i, throughput T_i
Bottleneck router: capacity C, loss rate p
Queue dynamics at the bottleneck router:
  dQ/dt = 1_{Q > 0} [ Σ_j T_j − C ]
Throughput dynamics of flow i:
  dT_i/dt = k_i / R_i² − T_i(t) T_i(t − R_i) p(t − R_i) / (2 k_i)
Special case, known p: steady-state throughput T_i = (k_i / R_i) √(2/p)
Solve for throughputs and transfer delays
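A minimal numerical check of the throughput ODE above, ignoring the feedback delay (taking T(t − R) ≈ T(t)) and assuming a fixed, known loss rate p; the parameters are illustrative:

```python
# Euler integration of dT/dt = k/R^2 - T(t)^2 p / (2k) for one flow.
# The trajectory should approach the closed form T = (k/R) * sqrt(2/p).
import math

k, R, p = 4, 0.05, 1e-4        # parallel streams, RTT (s), loss rate
T, dt = 0.0, 0.001             # throughput state (segments/s), time step

for _ in range(100_000):       # 100 s of simulated time
    dT = k / R**2 - T * T * p / (2 * k)
    T += dt * dT

steady = (k / R) * math.sqrt(2 / p)
print(f"simulated {T:.0f} vs closed form {steady:.0f} segments/s")
```

The closed form also shows why parallelism helps: throughput scales linearly in k for a given loss rate, which is the analytic handle the optimizer needs when choosing GridFTP stream counts.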
Our multi-modal approach
[Diagram: application behavior models (source code → code skeletons, SKOPE language, workload parameters → SKOPE) and system models of current or future systems (experiments, historical logs, benchmarks, simulators, emulators) feed analytical models, regression models, and model composition to produce performance projections]
Application to file transfer
[Diagram: the same multi-modal pipeline specialized to file transfer: application behavior models from GridFTP source code and code skeletons; system models of storage, TCP, and WAN built from experiments, historical logs, iperf, XDD, and emulators; output is file transfer performance projections]
[Diagram: the multi-modal pipeline specialized to exascale simulation: application behavior models from source code, code skeletons, the SKOPE language, and workload parameters; system models of compute, memory, and interconnect built from experiments, historical logs, MPI benchmarks, IOR, DGEMM, and Stream; output is exascale simulation performance projections]
As a pedagogical example, the code skeleton for dense matrix multiplication (denoted MatMul) is shown in Listing 2. The corresponding CPU code is shown in Listing 1 in C. The syntax of a code skeleton is not the focus of this paper; it is briefly introduced in the comments of the example code skeletons and is not discussed in further detail.
Listing 1: MatMul's CPU code
 1  float A[N][K], B[K][M];
 2  float C[N][M];
 3  int i, j, k;
 4  for (i = 0; i < N; ++i) {
 5    for (j = 0; j < M; ++j) {
 6      float sum = 0;
 7      for (k = 0; k < K; ++k) {
 8        sum += A[i][k] * B[k][j];
 9      }
10      C[i][j] = sum;
11    }
12  }
Listing 2: MatMul's code skeleton
 1  float A[N][K]
 2  float B[K][M]
 3  float C[N][M]
 4  /* the loop space */
 5  parallel_for (N, M)
 6    : i, j
 7  {
 8    /* computation w/
 9     * instruction count
10     */
11    comp 1
12    /* streaming loop */
13    stream k = 0:K {
14      /* load */
15      ld A[i][k]
16      ld B[k][j]
17      comp 3
18    }
19    comp 5
20    /* store */
21    st C[i][j]
22  }
Listing 3: MatMul's optimized GPU code
 1  float A[N][K], B[K][M], C[N][M];
 2  dim3 block(BlkSize, BlkSize);
 3  dim3 grid(N/BlkSize, M/BlkSize);
 4  MatrixMul<<<grid, block>>>(A, B, C);
 5
 6  __global__ MatrixMul(A, B, C)
 7  {
 8    __shared__ a[BlkSize][BlkSize];
 9    __shared__ b[BlkSize][BlkSize];
10    int ty = threadIdx.y;
11    int tx = threadIdx.x;
12    int y = blockIdx.y * blockDim.y + ty;
13    int x = blockIdx.x * blockDim.x + tx;
14    float sum = 0.f;
15    for (int n = 0; n < K; n += BlkSize) {
16      a[ty][tx] = A[y][n+tx];
17      b[ty][tx] = B[n+ty][x];
18      __syncthreads();
19      for (int k = 0; k < BlkSize; ++k) {
20        sum += a[ty][k] * b[k][tx];
21      }
22      __syncthreads();
23    }
24    C[y][x] = sum;
25  }
The following information forms a code skeleton that expresses a computational kernel.
Data parallelism is expressed as a set of parallel, homogeneous tasks repeated over different data elements. Users should express data parallelism at its finest granularity (i.e., down to the innermost parallel for loops).
A task corresponds to one iteration of the innermost parallel for loop. It is expressed as a sequence of data accesses and computation.
Data accesses are expressed as a set of load and store operations. The accessed array elements are expressed given loop indices, array sizes, and other constants. Indirect data accesses can be expressed as well; GROPHECY will assume indirect accesses are random unless users provide further hints (see Section 9.4 and Listing 6).
Computation instructions are counted by using methods described in Section 7.3. Together with the number of memory instructions, they indicate the computational intensity of the kernel.
Branch instructions are counted to judge the applicability of loop unrolling.
For loops wrap around blocks of computation and data accesses to mark repetition within a task. They can be nested, and the nesting does not have to be perfect.
Streaming loops are a special type of for loop; they are marked where a sequence of data elements is fetched and processed and can be discarded immediately. This is a common pattern for reduction. Streaming loops can be temporally decomposed into stages for the purpose of caching. Line 7 in Listing 1 is an example of a streaming loop.
Macros define array sizes and the numbers of loop iterations. By adjusting the macros, the same code skeleton can be used for workloads at different scales.
Once constructed, the code skeleton can then be transformed to mimic GPU optimizations. Note that the mimicked GPU implementation can differ significantly from the original CPU code. As an example, Listing 3 shows the GPU kernel of MatMul, where for loops are not only spatially decomposed among threads but also temporally decomposed into stages for the purpose of caching. Both transformations are common and critical in manual GPU optimization.
6. Code Transformations
Given the code skeleton, GROPHECY transforms and lays out code for a target GPU (recall Figure 1, Step 2). This section describes how code layouts are represented (Section 6.1), how the space of possible layouts is searched (Section 6.2), and additional representations and metrics needed to carry out this search (Sections 6.3-6.7).
6.1 Code Layout Parameterization
Code transformation involves the following factors, whose values jointly define a particular code layout.
Thread block sizes, represented as B = {b_1, ..., b_n}, where n is the dimensionality of the loop space and b_i is the length of the thread block in the i-th dimension; size(B) denotes the number of threads in a thread block. We vary the thread block size given the loop space and the hardware constraint on the number of threads per block.(1)
Staging, or temporally decomposing streaming loops into sequential stages of iterations. Within one stage, a thread block only needs to cache the portion of data elements used in that stage. Staging can be expressed as two integer vectors. For a code skeleton with n streaming loops, S = {s_1, ..., s_n} contains s_i, which defines the staging size, or the number of iterations in one stage, for the i-th streaming loop. Moreover, some consecutive streaming loops actually form a multidimensional streaming loop, whose traversal orders are interchangeable with regard to outer loops and inner loops. Different traversal orders may result in different performance as a result of data locality and caching. Therefore, O = {o_1, ..., o_n} defines the traversal order, where o_j is the identifier of the j-th streaming loop to be traversed.
Folding, or assigning multiple tasks to one thread. It is represented as F = {f_1, ..., f_n}, where n is the dimensionality of the loop space and f_i is the number of indices assigned to a thread along the i-th loop. When folding is not applied, GROPHECY assumes each thread computes one task and f_i = 1 for all i. The folding degree, F, is defined as the total number of tasks assigned to a thread, or the product of f_i over i = 1, ..., n. For the purpose of data reuse and coalescing, folding always assigns neighboring tasks to threads with adjacent thread indices [27]. Once applied, additional loop statements will be added so that a thread can iterate through assigned tasks. These additional loop statements are considered as streaming loops, and staging can be applied.
Caching strategy. The caching strategy categorizes data accesses into uncached accesses to global memory and cached accesses to shared memory. For shared memory, the caching strategy also describes which array segments are cached. We use bounded regular sections (BRS) [12], a derived form of regular section descriptors (RSD) [6, 4], to represent data accesses. A data access statement in the code skeleton can be represented as A(D, Θ, I). D is the array to be accessed. Θ = {θ_1, ..., θ_m}, where θ_j is the index to D's j-th dimension. Each θ can be a function involving I = {I_1, ..., I_n}, which are indices of the loops that contain this data access statement. For all data accesses in a code skeleton, a code layout partitions the data access statements into a set of uncached memory accesses and a set of cached memory accesses. The shape of D's region cached in shared memory during each stage of the k-th streaming loop is denoted ShMem(D_i, k); k = 0 corresponds to cached data for memory accesses outside any streaming loops. ShMem(D_i, k) is a footprint defined in Section 6.3 and can be obtained by Equation 5.
Loop unrolling. Loop unrolling reduces instructions due to loop overhead and is especially important for computation-bounded workloads. It can be expressed by L = {l_1, ..., l_n}, where l_i is the number of iterations to be unrolled for the i-th loop. According to our empirical studies of the NVCC compiler [29], GROPHECY applies loop unrolling to any inner-thread, branch-free loops whose number of iterations can be determined.
(1) In a code layout, the dimensionality of a modeled thread block is not restricted, since a high-dimensional loop space can be flattened and reduced to a lower-dimensional space.
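To make the layout-search idea concrete, here is a toy sketch; the block sizes, folding degrees, footprint formula, and hardware limits are all assumed for illustration, and GROPHECY's actual search and footprint equations are more elaborate:

```python
# Toy layout search for the MatMul skeleton: enumerate thread-block sizes
# and folding degrees, keep layouts whose shared-memory footprint fits, and
# pick one by a crude utilization proxy (tasks covered per block).
BLK_SIZES = [8, 16, 32]          # candidate square thread-block edges b
FOLDING = [1, 2, 4]              # candidate folding degrees f
SHMEM_LIMIT = 48 * 1024          # bytes of shared memory per block (assumed)
MAX_THREADS = 1024               # threads per block (assumed hardware limit)

layouts = []
for b in BLK_SIZES:
    for f in FOLDING:
        if b * b > MAX_THREADS:
            continue
        # MatMul caches one b x b tile of A per stage plus a B tile widened
        # by the folding degree (4-byte floats) - an assumed footprint model.
        footprint = (b * b + b * b * f) * 4
        if footprint <= SHMEM_LIMIT:
            layouts.append((b, f, footprint))

best = max(layouts, key=lambda t: t[0] * t[0] * t[1])
print("feasible layouts:", len(layouts), " chosen (b, f, footprint):", best)
```

The real framework scores each feasible layout with its performance model rather than a proxy, but the shape of the search (enumerate parameters, prune by hardware constraints, rank survivors) is the same.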
Application
to exascale
simulation
A performance database
• We aim to collect instrumentation data in a
central database to simplify model validation
• We plan to use the perfSONAR measurement
archive tool as a starting point
– REST API on top of Cassandra and Postgres
– Optimized for time series data
– Will extend as needed
– http://software.es.net/esmond/
Application to transfer optimization
[Diagram: a user submits (1) a transfer description to the Globus service; a performance predictor, backed by a parameter database, returns (2) a prediction; (3) observed transfer performance feeds a performance analyst and a model refiner, which update the parameters; (4) user feedback is collected by a user feedback agent]
Summary
• We focus on the science of modeling: integration
of first-principles and data-driven models; model
composition and evaluation
• Our challenge applications span a broad
spectrum of DOE resources and disciplines
• We see big opportunities for cooperation: e.g.,
on development and evaluation of component
models
www.ramsesproject.org [soon!]
Thanks, and for more information
• Thanks to our sponsors:
Advanced Scientific Computing Research
Program manager: Rich Carlson
• Thanks to my RAMSES project co-participants
• For more information, please see
https://sites.google.com/site/ramsesdoeproject/
ianfoster.org and @ianfoster