RAMSES: Robust Analytic Models for Science at Extreme Scales
Gagan Agrawal1* Prasanna Balaprakash2 Ian Foster2* Raj Kettimuthu2
Sven Leyffer2 Vitali Morozov2 Todd Munson2 Nagi Rao3*
Saday Sadayappan1 Brad Settlemyer3 Brian Tierney4* Don Towsley5*
Venkat Vishwanath2 Yao Zhang2
1 Ohio State University 2 Argonne National Laboratory 3 Oak Ridge National Laboratory 4 ESnet 5 UMass Amherst (* Co-PIs)
Advanced Scientific Computing Research
Program manager: Rich Carlson
[Diagram: source data store, wide-area network, destination data store]
Prediction, explanation, & optimization are
challenging for even “simple” E2E workflows
For example, file transfer, for which we want to:
• Predict achievable throughput for a specific configuration
• Explain factors influencing performance
• Optimize parameter values to achieve high speeds
[Diagram: end-to-end transfer path. Source data transfer node: application, OS, FS stack, TCP/IP, NIC, HBA/HCA; LAN switch and router; wide-area network; then router, LAN switch, and the destination data transfer node with the same stack, backed by a storage array and a Lustre file system (OSS and MDS servers, OST and MDT targets)]
Prediction, explanation, & optimization are
challenging for even “simple” E2E workflows
+ diverse environments + diverse workloads + contention
85 Gbps sustained disk-to-disk over 100
Gbps network, Ottawa—New Orleans
Raj Kettimuthu
and team,
Argonne
High-speed transfers to/from AWS cloud,
via Globus transfer service
• UChicago → AWS S3 (US region): Sustained 2 Gbps
– 2 GridFTP servers, GPFS file system at UChicago
– Multi-part upload via 16 concurrent HTTP connections
• AWS → AWS (same region): Sustained 5 Gbps
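The multi-part idea above can be sketched in a few lines; this is a hypothetical illustration using a thread pool, not the actual Globus or AWS client code (the chunk size and the `upload_part` stub are invented for the example).

```python
# Hypothetical sketch: split an object into parts and "upload" them over
# several concurrent connections, as in the 16-connection multi-part setup
# above. upload_part is a stub; a real client would issue an HTTP PUT per part.
from concurrent.futures import ThreadPoolExecutor

CHUNK = 8 * 1024 * 1024            # 8 MiB per part (assumed value)
CONNECTIONS = 16                   # concurrent connections, as in the slide

def upload_part(part):
    index, data = part
    # Stand-in for the network transfer of one part.
    return index, len(data)

data = bytes(20 * 1024 * 1024)     # a 20 MiB object to move
parts = [(i, data[off:off + CHUNK])
         for i, off in enumerate(range(0, len(data), CHUNK))]

with ThreadPoolExecutor(max_workers=CONNECTIONS) as pool:
    uploaded = dict(pool.map(upload_part, parts))

print(f"uploaded {len(uploaded)} parts, {sum(uploaded.values())} bytes")
```

Concurrency pays off here because each individual HTTP connection is window- or latency-limited; many parts in flight keep the aggregate pipe full.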
[Screenshot: Globus endpoint go#s3]
Endpoint aps#clutch has transfers to 125 other endpoints
[Chart: one Advanced Photon Source data node, aps#clutch, with transfers to 125 destinations; one point is a transfer back to the same node over its 1 Gbps link]
How to create more accurate, useful, and
portable models of such systems?
Simple analytical model:
  T = α + β·l   [startup cost + sustained-bandwidth term, for transfer size l]
Experiment + regression
to estimate α, β
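The experiment-plus-regression step can be sketched in a few lines; the timings below are invented for illustration, not measured values.

```python
# Least-squares fit of the simple model T = alpha + beta*l, where alpha is
# the startup cost and beta the reciprocal of the sustained bandwidth.
import numpy as np

l = np.array([1e8, 5e8, 1e9, 5e9, 1e10])       # transfer sizes (bytes)
T = np.array([2.1, 5.8, 10.4, 47.0, 93.5])     # observed times (s), illustrative

A = np.column_stack([np.ones_like(l), l])      # design matrix [1, l]
(alpha, beta), *_ = np.linalg.lstsq(A, T, rcond=None)

print(f"alpha = {alpha:.2f} s, sustained bandwidth = {1/beta/1e9:.2f} GB/s")
T_pred = alpha + beta * 2e9                    # predicted time for a 2 GB transfer
```

Once alpha and beta are fitted, the same two numbers serve all three goals on the previous slide: prediction (evaluate the formula), explanation (startup-dominated vs. bandwidth-dominated), and optimization (shrink alpha by batching small files, shrink beta by adding streams).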
[Diagram: first-principles modeling to better capture details of system & application components, and data-driven modeling to learn unknown details of system & application components, linked by model composition and model-data comparison]
The RAMSES vision
To develop a new science of end-to-end
analytical performance modeling that will
transform understanding of the behavior of
science workflows in extreme-scale science
environments.
Based on integration of first-principles and
data-driven modeling, and a structured
approach to model evaluation & composition
The RAMSES research agenda & platform
Modeling: Develop, evaluate, and refine component and end-to-end models
Estimation: Develop and apply data-driven estimation methods: differential regression, surrogate models, etc.
Tools: Develop easy-to-use tools to provide end-users with actionable advice
Experiments: Extensive, automated experiments to test models & build database
[Diagram: platform components: Advisor, Evaluators, Estimators, Tester, Database]
We are informed by five challenge workflows
Transfer: High-performance, end-to-end
file transfer
Scattering: Capture and analysis of
diffuse scattering experimental data
MapReduce: Data-intensive, distributed
data analytics
Exascale: Performance of exascale
application kernels on memory hierarchies
In-situ: Configuration and placement of in-
situ analysis computations
Predict: Throughput for configuration
Explain: Factors influencing performance
Optimize: Parameters for high speeds
[Diagram: the end-to-end transfer path again, from the source data transfer node across the wide-area network to the destination data transfer node, storage array, and Lustre file system]
Transfer: End-to-end file movement
Scattering: Linking simulation and
experiment to study disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
[Diagram: sample → experimental scattering; material composition (La 60%, Sr 40%) → simulated structure → simulated scattering. A knowledge base of past experiments, simulations, literature, and expert knowledge supports knowledge-driven decision making and evolutionary optimization: detect errors (secs-mins), select experiments (mins-hours), run simulations driven by experiments (mins-days), and contribute results back to the knowledge base]
Immediate assessment of alignment quality in
near-field high-energy diffraction microscopy
[Diagram: a single near-field HEDM workflow spanning a Blue Gene/Q and the Orthros cluster, all data in NFS. The detector produces a dataset of 360 files, 4 GB total. Step 1: median calculation (MedianImage.c, 75 s, 90% I/O; uses Swift/K). Step 2: peak search (ImageProcessing.c, 15 s per file; uses Swift/K), yielding a reduced dataset of 360 files, 5 MB total. Step 3: convert files to network-endian format (2 min for all files) and generate parameters (FOP.c, 50 tasks at 25 s/task, about 1/4 CPU hour; uses Swift/K). Step 4: analysis pass (FitOrientation.c, 60 s/task on both PC and BG/Q, 1667 CPU hours; uses Swift/T, with Globus transfer and ssh moving data), feeding results back to the experiment. The Globus Catalog tracks scientific metadata and workflow progress; workflow control is a manual Bash script. Up to 2.2 M CPU hours per week. Before/after alignment images shown]
Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer
MapReduce: Distributing data and
computation for data analytics
[Diagram: a local cluster and a cloud environment, each with a master node assigning jobs to slave nodes that hold data and perform local reduction; an index supports remote data analysis, job assignment, and global reduction across the two environments]
Exascale simulation
HACC Cosmology
• Compute-intensive phase with
regular stride-one access
• Tree-walk phase: irregular
memory access with high
branching and integer ops
• Communication-intensive 3D FFT
phase
• I/O phase
Images courtesy Joseph Insley (Argonne)
Nek5000 CFD
• Matrix-vector product phase
• Conjugate gradient iteration
• Communication phase
involving nearest-neighbor
exchange and vector
reductions
[Diagram: in situ analysis infrastructure: compute resource (multi-petaflop, with high-radix Dragonfly or 5D torus interconnect); I/O nodes; InfiniBand switch complex; file server nodes and storage system (1536 GB/s); analysis nodes/cluster; and DTN nodes]
In situ analysis on the DOE Leadership
Computing Infrastructure
We need to perform the right computation at
the right place and time, taking into account
details of the simulation, resources, and analysis
A diverse set of components
[Table: workflows vs. the components they exercise. Components: server; parallel computer; router; storage system; LAN; WAN; TCP/UDT; GridFTP; file systems; GridFTP server; the Nekbone and HACC kernels; checksum; encryption; MapReduce; other apps. Each workflow row marks the components it touches: Transfer (11 components), Scattering (8), Exascale (6), Distributed MapReduce (9), In-Situ (8)]
Modeling: Develop, evaluate, and refine component and end-to-end models
• Models from the literature
• Fluid models for network flows
• SKOPE modeling system
Estimation: Develop and apply data-driven estimation methods
• Differential regression
• Surrogate models
• Other methods from the literature
Tools: Develop easy-to-use tools to provide end-users with actionable advice
• Runtime advisor, integrated with Globus transfer system
Experiments: Automated experiments to test models and build database
• Experiment design
• Testbeds
SKOPE performance modeling framework
[Diagram: the SKOPE front end parses code skeletons, written in the SKOPE language with workload inputs, into a per-function intermediate representation (block skeleton trees). A behavior modeling engine produces an execution-based intermediate representation (a Bayesian execution tree), which a transformation engine rewrites into transformed Bayesian execution trees. The back end's characterization engine combines these with hardware models and system specifications to output performance projections, bottleneck analyses, schemas for suggested transformations, and synthesized characteristics. Deriving skeletons from source code requires user effort, semi-automated with a source-to-source translator; the rest is automatic]
Differential regression for combining
data from different sources
Example of use: predict performance on a connection length L
not realizable on physical infrastructure,
e.g., IB-RDMA or HTCP throughput on a 900-mile connection
1) Make multiple measurements of performance on path lengths d:
– M_S(d): OPNET simulation
– M_E(d): ANUE-emulated path
– M_U(d_i): real network (USN)
2) Compute measurement regressions on d: Ṁ_A(·), A ∈ {S, E, U}
3) Compute differential regressions: ∆Ṁ_A,B(·) = Ṁ_A(·) − Ṁ_B(·), A, B ∈ {S, E, U}
4) Apply differential regression to obtain estimates, C ∈ {S, E}:
  M̂_U(d) = M_C(d) − ∆Ṁ_C,U(d)
where M_C(d) is a simulated/emulated point measurement and M̂_U(d) the resulting regression estimate
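As a toy illustration of steps 1-4, with all numbers invented (the real measurements come from OPNET, ANUE, and USN):

```python
# Sketch of differential regression: combine dense simulated measurements
# M_S(d) with sparse real measurements M_U(d) to estimate real-network
# throughput at a distance never measured directly. All data are made up.
import numpy as np

def fit_poly(d, m, deg=2):
    """Regression M'(.) of measurements m over connection length d."""
    return np.poly1d(np.polyfit(d, m, deg))

# 1) Measurements vs. path length d (miles); illustrative values in Gbps.
d_sim = np.array([100, 300, 500, 700, 1000, 1500])
m_sim = 10.0 - 0.002 * d_sim                      # simulator output
d_real = np.array([100, 400, 800])                # only a few real paths
m_real = 9.2 - 0.0023 * d_real                    # real-network data

# 2) Regressions on d.
M_S = fit_poly(d_sim, m_sim)
M_U = fit_poly(d_real, m_real, deg=1)

# 3) Differential regression over lengths where both sources exist.
delta_SU = fit_poly(d_real, M_S(d_real) - M_U(d_real), deg=1)

# 4) Estimate real performance at a 900-mile path never measured directly.
d_star = 900
est = M_S(d_star) - delta_SU(d_star)
print(f"estimated real throughput at {d_star} miles: {est:.2f} Gbps")
```

The point of the differential term is that the simulator's systematic bias varies smoothly with d, so subtracting the fitted difference corrects the simulator's prediction at lengths where no real path exists.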
We will extend the differential regression
method in several areas
• To compare different component models
– E.g., different models of network elements, storage
systems, protocol implementations
• To compare different composite models
– E.g., different methods for combining memory and
CPU models
• To compare model outputs with measurements
[Diagram: each component i has a component model taking system parameters p_i and task-size parameters s_i and producing cost terms; a performance quality model Q_i(p_i, s_i) is a regression estimate of component performance, built from analytical and empirical models and refined by experiment design (active learning)]
End-to-end profile composition
[Diagram: a source LAN profile, a WAN profile, and a destination LAN profile are combined by composition operations; configurations for host and edge devices feed the LAN profiles, and configurations for WAN devices feed the WAN profile]
End-to-end model composition & analysis
• End-to-end model using composition
– It is an approximation, since component interactions are
not modeled by the composition operator
• Actual end-to-end performance model
– Component models are "corrected" to account for
unmodeled effects; this form is assumed to exist
Using end-to-end measurements and differential
regression to correct regression estimates
• Regression estimate Q̂(p, s) of the composed model:
– "Estimated", since component models are "incomplete"
as derived from first principles and/or measurements
• Error due to the regression estimate:
  e(p, s) = [ Q_p,s − Q̂(p, s) ]²
where Q_p,s is the actual end-to-end performance
• Error can be mitigated using measurements.
Corrected estimate of Q_p,s:
  Q̃(p, s) = Q̂(p, s) + ∆(p, s)
with Q̂ the analytical model and ∆(p, s) a correction obtained by
differential regression using measurements
Performance guarantees
• Vapnik-Chervonenkis theory: under finite VC-dim(F)
– Guarantees that the error of the regression estimate is close to
optimal with a certain probability
– Distribution-free: does not require detailed knowledge
of error distributions; uses end-to-end measurements
• Error of the corrected estimate:
  P{ I(∆̂, Q̂, p) − I(∆*, Q̂, p) > ε } < δ(F, l, ε)
where ∆̂ is the estimated correction, ∆* the optimal one, and
  I(∆, Q̂, p) = ∫ [ Q_p,s − Q̂(p, s) − ∆(p, s) ]² dP_Q_p,s
Surrogate modeling framework
to inform choice of experiments
[Diagram: machine learning & optimization propose informative configurations; first-principles models and evaluation return performance metrics]
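One way to read this loop, sketched below with invented data and a deliberately simple bootstrap-disagreement criterion (this is not the project's actual machinery; the transfer-parameter names and the stand-in `measure` function are assumptions):

```python
# Surrogate-guided experiment selection: fit a cheap ensemble of linear
# surrogates to configurations measured so far, then propose the untried
# configuration where the ensemble disagrees most (highest uncertainty).
import numpy as np

rng = np.random.default_rng(0)

def measure(conc, streams):
    """Stand-in for an expensive transfer experiment (synthetic response)."""
    return conc * 2.0 + streams * 0.5 + rng.normal(0, 0.1)

# Configurations measured so far: (concurrency, parallel streams).
tried = [(1, 1), (2, 4), (4, 2), (8, 8)]
perf = [measure(c, s) for c, s in tried]

candidates = [(c, s) for c in (1, 2, 4, 8, 16) for s in (1, 2, 4, 8)]

# Bootstrap ensemble of linear surrogates; prediction spread ~ uncertainty.
X = np.array(tried, float)
y = np.array(perf)
C = np.column_stack([np.ones(len(candidates)), np.array(candidates, float)])
preds = []
for _ in range(20):
    idx = rng.integers(0, len(X), len(X))       # resample measured configs
    A = np.column_stack([np.ones(len(idx)), X[idx]])
    coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
    preds.append(C @ coef)
spread = np.std(preds, axis=0)
next_config = candidates[int(np.argmax(spread))]
print("most informative next experiment:", next_config)
```

Each chosen experiment feeds back into `tried`, so the surrogate is refined exactly where the first-principles models are least trusted.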
Fluid models of network flows
GridFTP flow i: parallelism k_i, round-trip time R_i, throughput T_i
Bottleneck router: capacity C, loss rate p
Queue dynamics at the bottleneck router:
  dQ/dt = 1_{Q > 0} [ Σ_j T_j − C ]
Throughput dynamics of flow i:
  dT_i/dt = k_i / R_i² − T_i(t) T_i(t − R_i) p(t − R_i) / (2 k_i)
Special case, known p: steady-state throughput T_i = (k_i / R_i) √(2/p)
Solve for throughputs and transfer delays
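A minimal numerical check of the throughput ODE above, ignoring the feedback delay (taking T(t − R) ≈ T(t)) and assuming a fixed, known loss rate p; the parameters are illustrative:

```python
# Euler integration of dT/dt = k/R^2 - T(t)^2 p / (2k) for one flow.
# The trajectory should approach the closed form T = (k/R) * sqrt(2/p).
import math

k, R, p = 4, 0.05, 1e-4        # parallel streams, RTT (s), loss rate
T, dt = 0.0, 0.001             # throughput state (segments/s), time step

for _ in range(100_000):       # 100 s of simulated time
    dT = k / R**2 - T * T * p / (2 * k)
    T += dt * dT

steady = (k / R) * math.sqrt(2 / p)
print(f"simulated {T:.0f} vs closed form {steady:.0f} segments/s")
```

The closed form also shows why parallelism helps: throughput scales linearly in k for a given loss rate, which is the analytic handle the optimizer needs when choosing GridFTP stream counts.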
Our multi-modal approach
[Diagram: application behavior models (source code → code skeletons, SKOPE language, workload parameters → SKOPE) and system models of current or future systems (experiments, historical logs, benchmarks, simulators, emulators) feed analytical models, regression models, and model composition to produce performance projections]
Application to file transfer
[Diagram: the same multi-modal pipeline specialized to file transfer: application behavior models from GridFTP source code and code skeletons; system models of storage, TCP, and WAN built from experiments, historical logs, iperf, XDD, and emulators; output is file transfer performance projections]
[Diagram: the multi-modal pipeline specialized to exascale simulation: application behavior models from source code, code skeletons, the SKOPE language, and workload parameters; system models of compute, memory, and interconnect built from experiments, historical logs, MPI benchmarks, IOR, DGEMM, and Stream; output is exascale simulation performance projections]
As a pedagogical example, the code skeleton for dense matrix multiplication (denoted MatMul) is shown in Listing 2. The corresponding CPU code is shown in Listing 1 in C. The syntax of a code skeleton is not the focus of this paper; it is briefly introduced in the comments of the example code skeletons and is not discussed in further detail.
Listing 1: MatMul's CPU code
 1  float A[N][K], B[K][M];
 2  float C[N][M];
 3  int i, j, k;
 4  for (i = 0; i < N; ++i) {
 5    for (j = 0; j < M; ++j) {
 6      float sum = 0;
 7      for (k = 0; k < K; ++k) {
 8        sum += A[i][k] * B[k][j];
 9      }
10      C[i][j] = sum;
11    }
12  }
Listing 2: MatMul's code skeleton
 1  float A[N][K]
 2  float B[K][M]
 3  float C[N][M]
 4  /* the loop space */
 5  parallel_for (N, M)
 6    : i, j
 7  {
 8    /* computation w/
 9     * instruction count
10     */
11    comp 1
12    /* streaming loop */
13    stream k = 0:K {
14      /* load */
15      ld A[i][k]
16      ld B[k][j]
17      comp 3
18    }
19    comp 5
20    /* store */
21    st C[i][j]
22  }
Listing 3: MatMul's optimized GPU code
 1  float A[N][K], B[K][M], C[N][M];
 2  dim3 block(BlkSize, BlkSize);
 3  dim3 grid(N/BlkSize, M/BlkSize);
 4  MatrixMul<<<grid, block>>>(A, B, C);
 5
 6  __global__ MatrixMul(A, B, C)
 7  {
 8    __shared__ a[BlkSize][BlkSize];
 9    __shared__ b[BlkSize][BlkSize];
10    int ty = threadIdx.y;
11    int tx = threadIdx.x;
12    int y = blockIdx.y * blockDim.y + ty;
13    int x = blockIdx.x * blockDim.x + tx;
14    float sum = 0.f;
15    for (int n = 0; n < K; n += BlkSize) {
16      a[ty][tx] = A[y][n+tx];
17      b[ty][tx] = B[n+ty][x];
18      __syncthreads();
19      for (int k = 0; k < BlkSize; ++k) {
20        sum += a[ty][k] * b[k][tx];
21      }
22      __syncthreads();
23    }
24    C[y][x] = sum;
25  }
The following information forms a code skeleton that expresses a computational kernel.
Data parallelism is expressed as a set of parallel, homogeneous tasks repeated over different data elements. Users should express data parallelism at its finest granularity (i.e., down to the innermost parallel for loops).
A task corresponds to one iteration of the innermost parallel for loop. It is expressed as a sequence of data accesses and computation.
Data accesses are expressed as a set of load and store operations. The accessed array elements are expressed given loop indices, array sizes, and other constants. Indirect data accesses can be expressed as well; GROPHECY will assume indirect accesses are random unless users provide further hints (see Section 9.4 and Listing 6).
Computation instructions are counted by using methods described in Section 7.3. Together with the number of memory instructions, they indicate the computational intensity of the kernel.
Branch instructions are counted to judge the applicability of loop unrolling.
For loops wrap around blocks of computation and data accesses to mark repetition within a task. They can be nested, and the nesting does not have to be perfect.
Streaming loops are a special type of for loop; they are marked where a sequence of data elements is fetched and processed and can be discarded immediately. This is a common pattern for reduction. Streaming loops can be temporally decomposed into stages for the purpose of caching. Line 7 in Listing 1 is an example of a streaming loop.
Macros define array sizes and the numbers of loop iterations. By adjusting the macros, the same code skeleton can be used for workloads at different scales.
Once constructed, the code skeleton can then be transformed to mimic GPU optimizations. Note that the mimicked GPU implementation can differ significantly from the original CPU code. As an example, Listing 3 shows the GPU kernel of MatMul, where for loops are not only spatially decomposed among threads but also temporally decomposed into stages for the purpose of caching. Both transformations are common and critical in manual GPU optimization.
6. Code Transformations
Given the code skeleton, GROPHECY transforms and lays out code for a target GPU (recall Figure 1, Step 2). This section describes how code layouts are represented (Section 6.1), how the space of possible layouts is searched (Section 6.2), and additional representations and metrics needed to carry out this search (Sections 6.3-6.7).
6.1 Code Layout Parameterization
Code transformation involves the following factors, whose values jointly define a particular code layout.
Thread block sizes, represented as B = {b_1, ..., b_n}, where n is the dimensionality of the loop space and b_i is the length of the thread block in the i-th dimension; size(B) denotes the number of threads in a thread block. We vary the thread block size given the loop space and the hardware constraint on the number of threads per block.(1)
Staging, or temporally decomposing streaming loops into sequential stages of iterations. Within one stage, a thread block only needs to cache the portion of data elements used in that stage. Staging can be expressed as two integer vectors. For a code skeleton with n streaming loops, S = {s_1, ..., s_n} contains s_i, which defines the staging size, or the number of iterations in one stage, for the i-th streaming loop. Moreover, some consecutive streaming loops actually form a multidimensional streaming loop, whose traversal orders are interchangeable with regard to outer loops and inner loops. Different traversal orders may result in different performance as a result of data locality and caching. Therefore, O = {o_1, ..., o_n} defines the traversal order, where o_j is the identifier of the j-th streaming loop to be traversed.
Folding, or assigning multiple tasks to one thread. It is represented as F = {f_1, ..., f_n}, where n is the dimensionality of the loop space and f_i is the number of indices assigned to a thread along the i-th loop. When folding is not applied, GROPHECY assumes each thread computes one task and f_i = 1 for all i. The folding degree, F, is defined as the total number of tasks assigned to a thread, or the product of f_i over i = 1, ..., n. For the purpose of data reuse and coalescing, folding always assigns neighboring tasks to threads with adjacent thread indices [27]. Once applied, additional loop statements will be added so that a thread can iterate through assigned tasks. These additional loop statements are considered as streaming loops, and staging can be applied.
Caching strategy. The caching strategy categorizes data accesses into uncached accesses to global memory and cached accesses to shared memory. For shared memory, the caching strategy also describes which array segments are cached. We use bounded regular sections (BRS) [12], a derived form of regular section descriptors (RSD) [6, 4], to represent data accesses. A data access statement in the code skeleton can be represented as A(D, Θ, I). D is the array to be accessed. Θ = {θ_1, ..., θ_m}, where θ_j is the index to D's j-th dimension. Each θ can be a function involving I = {I_1, ..., I_n}, which are indices of the loops that contain this data access statement. For all data accesses in a code skeleton, a code layout partitions the data access statements into a set of uncached memory accesses and a set of cached memory accesses. The shape of D's region cached in shared memory during each stage of the k-th streaming loop is denoted ShMem(D_i, k); k = 0 corresponds to cached data for memory accesses outside any streaming loops. ShMem(D_i, k) is a footprint defined in Section 6.3 and can be obtained by Equation 5.
Loop unrolling. Loop unrolling reduces instructions due to loop overhead and is especially important for computation-bounded workloads. It can be expressed by L = {l_1, ..., l_n}, where l_i is the number of iterations to be unrolled for the i-th loop. According to our empirical studies of the NVCC compiler [29], GROPHECY applies loop unrolling to any inner-thread, branch-free loops whose number of iterations can be determined.
(1) In a code layout, the dimensionality of a modeled thread block is not restricted, since a high-dimensional loop space can be flattened and reduced to a lower-dimensional space.
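To make the layout-search idea concrete, here is a toy sketch; the block sizes, folding degrees, footprint formula, and hardware limits are all assumed for illustration, and GROPHECY's actual search and footprint equations are more elaborate:

```python
# Toy layout search for the MatMul skeleton: enumerate thread-block sizes
# and folding degrees, keep layouts whose shared-memory footprint fits, and
# pick one by a crude utilization proxy (tasks covered per block).
BLK_SIZES = [8, 16, 32]          # candidate square thread-block edges b
FOLDING = [1, 2, 4]              # candidate folding degrees f
SHMEM_LIMIT = 48 * 1024          # bytes of shared memory per block (assumed)
MAX_THREADS = 1024               # threads per block (assumed hardware limit)

layouts = []
for b in BLK_SIZES:
    for f in FOLDING:
        if b * b > MAX_THREADS:
            continue
        # MatMul caches one b x b tile of A per stage plus a B tile widened
        # by the folding degree (4-byte floats) - an assumed footprint model.
        footprint = (b * b + b * b * f) * 4
        if footprint <= SHMEM_LIMIT:
            layouts.append((b, f, footprint))

best = max(layouts, key=lambda t: t[0] * t[0] * t[1])
print("feasible layouts:", len(layouts), " chosen (b, f, footprint):", best)
```

The real framework scores each feasible layout with its performance model rather than a proxy, but the shape of the search (enumerate parameters, prune by hardware constraints, rank survivors) is the same.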
Application
to exascale
simulation
A performance database
• We aim to collect instrumentation data in a
central database to simplify model validation
• We plan to use the perfSONAR measurement
archive tool as a starting point
– REST API on top of Cassandra and Postgres
– Optimized for time series data
– Will extend as needed
– http://software.es.net/esmond/
Application to transfer optimization
[Diagram: a user submits (1) a transfer description to the Globus service; a performance predictor, backed by a parameter database, returns (2) a prediction; (3) observed transfer performance feeds a performance analyst and a model refiner, which update the parameters; (4) user feedback is collected by a user feedback agent]
Summary
• We focus on the science of modeling: integration
of first-principles and data-driven models; model
composition and evaluation
• Our challenge applications span a broad
spectrum of DOE resources and disciplines
• We see big opportunities for cooperation: e.g.,
on development and evaluation of component
models
www.ramsesproject.org [soon!]
Thanks, and for more information
• Thanks to our sponsors:
Advanced Scientific Computing Research
Program manager: Rich Carlson
• Thanks to my RAMSES project co-participants
• For more information, please see
https://sites.google.com/site/ramsesdoeproject/
ianfoster.org and @ianfoster