TASK-BASED DYNAMIC SCHEDULING APPROACH TO ACCELERATING NASA’S GEOS-5

Eric Kelmelis

April 5, 2016

Copyright © 2016 EM Photonics – EM Photonics Proprietary

OUTLINE

» TASK-BASED PROGRAMMING: Concepts and goals

» SORAD IMPLEMENTATION: Converting SORAD (GEOS-5) for the task paradigm

» RESULTS: Testing and benchmarking


MODERN HIGH-PERFORMANCE COMPUTING NODE

[Figure: two node diagrams, one with a single CPU and one with two CPUs and two GPUs]

High-performance computing nodes have evolved


MOTIVATION

» TAKE ADVANTAGE OF HYBRID SYSTEMS: NASA codes are running on increasingly varied, heterogeneous computer systems

» EFFICIENT RESOURCE UTILIZATION: Utilizing computational resources effectively is difficult as their properties become less regular

» TECHNICAL EXPERTISE: Scientists often lack the time or expertise to performance-tune their codes


PROPOSED SOLUTION: TASK-BASED PROGRAMMING

» A task is a pure function that receives inputs and delivers outputs according to a task graph

» Details of task invocation can be abstracted from the programmer

» Tasks can be developed, tested, and characterized as independent units

» Synchronization between tasks is defined purely by connections in the task graph


PRINCIPLES OF A TASK

» Managing data:

» Tasks cannot depend on mutable state other than explicit inputs

» Tasks cannot modify state other than explicit outputs

» These constraints allow us to easily draw a graph of dependencies between tasks
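A minimal C++ sketch of these constraints. The task name `scale_flux`, its signature, and the impure counter-example are illustrative assumptions, not SORAD code. The valid task reads only its explicit inputs and writes only its returned output, which is why it can be unit-tested with no surrounding setup.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical task: every input is an explicit parameter and the only
// effect is the returned output.
std::vector<double> scale_flux(const std::vector<double>& flux, double factor) {
    std::vector<double> out(flux.size());
    for (std::size_t i = 0; i < flux.size(); ++i) {
        out[i] = flux[i] * factor;   // reads only explicit inputs
    }
    return out;                      // writes only the explicit output
}

// Counter-example: NOT a valid task, because it reads and modifies hidden
// global state that does not appear as an edge in the task graph.
static double g_factor = 2.0;
void scale_flux_impure(std::vector<double>& flux) {
    for (double& f : flux) f *= g_factor;
    g_factor += 1.0;                 // hidden side effect
}

// Because the task is pure, a unit test needs nothing beyond its inputs.
void test_scale_flux() {
    const std::vector<double> in = {1.0, 2.0};
    const std::vector<double> out = scale_flux(in, 3.0);
    assert(out.size() == 2 && out[0] == 3.0 && out[1] == 6.0);
    assert(in[0] == 1.0 && in[1] == 2.0);  // inputs are untouched
}
```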


TASK GRAPHS

» Define the dependencies between tasks

» Allows creating a schedule for executing them

» Can be conceptualized as data flowing through a task graph

» Data transfer is easy to model by looking at dependencies

» Tasks can be divided among threads and devices for concurrent execution
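One way to turn dependencies into an execution order is a topological sort. The sketch below uses Kahn's algorithm, which is a standard technique chosen here for illustration and not necessarily what the scheduler in this talk does. The function name `schedule_order` and the integer task ids are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Returns one valid execution order for tasks 0..n-1, where each edge
// (a, b) means "task b consumes an output of task a".
// Returns an empty vector if the graph contains a cycle.
std::vector<int> schedule_order(int n, const std::vector<std::pair<int, int>>& edges) {
    assert(n >= 0);
    std::vector<std::vector<int>> consumers(n);
    std::vector<int> pending_inputs(n, 0);
    for (const auto& e : edges) {
        assert(e.first >= 0 && e.first < n && e.second >= 0 && e.second < n);
        consumers[e.first].push_back(e.second);
        ++pending_inputs[e.second];
    }

    std::queue<int> ready;               // tasks whose inputs are all available
    for (int t = 0; t < n; ++t) {
        if (pending_inputs[t] == 0) ready.push(t);
    }

    std::vector<int> order;
    while (!ready.empty()) {
        const int t = ready.front();
        ready.pop();
        order.push_back(t);
        for (int c : consumers[t]) {     // finishing t may unblock its consumers
            if (--pending_inputs[c] == 0) ready.push(c);
        }
    }
    if (order.size() != static_cast<std::size_t>(n)) order.clear();  // cycle
    return order;
}
```

Tasks that sit in the ready queue at the same time have no dependency on each other, so they are the candidates for concurrent execution on different threads or devices.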


TASK-GRAPH CONFIGURATION

Three things must be done to convert (part of) a program to a task-graph:

1. Write task definitions

2. Specify task-graph

3. Add calls in surrounding program for task-graph inputs and outputs


BUILDING A TASK-BASED PROGRAM


BUILDING MODULES

[Figure: module build flow. A task description and one kernel source per task (T1 to T4) pass through compilers to produce compiled kernels T1:K1, T1:K2, T1:K3, T2:K1, T2:K2, T3:K1, T4:K1, T4:K2. The Module Builder packages each task's kernels and constants into a module: Module 1 (T1:K1, T1:K2, T1:K3), Module 2 (T2:K1, T2:K2), Module 3 (T3:K1), Module 4 (T4:K1, T4:K2). Legend: Programmer Defined, Existing Tools, New Tools.]

Modules implement a task

COMPUTATIONAL FUNCTIONS FOR MODULES

// Call CUDA kernel
dim3 dimGrid( NX/TILE_WIDTH, NZ/TILE_WIDTH );
dim3 dimBlock( TILE_WIDTH, TILE_WIDTH );
primitive_variables_ker <<<dimGrid, dimBlock>>>
    (p_cons, p_rho, p_vx, p_vy, p_vz,
     p_p, p_bx, p_by, p_bz, p_psic);

// Call Intel Xeon Phi code
primitive_variables_mic (p_cons, p_rho, p_vx, p_vy,
                         p_vz, p_p, p_bx, p_by, p_bz, p_psic);

// Call CPU code
primitive_variables_cpu (p_cons, p_rho, p_vx, p_vy,
                         p_vz, p_p, p_bx, p_by, p_bz, p_psic);

Develop computational functionality for each targeted processing device

Module: UpdateVariables()

• CPU Kernel

• CUDA Kernel

• OpenCL Kernel

• Xeon Phi Kernel

• [Future architectures]
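A module that bundles one implementation per device could be sketched as a lookup table from device to kernel. The `Device` enum, the `Module` struct, and `make_update_variables_module` are hypothetical names for illustration, and the CPU kernel body is a placeholder computation. Only a CPU kernel is registered so that the sketch stays plain C++.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

enum class Device { CPU, CUDA, OpenCL, XeonPhi };

// Hypothetical module: one task with an implementation per device.
struct Module {
    std::string task_name;
    std::map<Device, std::function<void(std::vector<double>&)>> kernels;

    // Runs the kernel for the first preferred device that this module
    // supports and returns the device that was used.
    Device run(const std::vector<Device>& preferred, std::vector<double>& data) const {
        for (Device d : preferred) {
            auto it = kernels.find(d);
            if (it != kernels.end()) {
                it->second(data);
                return d;
            }
        }
        assert(false && "module has no kernel for any preferred device");
        return Device::CPU;
    }
};

// Example module with only a CPU kernel registered. A CUDA or Xeon Phi
// kernel would be added to the map in the same way.
Module make_update_variables_module() {
    Module m;
    m.task_name = "UpdateVariables";
    m.kernels[Device::CPU] = [](std::vector<double>& v) {
        for (double& x : v) x *= 2.0;   // placeholder computation
    };
    return m;
}
```

With this layout, adding a future architecture means registering one more kernel. The code that calls `run` does not change.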


BUILDING AND EXECUTING GRAPHS

Graph Configuration

• Module 1 → Module 2

• Module 1 → Module 3

• Module 2, Module 3 → Module 4

[Figure: the Graph Builder combines the graph configuration with the packaged modules (Module 1 to Module 4, each with its kernels and constants) to produce a graph in which Module 1 feeds Module 2 and Module 3, which both feed Module 4. Legend: Programmer Defined, Existing Tools, New Tools.]

SCHEDULING TASKS

[Figure: packaged modules A to F, each described by a name, inputs, and outputs, are connected into a graph, and the graph is turned into a schedule]

STATIC VS. DYNAMIC SCHEDULING

Static Scheduling

» Define an execution schedule up front

» Advantages:

  » No run-time overhead

  » No supporting scheduler software required

» Disadvantages:

  » Can miss efficiencies

  » Very difficult as applications get large

Dynamic Scheduling

» Run-time task distribution

» Advantages:

  » No upfront work by the programmer

  » Often outperforms a human

» Disadvantages:

  » Adds overhead

DYNAMIC GRAPH RUNTIME PROCESSES

Resolve dependencies

Choose device

Allocate resources

Marshal data

Queue task

Invoke

Send outputs
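The runtime steps above can be illustrated with a small single-threaded simulation. This is a sketch under stated assumptions, not the prototype scheduler from this talk: the greedy "earliest estimated finish" device choice, the `Task` and `Placement` types, and the per-device cost estimates are all hypothetical. Resource allocation and data marshalling are left out, and transfer costs are ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

struct Task {
    std::vector<int> inputs;     // indices of the tasks this task depends on
    std::vector<double> cost;    // estimated run time on each device
};

struct Placement { int device; double start; double finish; };

// Greedy simulation: once a task's dependencies are resolved, place it on
// the device where it is estimated to finish earliest.
std::vector<Placement> simulate(const std::vector<Task>& tasks, std::size_t num_devices) {
    assert(num_devices > 0);
    const std::size_t n = tasks.size();
    std::vector<std::vector<int>> consumers(n);
    std::vector<int> pending(n, 0);
    for (std::size_t t = 0; t < n; ++t) {
        assert(tasks[t].cost.size() == num_devices);
        for (int in : tasks[t].inputs) {
            consumers[in].push_back(static_cast<int>(t));
            ++pending[t];
        }
    }
    std::vector<double> device_free(num_devices, 0.0);
    std::vector<Placement> placed(n, Placement{-1, 0.0, 0.0});
    std::queue<int> ready;
    for (std::size_t t = 0; t < n; ++t) {
        if (pending[t] == 0) ready.push(static_cast<int>(t));
    }

    while (!ready.empty()) {
        const int t = ready.front();
        ready.pop();
        // Resolve dependencies: inputs exist once every producer has finished.
        double inputs_ready = 0.0;
        for (int in : tasks[t].inputs) {
            inputs_ready = std::max(inputs_ready, placed[in].finish);
        }
        // Choose device: the one with the earliest estimated finish time.
        Placement best{-1, 0.0, 0.0};
        for (std::size_t d = 0; d < num_devices; ++d) {
            const double start = std::max(inputs_ready, device_free[d]);
            const double finish = start + tasks[t].cost[d];
            if (best.device < 0 || finish < best.finish) {
                best = Placement{static_cast<int>(d), start, finish};
            }
        }
        // Queue and invoke (simulated), then send outputs to consumers.
        device_free[best.device] = best.finish;
        placed[t] = best;
        for (int c : consumers[t]) {
            if (--pending[c] == 0) ready.push(c);
        }
    }
    return placed;
}
```

In a diamond-shaped graph where the second device is faster for every task, this policy still sends one of the two middle tasks to the slower device when the faster one is busy, which is the kind of hybrid use a fixed per-task device assignment would miss.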


BREAKING SORAD INTO TASKS


DETERMINING TASKS

» Identified key functional components

» Broke down and tested each component

» Built task versions

Components:

» Cloud cover

» Chemistry

» Clear tau

» Cloud tau

» Optics (clear and cloudy)

» Flux calculation

» Flux integration

» Absorption

» Flux absorption


SORAD GRAPH


CODE TO BUILD SORAD TASK GRAPH (SUBSET)


SORAD GRAPH – SCHEDULER GENERATED


DATA INPUT AND OUTPUT

» Source nodes – Data Input

» Can only have output connections

» Data provided directly by the end user

» Sink nodes – Data Output

» Output from the graph is retrieved in an asynchronous manner as it becomes available

» The user associates a callback function with the sink node which will be called once for each chunk of data that arrives at a sink node
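The callback pattern for sink nodes can be sketched as follows. `SourceNode`, `SinkNode`, and `Chunk` are hypothetical names, not the actual SORAD interface. For brevity the source is wired straight to the sink and delivery is synchronous, whereas the runtime described here passes data through tasks and delivers it asynchronously.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

using Chunk = std::vector<double>;

// Hypothetical sink node: the user registers a callback, which is called
// once for every chunk that arrives.
class SinkNode {
public:
    explicit SinkNode(std::function<void(const Chunk&)> on_chunk)
        : on_chunk_(std::move(on_chunk)) {
        assert(on_chunk_ && "a sink node needs a callback");
    }
    // Called when a chunk arrives at this sink.
    void deliver(const Chunk& chunk) const { on_chunk_(chunk); }

private:
    std::function<void(const Chunk&)> on_chunk_;
};

// Hypothetical source node: it has output connections only, and the
// application pushes data into it directly.
class SourceNode {
public:
    void connect(const SinkNode* sink) { outputs_.push_back(sink); }
    void push(const Chunk& chunk) const {
        for (const SinkNode* s : outputs_) s->deliver(chunk);
    }

private:
    std::vector<const SinkNode*> outputs_;
};
```

An application would construct a `SinkNode` with a lambda that stores or accumulates results, connect it, and then call `push` once per input chunk. The callback runs once for each chunk pushed.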

[Figure: the application pushes data into source nodes (Src) of the SORAD interface and receives results from its sink nodes (Sk)]

LINKING TASK SORAD TO EXTERNAL APPLICATION (INPUTS)


BUILDING SORAD TASKS AND MODULES


WRITING TASK INTERFACES

Specifies the relevant calling information


EXAMPLE TASK INTERFACE


DETERMINING TASKS TO ACCELERATE

» Certain tasks would benefit greatly from being run on a GPU or accelerator device (others would not)

» Rather than iterating over each band for one of these tasks, we can run each calculation simultaneously

  » Tasks are independent over columns and levels

  » Also independent over UV and IR bands

» Tasks we focused on for GPU implementations:

  » Clear optics

  » Cloudy optics

  » Flux

  » Flux integration

» These tasks represented the bulk of the runtime and are computationally intensive


INITIAL APPROACH: OPENACC

Pros

» Easy-to-use directives

» Similar approach to OpenMP

» Portable and potentially usable with single-source

Cons

» Little-to-no data management control within a kernel

» SORAD is in FORTRAN!

» Many features we needed are buggy or unsupported

» Ignores specified directives in favor of compiler "optimizations"

FINAL IMPLEMENTATION: CUDA

Pros

» More control over execution strategy and data management

» Mature technology which supports most common language features

» With the task-based approach, we are not locked into CUDA forever

Cons

» NVIDIA only

» Requires significantly more work

» In some cases, substantial rewrites are necessary to get good performance

NOTE ON PERFORMANCE TUNING

We performed limited performance tuning

» Selecting block sizes and number of blocks to fully use hardware without incurring extra overhead

» Ensuring data is accessed efficiently

» Ensuring temporary data uses registers and shared memory effectively

» Limiting register usage to ensure good occupancy

» Typically 64 or fewer registers per thread is desirable
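The link between register usage and occupancy can be made concrete with a little arithmetic. The sketch below assumes 65,536 registers and at most 2,048 resident threads per streaming multiprocessor. Those figures hold for several NVIDIA GPU generations but are assumptions here, since the talk does not name its hardware. The function name is hypothetical, and the calculation ignores register allocation granularity, shared memory, and per-block limits.

```cpp
#include <algorithm>
#include <cassert>

// Upper bound on occupancy imposed by register usage alone.
double register_limited_occupancy(int regs_per_thread,
                                  int regs_per_sm = 65536,        // assumed
                                  int max_threads_per_sm = 2048)  // assumed
{
    assert(regs_per_thread > 0 && regs_per_sm > 0 && max_threads_per_sm > 0);
    // How many threads fit in the register file.
    const int threads_by_regs = regs_per_sm / regs_per_thread;
    // Resident threads cannot exceed the hardware thread limit.
    const int resident = std::min(threads_by_regs, max_threads_per_sm);
    return static_cast<double>(resident) / max_threads_per_sm;
}
```

Under these assumptions, 32 registers per thread allows full occupancy, 64 registers caps it at 50%, and 128 registers caps it at 25%.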


SCHEDULED SORAD: COMPARISON AND BENCHMARKING


STATICALLY SCHEDULED SORAD

» Statically scheduled implementation in pure Fortran

» Benefits of the task-based approach without additional software

» Breaks the problem into chunks along columns

» Overlaps the processing of several chunks

» Memory for the entire run can be allocated up front

» Limiting parallelism to a few parallel runs

» CPU performance is somewhat lower when using the PGI compiler compared to Intel
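Chunking along columns can be sketched as below. The actual implementation is in Fortran, so this C++ version is an illustration only, and `ColumnChunk`, `split_columns`, and the chunk size parameter are hypothetical. Because the tasks are independent over columns, each chunk can move through the task graph on its own, which is what makes overlapping several chunks possible.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct ColumnChunk {
    int first;   // index of the first column in the chunk
    int count;   // number of columns in the chunk
};

// Split num_columns columns into chunks of at most chunk_size columns.
// The last chunk is shorter when chunk_size does not divide num_columns.
std::vector<ColumnChunk> split_columns(int num_columns, int chunk_size) {
    assert(num_columns >= 0 && chunk_size > 0);
    std::vector<ColumnChunk> chunks;
    for (int first = 0; first < num_columns; first += chunk_size) {
        const int count = std::min(chunk_size, num_columns - first);
        chunks.push_back(ColumnChunk{first, count});
    }
    return chunks;
}
```

With a fixed chunk size and a fixed number of chunks in flight, the memory needed for the whole run is known before execution starts.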


DYNAMICALLY SCHEDULED SORAD

Tested with the first prototype of our dynamic task scheduler

» Tasks are registered to the scheduler and connected into a graph

» Data is pushed into and received out of the graph asynchronously

» The scheduler can manage transfers between CPU and GPU automatically

» Allows tasks to be mixed and matched without additional effort

» The results show considerable task-level parallelism

» Scheduler still has a lot of room for optimization


TASK-BASED SORAD VERSION

Task-Based SORAD

» SORAD broken down into component tasks and connected in a graph

» Advantages: Easier to maintain, test, and extend; can be scheduled

Statically Scheduled

» An upfront static schedule can be built to execute the SORAD task graph

» Advantages: Better performance than regular SORAD; better portability

Dynamically Scheduled

» Could be scheduled with any framework (e.g. Intel’s TBB)

» Our prototype scheduler showed hybrid operation and significant parallelism

ORIGINAL SORAD

Quantity                      Max percent error   Average percent error
All flux                      0.000%              0.000%
Upward flux                   0.000%              0.000%
Clear flux                    0.000%              0.000%
Clear upward flux             0.000%              0.000%
Surface flux (direct UV)      0.000%              0.000%
Surface flux (diffuse UV)     0.000%              0.000%
Surface flux (direct PAR)     0.000%              0.000%
Surface flux (diffuse PAR)    0.000%              0.000%
Surface flux (direct IR)      0.000%              0.000%
Surface flux (diffuse IR)     0.000%              0.000%
Surface flux (per-band)       0.000%              0.000%

Execution Time – 5.826 seconds

STATICALLY SCHEDULED TASK-BASED SORAD (CPU-ONLY)

Quantity                      Max percent error   Average percent error
All flux                      0.000%              0.000%
Upward flux                   0.002%              0.000%
Clear flux                    0.000%              0.000%
Clear upward flux             0.000%              0.000%
Surface flux (direct UV)      0.000%              0.000%
Surface flux (diffuse UV)     0.000%              0.000%
Surface flux (direct PAR)     0.000%              0.000%
Surface flux (diffuse PAR)    0.001%              0.000%
Surface flux (direct IR)      0.000%              0.000%
Surface flux (diffuse IR)     0.001%              0.000%
Surface flux (per-band)       0.000%              0.000%

Execution Time – 3.373 seconds

DYNAMICALLY SCHEDULED TASK-BASED SORAD (CPU-ONLY)


Execution Time – 5.788 seconds


STATICALLY SCHEDULED TASK-BASED SORAD (HYBRID)

Quantity                      Max percent error   Average percent error
All flux                      1.102%              0.002%
Upward flux                   0.026%              0.002%
Clear flux                    0.147%              0.001%
Clear upward flux             0.013%              0.001%
Surface flux (direct UV)      0.000%              0.000%
Surface flux (diffuse UV)     0.191%              0.005%
Surface flux (direct PAR)     0.003%              0.000%
Surface flux (diffuse PAR)    0.188%              0.004%
Surface flux (direct IR)      0.016%              0.000%
Surface flux (diffuse IR)     0.188%              0.005%
Surface flux (per-band)       0.191%              0.002%

Execution Time – 2.212 seconds

DYNAMICALLY SCHEDULED TASK-BASED SORAD (HYBRID)


Execution Time – 4.7 seconds


DYNAMICALLY SCHEDULED TASK-BASED SORAD (HYBRID) - MAGNIFIED


Execution Time – 4.7 seconds


PERFORMANCE COMPARISON

Dynamically scheduled versions took about 1 second to handle data allocation

[Figure: bar chart "Execution Times" (axis from 0 to 7) with Execution and Allocation series for Original, Static CPU-only, Dynamic CPU-only, Static Hybrid, and Dynamic Hybrid]
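The reported execution times can be turned into speedups over the original SORAD run. The helper below is a trivial sketch. The function and constant names are hypothetical, and the times are the ones quoted on the result slides.

```cpp
#include <cassert>

// Speedup of a variant relative to the original run (larger is better).
double speedup(double original_seconds, double variant_seconds) {
    assert(original_seconds > 0.0 && variant_seconds > 0.0);
    return original_seconds / variant_seconds;
}

// Execution times reported on the result slides, in seconds.
const double kOriginal       = 5.826;
const double kStaticCpuOnly  = 3.373;
const double kDynamicCpuOnly = 5.788;
const double kStaticHybrid   = 2.212;
const double kDynamicHybrid  = 4.7;
```

Calling `speedup(kOriginal, ...)` on each variant gives roughly 1.73x for static CPU-only, 2.63x for static hybrid, 1.01x for dynamic CPU-only, and 1.24x for dynamic hybrid.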

CONCLUSION

Built Task-Based SORAD

» Analyzed SORAD and decomposed it into a set of tasks and a related DAG

» Code is now streamlined, more consistent, and easier to maintain and extend

Static Scheduling

» Much higher performance than the original: 3.373 seconds CPU-only and 2.212 seconds hybrid, versus 5.826 seconds

» SORAD is small enough that a near-optimal schedule can be determined up front

Dynamic Scheduling

» Options: EMP Prototype Scheduler (hybrid system), TBB (single CPU only), etc.

» Faster than the original (5.788 seconds CPU-only, 4.7 seconds hybrid) but more useful on larger applications