TASK-BASED DYNAMIC SCHEDULING APPROACH TO ACCELERATING NASA’S GEOS-5

Eric Kelmelis

April 5, 2016



Copyright © 2016 EM Photonics – EM Photonics Proprietary

OUTLINE

» TASK-BASED PROGRAMMING: Concepts and goals

» SORAD IMPLEMENTATION: Converting SORAD (GEOS-5) for task paradigm

» RESULTS: Testing and benchmarking


MODERN HIGH-PERFORMANCE COMPUTING NODE

High-performance computing nodes have evolved

[Diagram: a node with a single CPU alongside a node with two CPUs and two GPUs]


MOTIVATION

» TAKE ADVANTAGE OF HYBRID SYSTEMS: NASA codes are running on increasingly varied, heterogeneous computer systems

» EFFICIENT RESOURCE UTILIZATION: Utilizing computational resources effectively is difficult as their properties become less regular

» TECHNICAL EXPERTISE: Scientists often lack the time or expertise to performance-tune their codes


PROPOSED SOLUTION: TASK-BASED PROGRAMMING

» A task is a pure function that receives inputs and delivers outputs according to a task graph

» Details of task invocation can be abstracted from the programmer

» Tasks can be developed, tested, and characterized as independent units

» Synchronization between tasks is defined purely by connections in the task graph


PRINCIPLES OF A TASK

» Managing data:

» Tasks cannot depend on mutable state other than explicit inputs

» Tasks cannot modify state other than explicit outputs

» These constraints allow us to easily draw a graph of dependencies between tasks
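
A minimal C++ sketch of these constraints. The task `scale_and_offset` and its input and output structs are hypothetical and are not part of SORAD; they only illustrate that a task reads its explicit inputs and writes its explicit outputs, nothing else.

```cpp
#include <cassert>
#include <vector>

// Hypothetical explicit inputs of an illustrative task.
struct ScaleInputs {
    std::vector<double> values;
    double scale;
    double offset;
};

// Hypothetical explicit outputs of the same task.
struct ScaleOutputs {
    std::vector<double> values;
};

// A task in the sense used here: it reads only its explicit inputs
// (taken by const reference) and writes only its explicit outputs
// (returned by value). No globals, no statics, no hidden state.
ScaleOutputs scale_and_offset(const ScaleInputs& in) {
    ScaleOutputs out;
    out.values.reserve(in.values.size());
    for (double v : in.values) {
        out.values.push_back(v * in.scale + in.offset);
    }
    return out;
}
```

Because the function touches nothing outside its arguments and return value, calling it twice with the same inputs gives the same outputs, which is what lets a scheduler reorder or relocate it freely.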


TASK GRAPHS

» Define the dependencies between tasks

» Allows creating a schedule for executing them

» Can be conceptualized as data flowing through a task graph

» Data transfer is easy to model by looking at dependencies

» Tasks can be divided among threads and devices for concurrent execution
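
One common way to turn dependencies into an execution order is a topological sort. The sketch below uses Kahn's algorithm; it is an illustration of the idea, not the scheduler described in this talk, and the function name `build_schedule` and the integer task numbering are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Builds one valid execution order for a task graph using Kahn's algorithm.
// `edges` holds (producer, consumer) pairs over tasks numbered 0..n-1.
// Returns an empty vector if the graph contains a cycle.
std::vector<int> build_schedule(int n, const std::vector<std::pair<int, int>>& edges) {
    std::vector<std::vector<int>> consumers(n);
    std::vector<int> pending_inputs(n, 0);
    for (const auto& e : edges) {
        assert(e.first >= 0 && e.first < n && e.second >= 0 && e.second < n);
        consumers[e.first].push_back(e.second);
        ++pending_inputs[e.second];
    }

    std::queue<int> ready;  // tasks whose inputs are all available
    for (int t = 0; t < n; ++t) {
        if (pending_inputs[t] == 0) ready.push(t);
    }

    std::vector<int> order;
    while (!ready.empty()) {
        int t = ready.front();
        ready.pop();
        order.push_back(t);
        for (int c : consumers[t]) {
            if (--pending_inputs[c] == 0) ready.push(c);
        }
    }
    if (order.size() != static_cast<std::size_t>(n)) order.clear();  // cycle
    return order;
}
```

Tasks that sit in the `ready` queue at the same time have no dependency on each other, so they are the candidates for concurrent execution on different threads or devices.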


TASK-GRAPH CONFIGURATION

Three things must be done to convert (part of) a program to a task-graph:

1. Write task definitions

2. Specify task-graph

3. Add calls in surrounding program for task-graph inputs and outputs


BUILDING A TASK-BASED PROGRAM


BUILDING MODULES

Modules implement a task

[Diagram: task descriptions and per-task kernels pass through compilers, producing compiled kernels T1:K1, T1:K2, T1:K3, T2:K1, T2:K2, T3:K1, T4:K1, T4:K2. The Module Builder groups them, together with constants, into one module per task.]

» Module 1: T1:K1, T1:K2, T1:K3

» Module 2: T2:K1, T2:K2

» Module 3: T3:K1

» Module 4: T4:K1, T4:K2

[Legend: Programmer Defined, Existing Tools, New Tools]


COMPUTATIONAL FUNCTIONS FOR MODULES

Develop computational functionality for each targeted processing device

    // Call CUDA kernel
    dim3 dimGrid( NX/TILE_WIDTH, NZ/TILE_WIDTH );
    dim3 dimBlock( TILE_WIDTH, TILE_WIDTH );
    primitive_variables_ker <<<dimGrid, dimBlock>>>
        (p_cons, p_rho, p_vx, p_vy, p_vz,
         p_p, p_bx, p_by, p_bz, p_psic);

    // Call Intel Xeon Phi code
    primitive_variables_mic (p_cons, p_rho, p_vx, p_vy,
        p_vz, p_p, p_bx, p_by, p_bz, p_psic);

    // Call CPU code
    primitive_variables_cpu (p_cons, p_rho, p_vx, p_vy,
        p_vz, p_p, p_bx, p_by, p_bz, p_psic);

Module: UpdateVariables()

• CPU Kernel

• CUDA Kernel

• OpenCL Kernel

• Xeon Phi Kernel

• [Future architectures]
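
The idea of one module holding several device-specific kernels can be sketched in plain C++ as a lookup from device to implementation. The `Device` enum and the `Module` class below are hypothetical names chosen for illustration; they are not the interface of the tools described in this talk.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical device categories; the names are illustrative only.
enum class Device { CPU, CUDA, OpenCL, XeonPhi };

// A module bundles every available implementation (kernel) of one task.
class Module {
public:
    using Kernel = std::function<void(std::vector<double>&)>;

    void add_kernel(Device d, Kernel k) { kernels_[d] = std::move(k); }

    bool supports(Device d) const { return kernels_.count(d) != 0; }

    // Runs the kernel registered for `d`; throws if none was registered.
    void invoke(Device d, std::vector<double>& data) const {
        auto it = kernels_.find(d);
        if (it == kernels_.end()) throw std::runtime_error("no kernel for device");
        it->second(data);
    }

private:
    std::map<Device, Kernel> kernels_;
};
```

A scheduler built on such a structure could call `supports` to find out which devices are candidates for a task before choosing one, and a new architecture would only require registering one more kernel.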


BUILDING AND EXECUTING GRAPHS

Graph Configuration

• Module 1 → Module 2

• Module 1 → Module 3

• Module 2, Module 3 → Module 4

[Diagram: the Graph Builder combines the packaged modules (Module 1 to Module 4, with their kernels and constants) and the graph configuration into a task graph in which Module 1 feeds Module 2 and Module 3, which both feed Module 4. Legend: Programmer Defined, Existing Tools, New Tools]


SCHEDULING TASKS

[Diagram: packaged modules A through F, each described by a name, inputs, and outputs, are combined with the graph to produce a schedule]


STATIC VS. DYNAMIC SCHEDULING

Static Scheduling

» Define an execution schedule up front

» Advantages:
  » No run-time overhead
  » No supporting scheduler software required

» Disadvantages:
  » Can miss efficiencies
  » Very difficult as applications get large

Dynamic Scheduling

» Run-time task distribution

» Advantages:
  » No upfront work by the programmer
  » Often outperforms a human

» Disadvantages:
  » Adds overhead


DYNAMIC GRAPH RUNTIME PROCESSES

Resolve dependencies

Choose device

Allocate resources

Marshall data

Queue task

Invoke

Send outputs
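
The steps above can be sketched as a single-threaded loop. This is a simplified illustration under stated assumptions, not the prototype scheduler: the `Task` record, the `run_graph` function, and the least-loaded device policy are all hypothetical, and allocation and data marshalling are reduced to a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical task record for a single-threaded sketch of a dynamic runtime.
struct Task {
    std::function<double(const std::vector<double>&)> run;  // the kernel
    std::vector<int> consumers;   // tasks that receive this task's output
    std::size_t inputs_needed;    // number of inputs to wait for
    std::vector<double> inbox;    // inputs received so far
};

// Executes the graph; returns the outputs of tasks that have no consumers.
// `device_load` stands in for real device state: each task goes to the
// device that has run the fewest tasks so far (an illustrative policy).
std::vector<double> run_graph(std::vector<Task>& tasks, std::vector<int>& device_load) {
    assert(!device_load.empty());
    std::queue<int> ready;
    // Resolve dependencies: tasks with no inputs can be queued right away.
    for (std::size_t t = 0; t < tasks.size(); ++t) {
        if (tasks[t].inputs_needed == 0) ready.push(static_cast<int>(t));
    }
    std::vector<double> results;
    while (!ready.empty()) {
        Task& task = tasks[ready.front()];
        ready.pop();
        // Choose device: least-loaded first.
        std::size_t dev = 0;
        for (std::size_t d = 1; d < device_load.size(); ++d) {
            if (device_load[d] < device_load[dev]) dev = d;
        }
        ++device_load[dev];
        // Allocation and data marshalling would happen here in a real runtime.
        double out = task.run(task.inbox);  // invoke
        // Send outputs: deliver to consumers and queue any that became ready.
        if (task.consumers.empty()) results.push_back(out);
        for (int c : task.consumers) {
            tasks[c].inbox.push_back(out);
            if (tasks[c].inbox.size() == tasks[c].inputs_needed) ready.push(c);
        }
    }
    return results;
}
```

Unlike a static schedule, the order here is decided while the graph runs: a task is queued only at the moment its last input arrives.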


BREAKING SORAD INTO TASKS


DETERMINING TASKS

» Identified key functional components

» Broke down and tested each component

» Built task versions

Tasks identified:

» Cloud cover

» Chemistry

» Clear tau

» Cloud tau

» Optics (clear and cloudy)

» Flux calculation

» Flux integration

» Absorption

» Flux absorption


SORAD GRAPH


CODE TO BUILD SORAD TASK GRAPH (SUBSET)


SORAD GRAPH – SCHEDULER GENERATED


DATA INPUT AND OUTPUT

» Source nodes – Data Input

» Can only have output connections

» Data provided directly by the end user

» Sink nodes – Data Output

» Output from the graph is retrieved in an asynchronous manner as it becomes available

» The user associates a callback function with the sink node which will be called once for each chunk of data that arrives at a sink node
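
A minimal sketch of the source and sink idea, assuming hypothetical `SourceNode` and `SinkNode` classes. For brevity the source is wired straight to the sink with no tasks in between, and the callback runs synchronously inside `push`; in an asynchronous runtime it would be invoked by the scheduler whenever a chunk reaches the sink.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sink node: forwards each arriving chunk to a user callback.
class SinkNode {
public:
    using Callback = std::function<void(const std::vector<double>&)>;
    explicit SinkNode(Callback cb) : callback_(std::move(cb)) {}
    // Called by the graph once per chunk that reaches this sink.
    void receive(const std::vector<double>& chunk) const { callback_(chunk); }
private:
    Callback callback_;
};

// Hypothetical source node: output connections only; data comes from the user.
class SourceNode {
public:
    void connect(const SinkNode& sink) { outputs_.push_back(&sink); }
    // The application pushes a chunk in; it flows to every connected node.
    void push(const std::vector<double>& chunk) const {
        for (const SinkNode* out : outputs_) out->receive(chunk);
    }
private:
    std::vector<const SinkNode*> outputs_;
};
```

The application only ever touches `push` on the input side and its own callback on the output side, so the graph between them can change without affecting the calling code.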

[Diagram: SORAD Interface, in which the application supplies data to source nodes (Src) and receives results from sink nodes (Sk)]


LINKING TASK SORAD TO EXTERNAL APPLICATION (INPUTS)


BUILDING SORAD TASKS AND MODULES


WRITING TASK INTERFACES

A task interface specifies the relevant calling information


EXAMPLE TASK INTERFACE


DETERMINING TASKS TO ACCELERATE

» Certain tasks would benefit greatly from being run on a GPU or accelerator device (others would not)

» Rather than iterating over each band for one of these tasks, we can run each calculation simultaneously

» Tasks are independent over columns and levels

» Also independent over UV and IR bands

» Tasks we focused on for GPU-implementations:
  » Clear optics
  » Cloudy optics
  » Flux
  » Flux integration

» These tasks represented the bulk of the runtime and are computationally intensive
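
Independence over bands, columns, and levels means the nested loops can be flattened into one index space with one work item per element. The sketch below shows one possible mapping; the `WorkItem` struct, the function names, and the choice of level as the fastest-varying index are assumptions, not the layout used in SORAD.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical index triple for one independent unit of work.
struct WorkItem {
    std::size_t band;
    std::size_t column;
    std::size_t level;
};

// Maps a flat work-item index (what a GPU thread would compute from its
// block and thread IDs) back to (band, column, level). Level varies fastest.
WorkItem unflatten(std::size_t flat, std::size_t n_columns, std::size_t n_levels) {
    assert(n_columns > 0 && n_levels > 0);
    WorkItem w;
    w.level = flat % n_levels;
    w.column = (flat / n_levels) % n_columns;
    w.band = flat / (n_levels * n_columns);
    return w;
}

// Inverse mapping, useful for checking the layout.
std::size_t flatten(const WorkItem& w, std::size_t n_columns, std::size_t n_levels) {
    return (w.band * n_columns + w.column) * n_levels + w.level;
}
```

Which index should vary fastest depends on how the arrays are stored, since neighboring work items should touch neighboring memory.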


INITIAL APPROACH: OPENACC

Pros

» Easy-to-use directives

» Similar approach to OpenMP

» Portable and potentially usable with single-source

Cons

» Little-to-no data management control within a kernel

» SORAD is in FORTRAN!

» Many features we needed are buggy or unsupported

» Ignores specified directives in favor of compiler "optimizations"


FINAL IMPLEMENTATION: CUDA

Pros

» More control over execution strategy and data management

» Mature technology which supports most common language features

» Using the task-based approach, not locked into CUDA forever

Cons

» NVIDIA only

» Requires significantly more work

» In some cases substantial rewrites necessary to get good performance


NOTE ON PERFORMANCE TUNING

We performed limited performance tuning

» Selecting block sizes and number of blocks to fully use hardware without incurring extra overhead

» Ensuring data is accessed efficiently

» Using registers and shared memory effectively for temporary data

» Limiting register usage to ensure good occupancy

» Typically 64 or fewer registers per thread is desirable
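
The link between register usage and occupancy can be estimated with simple arithmetic. The sketch below computes only the limit imposed by registers and ignores allocation granularity and shared memory. The hardware limits are passed in as parameters because they vary by GPU generation and are not stated in these slides; the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Upper bound on occupancy imposed by register usage alone.
// The hardware limits are parameters because they vary by GPU generation;
// allocation granularity and shared-memory limits are ignored here.
double register_limited_occupancy(int regs_per_thread,
                                  int regs_per_sm,
                                  int max_threads_per_sm) {
    assert(regs_per_thread > 0 && regs_per_sm > 0 && max_threads_per_sm > 0);
    int threads_by_regs = regs_per_sm / regs_per_thread;
    int resident = std::min(threads_by_regs, max_threads_per_sm);
    return static_cast<double>(resident) / max_threads_per_sm;
}
```

As an illustration, assuming 65,536 registers and 2,048 resident threads per multiprocessor, a kernel using 64 registers per thread is capped at half occupancy, while one using 128 registers per thread is capped at a quarter.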


SCHEDULED SORAD: COMPARISON AND BENCHMARKING


STATICALLY SCHEDULED SORAD

» Statically scheduled implementation in pure Fortran

» Benefits of the task-based approach without additional software

» Breaks the problem into chunks along columns

» Overlaps the processing of several chunks

» Memory for entire run can be allocated up-front

» Limiting parallelism to a few parallel runs

» CPU performance is somewhat lower when using the PGI compiler compared to Intel
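
Breaking the problem into chunks along columns amounts to splitting the column range into contiguous pieces. The sketch below is written in C++ for illustration, although the static implementation itself is in Fortran; the function name and the fixed chunk size are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits columns [0, n_columns) into contiguous chunks of at most
// `chunk_size` columns. Each pair is a half-open range [begin, end).
std::vector<std::pair<std::size_t, std::size_t>>
make_column_chunks(std::size_t n_columns, std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    for (std::size_t begin = 0; begin < n_columns; begin += chunk_size) {
        chunks.emplace_back(begin, std::min(begin + chunk_size, n_columns));
    }
    return chunks;
}
```

Since the number of chunks and their sizes are known before the run starts, buffers for a fixed number of in-flight chunks can be allocated once and reused, which matches the up-front allocation noted above.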


DYNAMICALLY SCHEDULED SORAD

Tested with the first prototype of our dynamic task scheduler

» Tasks are registered to the scheduler and connected into a graph

» Data is pushed into and received out of the graph asynchronously

» The scheduler can manage transfers between CPU and GPU automatically

» Allows tasks to be mixed and matched without additional effort

» The results show considerable task-level parallelism

» Scheduler still has a lot of room for optimization


TASK-BASED SORAD VERSION

Task-Based SORAD

» SORAD broken down into component tasks and connected in a graph

» Advantages: Easier to maintain, test, and extend; Can be scheduled

Statically Scheduled

» An upfront static schedule can be built to execute the SORAD task graph

» Advantages: Better performance than regular SORAD; Better portability

Dynamically Scheduled

» Could be scheduled with any framework (e.g. Intel’s TBB)

» Our prototype scheduler showed hybrid operation and significant parallelism


ORIGINAL SORAD

Output                        Max percent error    Average percent error
All flux                      0.000%               0.000%
Upward flux                   0.000%               0.000%
Clear flux                    0.000%               0.000%
Clear upward flux             0.000%               0.000%
Surface flux (direct UV)      0.000%               0.000%
Surface flux (diffuse UV)     0.000%               0.000%
Surface flux (direct PAR)     0.000%               0.000%
Surface flux (diffuse PAR)    0.000%               0.000%
Surface flux (direct IR)      0.000%               0.000%
Surface flux (diffuse IR)     0.000%               0.000%
Surface flux (per-band)       0.000%               0.000%

Execution Time – 5.826 seconds


STATICALLY SCHEDULED TASK-BASED SORAD (CPU-ONLY)

Output                        Max percent error    Average percent error
All flux                      0.000%               0.000%
Upward flux                   0.002%               0.000%
Clear flux                    0.000%               0.000%
Clear upward flux             0.000%               0.000%
Surface flux (direct UV)      0.000%               0.000%
Surface flux (diffuse UV)     0.000%               0.000%
Surface flux (direct PAR)     0.000%               0.000%
Surface flux (diffuse PAR)    0.001%               0.000%
Surface flux (direct IR)      0.000%               0.000%
Surface flux (diffuse IR)     0.001%               0.000%
Surface flux (per-band)       0.000%               0.000%

Execution Time – 3.373 seconds


DYNAMICALLY SCHEDULED TASK-BASED SORAD (CPU-ONLY)


Execution Time – 5.788 seconds


STATICALLY SCHEDULED TASK-BASED SORAD (HYBRID)

Output                        Max percent error    Average percent error
All flux                      1.102%               0.002%
Upward flux                   0.026%               0.002%
Clear flux                    0.147%               0.001%
Clear upward flux             0.013%               0.001%
Surface flux (direct UV)      0.000%               0.000%
Surface flux (diffuse UV)     0.191%               0.005%
Surface flux (direct PAR)     0.003%               0.000%
Surface flux (diffuse PAR)    0.188%               0.004%
Surface flux (direct IR)      0.016%               0.000%
Surface flux (diffuse IR)     0.188%               0.005%
Surface flux (per-band)       0.191%               0.002%

Execution Time – 2.212 seconds


DYNAMICALLY SCHEDULED TASK-BASED SORAD (HYBRID)


Execution Time – 4.7 seconds


DYNAMICALLY SCHEDULED TASK-BASED SORAD (HYBRID) - MAGNIFIED


Execution Time – 4.7 seconds


PERFORMANCE COMPARISON

Dynamically scheduled versions took about 1 second to handle data allocation

[Bar chart: Execution Times (0 to 7 on the vertical axis) for Original, Static CPU-only, Dynamic CPU-only, Static Hybrid, and Dynamic Hybrid, with each bar split into Execution and Allocation]
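
The execution times reported on the result slides can be turned into speedups over the original SORAD with a one-line calculation. The helper and constant names below are hypothetical; the times are the ones quoted in this deck.

```cpp
#include <cassert>

// Speedup of a variant relative to a baseline, from wall-clock times.
double speedup(double baseline_seconds, double variant_seconds) {
    assert(baseline_seconds > 0.0 && variant_seconds > 0.0);
    return baseline_seconds / variant_seconds;
}

// Execution times reported on the result slides, in seconds.
const double kOriginal = 5.826;
const double kStaticCpu = 3.373;
const double kDynamicCpu = 5.788;
const double kStaticHybrid = 2.212;
const double kDynamicHybrid = 4.7;
```

With these figures the static versions come out at roughly 1.7x (CPU-only) and 2.6x (hybrid) over the original, the dynamic hybrid version at roughly 1.2x, and the dynamic CPU-only version at about parity.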


CONCLUSION

Built Task-Based SORAD

» Analyzed SORAD and decomposed it into a set of tasks and a related DAG

» Code is now streamlined, more consistent, and easier to maintain and extend

Static Scheduling

» Much higher performance than original (both CPU and hybrid versions)

» SORAD is small enough that a near-optimal schedule can be determined up front

Dynamic Scheduling

» Options: EMP Prototype Scheduler (hybrid system), TBB (single CPU only), etc.

» Faster than original but more useful on larger applications