TASK-BASED DYNAMIC SCHEDULING APPROACH TO ACCELERATING NASA’S GEOS-5
Eric Kelmelis
April 5, 2016
Copyright © 2016 EM Photonics – EM Photonics Proprietary
OUTLINE
» TASK-BASED PROGRAMMING: Concepts and goals
» SORAD IMPLEMENTATION: Converting SORAD (GEOS-5) for task paradigm
» RESULTS: Testing and benchmarking
MODERN HIGH-PERFORMANCE COMPUTING NODE
[Figure: two node diagrams, one containing a single CPU and one containing two CPUs and two GPUs]
High-performance computing nodes have evolved
MOTIVATION
» TAKE ADVANTAGE OF HYBRID SYSTEMS: NASA codes are running on increasingly varied, heterogeneous computer systems
» EFFICIENT RESOURCE UTILIZATION: Utilizing computational resources effectively is difficult as their properties become less regular
» TECHNICAL EXPERTISE: Scientists often lack the time or expertise to performance-tune their codes
PROPOSED SOLUTION: TASK-BASED PROGRAMMING
» A task is a pure function which receives inputs and delivers outputs according to a task graph
» Details of task invocation can be abstracted from the programmer
» Tasks can be developed, tested, and characterized as independent units
» Synchronization between tasks is defined purely by connections in the task graph
PRINCIPLES OF A TASK
» Managing data:
» Tasks cannot depend on mutable state other than explicit inputs
» Tasks cannot modify state other than explicit outputs
» These constraints allow us to easily draw a graph of dependencies between tasks
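Because each task names everything it reads and writes, the dependency graph can be derived mechanically. The following is a minimal sketch of that idea, not the scheduler's actual interface: `TaskDecl` and `dependency_edges` are hypothetical names, and data items are identified by plain strings purely for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical task description: a name plus the data it reads and writes.
struct TaskDecl {
    std::string name;
    std::vector<std::string> inputs;   // the only state the task may read
    std::vector<std::string> outputs;  // the only state the task may write
};

// Returns (producer, consumer) index pairs. An edge exists whenever one
// task's declared output appears among another task's declared inputs.
std::vector<std::pair<int, int>> dependency_edges(const std::vector<TaskDecl>& tasks) {
    std::vector<std::pair<int, int>> edges;
    const int n = static_cast<int>(tasks.size());
    for (int p = 0; p < n; ++p) {
        assert(!tasks[p].name.empty());  // every task must be named
        for (int c = 0; c < n; ++c) {
            if (p == c) continue;
            bool linked = false;
            for (const auto& out : tasks[p].outputs)
                for (const auto& in : tasks[c].inputs)
                    if (out == in) linked = true;
            if (linked) edges.emplace_back(p, c);
        }
    }
    return edges;
}
```

If a task touched hidden mutable state, this derivation would miss a dependency, which is why the two constraints above matter.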
TASK GRAPHS
» Define the dependencies between tasks
» Allows creating a schedule for executing them
» Can be conceptualized as data flowing through a task graph
» Data transfer easy to model by looking at dependencies
» Tasks can be divided among threads and devices for concurrent execution
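One common way to turn a dependency graph into a schedule is a topological sort. The sketch below uses Kahn's algorithm, which the slides do not name, to group tasks into "waves" whose members could run concurrently. `schedule_waves` and the integer task ids are assumptions made for illustration.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Groups tasks into waves: every task in a wave depends only on tasks in
// earlier waves, so the tasks within one wave may run concurrently.
// `edges` holds (producer, consumer) pairs over task ids 0..n-1.
std::vector<std::vector<int>> schedule_waves(
        int n, const std::vector<std::pair<int, int>>& edges) {
    std::vector<int> pending(n, 0);              // unmet dependencies per task
    std::vector<std::vector<int>> consumers(n);  // who waits on each task
    for (const auto& e : edges) {
        assert(e.first >= 0 && e.first < n && e.second >= 0 && e.second < n);
        consumers[e.first].push_back(e.second);
        ++pending[e.second];
    }
    std::vector<int> ready;
    for (int t = 0; t < n; ++t)
        if (pending[t] == 0) ready.push_back(t);

    std::vector<std::vector<int>> waves;
    int scheduled = 0;
    while (!ready.empty()) {
        std::vector<int> next;
        for (int t : ready)
            for (int c : consumers[t])
                if (--pending[c] == 0) next.push_back(c);
        scheduled += static_cast<int>(ready.size());
        waves.push_back(std::move(ready));
        ready = std::move(next);
    }
    assert(scheduled == n);  // fails if the graph contains a cycle
    return waves;
}
```

For a diamond-shaped graph (one task feeding two tasks that both feed a fourth), this yields three waves, with the two middle tasks sharing a wave.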
TASK-GRAPH CONFIGURATION
Three things must be done to convert (part of) a program to a task-graph:
1. Write task definitions
2. Specify task-graph
3. Add calls in surrounding program for task-graph inputs and outputs
BUILDING A TASK-BASED PROGRAM
BUILDING MODULES
[Figure: build pipeline. Task descriptions and per-task kernels (e.g. K1 for T1, K1 for T2, K1 for T3, K1 for T4) pass through compilers to produce compiled kernels (T1:K1, T1:K2, T1:K3, T2:K1, T2:K2, T3:K1, T4:K1, T4:K2). The Module Builder packages them, together with constants, into modules. Legend: Programmer Defined, Existing Tools, New Tools]
» Module 1: T1:K1, T1:K2, T1:K3
» Module 2: T2:K1, T2:K2
» Module 3: T3:K1
» Module 4: T4:K1, T4:K2
Modules implement a task
COMPUTATIONAL FUNCTIONS FOR MODULES
// Call CUDA kernel (integer division: assumes NX and NZ are multiples of TILE_WIDTH)
dim3 dimGrid(NX / TILE_WIDTH, NZ / TILE_WIDTH);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
primitive_variables_ker<<<dimGrid, dimBlock>>>(
    p_cons, p_rho, p_vx, p_vy, p_vz,
    p_p, p_bx, p_by, p_bz, p_psic);
// Call Intel Xeon Phi code
primitive_variables_mic(p_cons, p_rho, p_vx, p_vy, p_vz,
                        p_p, p_bx, p_by, p_bz, p_psic);
// Call CPU code
primitive_variables_cpu(p_cons, p_rho, p_vx, p_vy, p_vz,
                        p_p, p_bx, p_by, p_bz, p_psic);
Develop computational functionality for each targeted processing device
UpdateVariables()
• CPU Kernel
• CUDA Kernel
• OpenCL Kernel
• Xeon Phi Kernel
• [Future architectures]
Module
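The idea of one module holding several interchangeable kernels can be sketched as a small dispatch table. This is an illustration only: the `Module` class, the `Device` enum, and the preference-list selection rule are assumptions, not the interface used in the presented system.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>
#include <vector>

enum class Device { CPU, CUDA, OpenCL, XeonPhi };

// Hypothetical module: one task, several interchangeable kernels.
class Module {
public:
    using Kernel = std::function<void(std::vector<double>&)>;

    void add_kernel(Device d, Kernel k) { kernels_[d] = std::move(k); }
    bool supports(Device d) const { return kernels_.count(d) != 0; }

    // Runs the kernel for the first preferred device that this module
    // supports and reports which device was used.
    Device run(const std::vector<Device>& preferred, std::vector<double>& data) const {
        for (Device d : preferred) {
            auto it = kernels_.find(d);
            if (it != kernels_.end()) {
                it->second(data);
                return d;
            }
        }
        assert(false && "no kernel registered for any preferred device");
        return Device::CPU;
    }

private:
    std::map<Device, Kernel> kernels_;
};
```

Keeping the kernels behind one task-level entry point means a new architecture only requires registering one more kernel; callers of the module do not change.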
BUILDING AND EXECUTING GRAPHS
Graph Configuration
• Module 1 → Module 2
• Module 1 → Module 3
• Module 2, Module 3 → Module 4
[Figure: the Graph Builder combines the graph configuration with Modules 1 to 4 (and their constants) to produce a graph in which Module 1 feeds Modules 2 and 3, which both feed Module 4. Legend: Programmer Defined, Existing Tools, New Tools]
SCHEDULING TASKS
[Figure: packaged modules A through F, each described by a name, inputs, and outputs, are combined with the graph to produce a schedule]
STATIC VS. DYNAMIC SCHEDULING
Static Scheduling
» Define an execution schedule up front
» Advantages:
  » No run time overhead
  » No supporting scheduler software required
» Disadvantages:
  » Can miss efficiencies
  » Very difficult as applications get large
Dynamic Scheduling
» Run-time task distribution
» Advantages:
  » No upfront work by the programmer
  » Often outperforms a human
» Disadvantages:
  » Adds overhead
DYNAMIC GRAPH RUNTIME PROCESSES
Resolve dependencies
Choose device
Allocate resources
Marshal data
Queue task
Invoke
Send outputs
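Several of the runtime steps listed above can be illustrated with a single-threaded ready queue. This is a simplified sketch under stated assumptions: `Node` and `run_graph` are hypothetical, every node emits one `double`, and the device-selection and resource-allocation steps are omitted. A real scheduler would dispatch ready nodes to threads and devices instead of running them in a loop.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical runtime node: consumes one value per input edge, emits one value.
struct Node {
    std::function<double(const std::vector<double>&)> body;
    std::vector<int> consumers;  // nodes that receive this node's output
    int num_inputs = 0;          // values required before the node may run
    std::vector<double> inbox;   // values received so far, in arrival order
};

// Seeds the graph with (node id, value) pairs, then repeatedly pops a ready
// node, invokes it, and forwards its output to its consumers.
// Returns the order in which nodes were invoked.
std::vector<int> run_graph(std::vector<Node>& nodes,
                           const std::vector<std::pair<int, double>>& seeds) {
    std::queue<int> ready;
    std::vector<int> order;
    auto deliver = [&](int id, double value) {
        assert(id >= 0 && id < static_cast<int>(nodes.size()));
        Node& n = nodes[id];
        n.inbox.push_back(value);                         // marshal data
        if (static_cast<int>(n.inbox.size()) == n.num_inputs)
            ready.push(id);                               // dependencies resolved: queue task
    };
    for (const auto& s : seeds) deliver(s.first, s.second);
    while (!ready.empty()) {
        int id = ready.front();
        ready.pop();
        double out = nodes[id].body(nodes[id].inbox);     // invoke
        order.push_back(id);
        for (int c : nodes[id].consumers) deliver(c, out);  // send outputs
    }
    return order;
}
```

Because inputs are stored in arrival order, this sketch only suits node bodies that do not care which producer supplied which value; a fuller version would key inputs by edge.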
BREAKING SORAD INTO TASKS
DETERMINING TASKS
» Identified key functional components
» Broke down and tested each component
» Built task versions
» Cloud cover
» Chemistry
» Clear tau
» Cloud tau
» Optics (clear and cloudy)
» Flux calculation
» Flux integration
» Absorption
» Flux absorption
SORAD GRAPH
CODE TO BUILD SORAD TASK GRAPH (SUBSET)
SORAD GRAPH – SCHEDULER GENERATED
DATA INPUT AND OUTPUT
» Source nodes – Data Input
» Can only have output connections
» Data provided directly by the end user
» Sink nodes – Data Output
» Output from the graph is retrieved in an asynchronous manner as it becomes available
» The user associates a callback function with the sink node which will be called once for each chunk of data that arrives at a sink node
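The sink-node callback pattern can be sketched as follows. `SinkNode`, `Chunk`, and `on_arrival` are hypothetical names chosen for illustration, and the sketch delivers chunks synchronously, whereas the slides describe asynchronous retrieval.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sink node: the application registers a callback that is
// invoked once for every chunk of data arriving at the sink.
class SinkNode {
public:
    using Chunk = std::vector<double>;
    using Callback = std::function<void(const Chunk&)>;

    explicit SinkNode(Callback cb) : callback_(std::move(cb)) {
        assert(callback_ && "a sink needs a callback");
    }

    // Called by the graph runtime whenever a chunk reaches this sink.
    void on_arrival(const Chunk& chunk) const { callback_(chunk); }

private:
    Callback callback_;
};
```

The application never polls for results; it only states what should happen to each chunk, and the runtime decides when that happens.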
[Figure: the application supplies data to source nodes (Src) and receives results from sink nodes (Sk)]
SORAD Interface
LINKING TASK SORAD TO EXTERNAL APPLICATION (INPUTS)
BUILDING SORAD TASKS AND MODULES
WRITING TASK INTERFACES
Specifies the relevant calling information
EXAMPLE TASK INTERFACE
DETERMINING TASKS TO ACCELERATE
» Certain tasks would benefit greatly from being run on a GPU or accelerator device (others would not)
» Rather than iterating over each band for one of these tasks, we can run each calculation simultaneously
  » Tasks are independent over columns and levels
  » Also independent over UV and IR bands
» Tasks we focused on for GPU implementations:
  » Clear optics
  » Cloudy optics
  » Flux
  » Flux integration
» These tasks represented the bulk of the runtime and are computationally intensive
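When calculations are independent over columns, levels, and bands, each one can be assigned to its own work item (for example, one GPU thread). The sketch below shows one possible index mapping. The `Shape` and `Index3` types and the column-fastest layout are assumptions for illustration, not SORAD's actual data layout.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical problem shape; the names and layout are illustrative only.
struct Shape {
    std::size_t columns, levels, bands;
    std::size_t size() const { return columns * levels * bands; }
};

struct Index3 {
    std::size_t column, level, band;
};

// Maps a flat work-item id (e.g. a GPU thread id) to one independent
// (column, level, band) calculation, with the column varying fastest.
Index3 unflatten(const Shape& s, std::size_t id) {
    assert(id < s.size());
    Index3 i;
    i.column = id % s.columns;
    i.level = (id / s.columns) % s.levels;
    i.band = id / (s.columns * s.levels);
    return i;
}

// Inverse mapping, useful for addressing a flat array.
std::size_t flatten(const Shape& s, const Index3& i) {
    return i.column + s.columns * (i.level + s.levels * i.band);
}
```

With this mapping the band loop disappears: the number of independent work items is simply columns × levels × bands.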
INITIAL APPROACH: OPENACC
Pros
» Easy-to-use directives
» Similar approach to OpenMP
» Portable and potentially usable with single-source
Cons
» Little-to-no data management control within a kernel
» SORAD is in FORTRAN!
» Many features we needed are buggy or unsupported
» Ignores specified directives in favor of compiler "optimizations"
FINAL IMPLEMENTATION: CUDA
Pros
» More control over execution strategy and data management
» Mature technology which supports most common language features
» Using the task-based approach, not locked into CUDA forever
Cons
» NVIDIA only
» Requires significantly more work
» In some cases substantial rewrites necessary to get good performance
NOTE ON PERFORMANCE TUNING
We performed limited performance tuning
» Selecting block sizes and number of blocks to fully use hardware without incurring extra overhead
» Ensuring data is accessed efficiently
» Ensuring temporary data uses registers and shared memory effectively
» Limiting register usage to ensure good occupancy
» Typically 64 or fewer registers per thread is desirable
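The link between register usage and occupancy can be made concrete with a little arithmetic. The hardware limits below are assumptions, not values from the slides: 65,536 registers and 2,048 resident threads per multiprocessor are typical of NVIDIA GPUs from that period, and the estimate ignores allocation granularity, shared memory, and block-size limits.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-multiprocessor hardware limits (not taken from the slides).
struct SmLimits {
    int registers_per_sm = 65536;
    int max_threads_per_sm = 2048;
};

// Upper bound on occupancy imposed by register usage alone: the fraction of
// the maximum resident threads that fit in the register file.
double register_limited_occupancy(int registers_per_thread, SmLimits hw = SmLimits()) {
    assert(registers_per_thread > 0);
    int resident = std::min(hw.registers_per_sm / registers_per_thread,
                            hw.max_threads_per_sm);
    return static_cast<double>(resident) / hw.max_threads_per_sm;
}
```

Under these assumed limits, 64 registers per thread caps occupancy at 50%, 32 registers allows 100%, and 128 registers drops the cap to 25%.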
SCHEDULED SORAD: COMPARISON AND BENCHMARKING
STATICALLY SCHEDULED SORAD
» Statically scheduled implementation in pure Fortran
» Benefits of the task-based approach without additional software
» Breaks the problem into chunks along columns
» Overlaps the processing of several chunks
» Memory for entire run can be allocated up-front
» Limiting parallelism to a few parallel runs
» CPU performance is somewhat lower when using the PGI compiler compared to Intel
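The chunking strategy can be sketched as below. The presented implementation is in Fortran; this C++ version is only an illustration, and `split_columns`, `buffer_slot`, and the round-robin slot reuse are assumptions about how "a few parallel runs" with up-front allocation might be organized.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open column range [first, first + count).
struct Chunk {
    std::size_t first, count;
};

// Splits `num_columns` into consecutive chunks of at most `chunk_size` columns.
std::vector<Chunk> split_columns(std::size_t num_columns, std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<Chunk> chunks;
    for (std::size_t first = 0; first < num_columns; first += chunk_size)
        chunks.push_back({first, std::min(chunk_size, num_columns - first)});
    return chunks;
}

// With `in_flight` chunks overlapped at a time, chunk i reuses buffer slot
// i % in_flight, so all working memory can be sized once, up front.
std::size_t buffer_slot(std::size_t chunk_index, std::size_t in_flight) {
    assert(in_flight > 0);
    return chunk_index % in_flight;
}
```

Capping the number of chunks in flight is what bounds memory: only `in_flight` buffers of `chunk_size` columns are ever needed, regardless of the total column count.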
DYNAMICALLY SCHEDULED SORAD
Tested with the first prototype of our dynamic task scheduler
» Tasks are registered to the scheduler and connected into a graph
» Data is pushed into and received out of the graph asynchronously
» The scheduler can manage transfers between CPU and GPU automatically
» Allows tasks to be mixed and matched without additional effort
» The results show considerable task-level parallelism
» Scheduler still has a lot of room for optimization
TASK-BASED SORAD VERSION
Task-Based SORAD
» SORAD broken down into component tasks and connected in a graph
» Advantages: Easier to maintain, test, and extend; Can be scheduled
Statically Scheduled
» An upfront static schedule can be built to execute the SORAD task graph
» Advantages: Better performance than regular SORAD; Better portability
Dynamically Scheduled
» Could be scheduled with any framework (e.g. Intel’s TBB)
» Our prototype scheduler showed hybrid operation and significant parallelism
ORIGINAL SORAD
Quantity                     Max percent error   Average percent error
All flux                     0.000%              0.000%
Upward flux                  0.000%              0.000%
Clear flux                   0.000%              0.000%
Clear upward flux            0.000%              0.000%
Surface flux (direct UV)     0.000%              0.000%
Surface flux (diffuse UV)    0.000%              0.000%
Surface flux (direct PAR)    0.000%              0.000%
Surface flux (diffuse PAR)   0.000%              0.000%
Surface flux (direct IR)     0.000%              0.000%
Surface flux (diffuse IR)    0.000%              0.000%
Surface flux (per-band)      0.000%              0.000%
Execution Time – 5.826 seconds
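The slides do not define how the max and average percent errors are computed. The sketch below shows one plausible definition, stated as an assumption: the element-wise relative difference against a reference output, skipping elements whose reference value is zero. `ErrorStats` and `percent_error` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ErrorStats {
    double max_percent;
    double average_percent;
};

// Assumed definition: percent error = 100 * |test - reference| / |reference|,
// taken over every element whose reference value is nonzero.
ErrorStats percent_error(const std::vector<double>& reference,
                         const std::vector<double>& test) {
    assert(reference.size() == test.size());
    ErrorStats s{0.0, 0.0};
    std::size_t counted = 0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        if (reference[i] == 0.0) continue;  // relative error is undefined here
        double pct = 100.0 * std::fabs(test[i] - reference[i]) / std::fabs(reference[i]);
        s.max_percent = std::max(s.max_percent, pct);
        s.average_percent += pct;
        ++counted;
    }
    if (counted > 0) s.average_percent /= static_cast<double>(counted);
    return s;
}
```

Reporting both statistics is useful because a large maximum paired with a small average points to a few outlying elements instead of a systematic difference.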
STATICALLY SCHEDULED TASK-BASED SORAD (CPU-ONLY)
Quantity                     Max percent error   Average percent error
All flux                     0.000%              0.000%
Upward flux                  0.002%              0.000%
Clear flux                   0.000%              0.000%
Clear upward flux            0.000%              0.000%
Surface flux (direct UV)     0.000%              0.000%
Surface flux (diffuse UV)    0.000%              0.000%
Surface flux (direct PAR)    0.000%              0.000%
Surface flux (diffuse PAR)   0.001%              0.000%
Surface flux (direct IR)     0.000%              0.000%
Surface flux (diffuse IR)    0.001%              0.000%
Surface flux (per-band)      0.000%              0.000%
Execution Time – 3.373 seconds
DYNAMICALLY SCHEDULED TASK-BASED SORAD (CPU-ONLY)
Execution Time – 5.788 seconds
STATICALLY SCHEDULED TASK-BASED SORAD (HYBRID)
Quantity                     Max percent error   Average percent error
All flux                     1.102%              0.002%
Upward flux                  0.026%              0.002%
Clear flux                   0.147%              0.001%
Clear upward flux            0.013%              0.001%
Surface flux (direct UV)     0.000%              0.000%
Surface flux (diffuse UV)    0.191%              0.005%
Surface flux (direct PAR)    0.003%              0.000%
Surface flux (diffuse PAR)   0.188%              0.004%
Surface flux (direct IR)     0.016%              0.000%
Surface flux (diffuse IR)    0.188%              0.005%
Surface flux (per-band)      0.191%              0.002%
Execution Time – 2.212 seconds
DYNAMICALLY SCHEDULED TASK-BASED SORAD (HYBRID)
Execution Time – 4.7 seconds
DYNAMICALLY SCHEDULED TASK-BASED SORAD (HYBRID) - MAGNIFIED
Execution Time – 4.7 seconds
PERFORMANCE COMPARISON
Dynamically scheduled versions took about 1 second to handle data allocation
[Figure: bar chart titled "Execution Times" with a vertical axis from 0 to 7, comparing Original, Static CPU-only, Dynamic CPU-only, Static Hybrid, and Dynamic Hybrid; bars are split into Execution and Allocation]
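The execution times reported on the preceding slides can be turned into speedups relative to the original SORAD. The helper below is a small worked calculation; the function and constant names are illustrative, while the timings are the ones quoted on the slides.

```cpp
#include <cassert>

// Speedup of a variant relative to a baseline: baseline time / variant time.
double speedup(double baseline_seconds, double variant_seconds) {
    assert(baseline_seconds > 0.0 && variant_seconds > 0.0);
    return baseline_seconds / variant_seconds;
}

// Execution times reported on the slides, in seconds.
const double kOriginal = 5.826;
const double kStaticCpu = 3.373;
const double kDynamicCpu = 5.788;
const double kStaticHybrid = 2.212;
const double kDynamicHybrid = 4.7;
```

Calling `speedup(kOriginal, ...)` for each variant gives roughly 1.73x for static CPU-only, 2.63x for static hybrid, 1.01x for dynamic CPU-only, and 1.24x for dynamic hybrid.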
CONCLUSION
Built Task-Based SORAD
» Analyzed SORAD and decomposed it into a set of tasks and a related DAG
» Code is now streamlined, more consistent, and easier to maintain and extend
Static Scheduling
» Much higher performance than original: 3.373 s CPU-only and 2.212 s hybrid versus 5.826 s
» SORAD is small enough that a near-optimal schedule can be determined up front
Dynamic Scheduling
» Options: EMP Prototype Scheduler (hybrid system), TBB (single CPU only), etc.
» Faster than original (5.788 s CPU-only and 4.7 s hybrid versus 5.826 s) but more useful on larger applications