Workshop on HPC in India
Programming Models, Languages, and Compilation for
Accelerator-Based Architectures
R. Govindarajan
SERC, [email protected]
ATIP 1st Workshop on HPC in India @ SC-09
Current Trend in HPC Systems
- Top500 systems have hundreds of thousands (100,000s) of cores; performance scaling on such large HPC systems is a major challenge.
- The number of cores per processor/node keeps increasing: 4-6 cores per processor, 16-24 cores per node, so there is parallelism even at the node level.
- Top systems use accelerators (GPUs and Cell BEs); a single GPU contains 1000s of processing elements.
HPC Design Using Accelerators
- High level of performance from accelerators.
- Variety of general-purpose hardware accelerators: GPUs (NVIDIA, ATI), ClearSpeed, Cell BE, ...; a plethora of instruction sets even for SIMD.
- Programmable accelerators, e.g., FPGA-based HPC.
- HPC design using accelerators exploits instruction-level parallelism, data-level parallelism on SIMD units, and thread-level parallelism on multiple units/multi-cores.
- Challenges: portability across different generations and platforms; ability to exploit different types of parallelism.
Accelerators – Cell BE
Accelerators - 8800 GPU
The Challenge
Many device-specific programming interfaces:
- SSE
- CUDA
- OpenCL
- ARM Neon
- AltiVec
- AMD CAL
Programming in Accelerator-Based Architectures
Develop a framework that:
- is programmed in a higher-level language, and is efficient;
- can exploit different types of parallelism on different hardware, including parallelism across heterogeneous functional units;
- is portable across platforms, i.e., not device specific.
Existing Approaches
- C/C++: an autovectorizer targets SSE/AltiVec on the CPU.
- CUDA/OpenCL: the nvcc/JIT compiler targets CPUs and GPUs via PTX / ATI CAL IL.
- Brook: the Brook compiler targets CPUs and GPUs via ATI CAL IL.
Existing Approaches (contd.)
- StreamIt: the StreamIt compiler targets Cell BE and RAW.
- Accelerator: the DirectX runtime targets CPUs and GPUs.
- OpenMP: a standard compiler targets CPUs and GPUs.
What is needed?
Synergistic execution on multiple heterogeneous cores: a compiler/runtime system that maps many programming models (streaming languages, MPI/OpenMP, CUDA/OpenCL, array languages such as Matlab, and other parallel languages) onto many targets (Cell BE, other accelerators, multicores, GPUs, SSE).
What is needed?
The same picture with the missing middle filled in: the source programming models (streaming languages, MPI/OpenMP, CUDA/OpenCL, array languages such as Matlab, other parallel languages) are lowered to PLASMA, a high-level IR, which a compiler and runtime system then map onto the heterogeneous targets (Cell BE, other accelerators, multicores, GPUs, SSE) for synergistic execution.
Stream Programming Model
- A higher-level programming model in which nodes represent computation and channels represent communication (a producer/consumer relation) between them.
- Exposes pipelined parallelism and task-level parallelism, with temporal streaming of data.
- Examples: Synchronous Data Flow (SDF), Stream Flow Graph, StreamIt, Brook, ...
- Compiling techniques exist for achieving rate-optimal, buffer-optimal, software-pipelined schedules.
- Goal: mapping applications to accelerators such as GPUs and Cell BE.
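The SDF scheduling idea above can be made concrete with a small sketch: in a chain of filters, the steady-state repetition counts r satisfy the balance equations r[i] * push[i] == r[i+1] * pop[i+1]. The helper below is purely illustrative (it is not part of the compiler described in the talk) and handles only a linear chain:

```python
from fractions import Fraction
from math import gcd

def repetitions(chain):
    """Smallest integer steady-state repetition counts for a chain of
    SDF filters. chain: list of (pop_rate, push_rate), filter 0 = source.
    Solves r[i] * push[i] == r[i+1] * pop[i+1]."""
    r = [Fraction(1)]
    for i in range(1, len(chain)):
        push_prev = chain[i - 1][1]
        pop_cur = chain[i][0]
        r.append(r[-1] * push_prev / pop_cur)
    # Scale the rational solution to the smallest positive integers.
    denom_lcm = 1
    for x in r:
        denom_lcm = denom_lcm * x.denominator // gcd(denom_lcm, x.denominator)
    ints = [int(x * denom_lcm) for x in r]
    g = 0
    for v in ints:
        g = gcd(g, v)
    return [v // g for v in ints]
```

For example, a source pushing 1 value per firing feeding a filter that pops 2 (like the SAXPY filter later in the talk) needs the source to fire twice per steady state.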
The StreamIt Language
StreamIt programs are a hierarchical composition of three basic constructs:
- Pipeline: a linear chain of filters.
- SplitJoin: a splitter (round-robin or duplicate), parallel streams, and a joiner.
- FeedbackLoop: a joiner, a body, a splitter, and a loop stream.
Filters may be stateful and may peek at values beyond those they pop.
Why StreamIt on GPUs?
- More "natural" than frameworks like CUDA or CTM, with an easier learning curve: no need to think in terms of "threads" or blocks.
- StreamIt programs are easier to verify.
- The schedule can be determined statically.
Issues in Mapping StreamIt to GPUs
- Work distribution across multiprocessors: GPUs have hundreds of processing pipes, so exploit task-level and data-level parallelism and schedule across the multiprocessors, with multiple concurrent threads per SM to exploit DLP.
- Choosing the execution configuration: task granularity and concurrency.
- Lack of synchronization between the processors of the GPU.
- Managing CPU-GPU memory bandwidth.
Stream Graph Execution
[Figure: a stream graph with filters A, B, C, and D, and its software-pipelined execution on SMs SM1-SM4 over time steps 0-7. Data-parallel instances (A1-A4, B1-B4, C1-C4, D1-D4) run across the SMs, illustrating pipeline parallelism, task parallelism, and data parallelism.]
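The overlap in the figure above can be sketched in a few lines. Assuming (for illustration only) that each filter is one pipeline stage and iteration i of stage s runs at time step i + s, the steady state executes different iterations of different stages concurrently:

```python
def software_pipeline(stages, iterations):
    """Return, per time step, the (stage, iteration) pairs that execute in a
    software-pipelined schedule of a linear stream graph, where iteration i
    of stage s runs at step i + s (a simplifying assumption)."""
    steps = {}
    for s, stage in enumerate(stages):
        for i in range(iterations):
            steps.setdefault(i + s, []).append((stage, i))
    return [steps[t] for t in sorted(steps)]
```

With two stages and two iterations, the middle step runs A's second iteration alongside B's first, which is exactly the overlap the figure depicts.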
Our Approach for GPUs
Code for SAXPY:

float->float filter saxpy {
    float a = 2.5f;
    work pop 2 push 1 {
        float x = pop();
        float y = pop();
        float s = a * x + y;
        push(s);
    }
}
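Each firing of the filter above pops two values and pushes one. A plain Python rendering of the same work-function semantics (for illustration only; it says nothing about how the generated CUDA code executes):

```python
def saxpy_filter(stream, a=2.5):
    """Fire the SAXPY work function repeatedly over an input stream:
    each firing pops x and y and pushes a*x + y.
    len(stream) must be even (pop rate 2)."""
    out = []
    it = iter(stream)
    for x in it:
        y = next(it)           # pop 2: x, then y
        out.append(a * x + y)  # push 1: a*x + y
    return out
```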
Our Approach (contd.)
- Multithreading: identify a good execution configuration to exploit the right amount of data parallelism.
- Memory: an efficient buffer layout scheme ensures all accesses to GPU memory are coalesced.
- Task partitioning between GPU and CPU cores: a work-scheduling and processor (SM) assignment problem that takes communication bandwidth restrictions into account.
Execution Configuration
[Figure: 128 data-parallel instances (A0...A127 feeding B0...B127) grouped into macro nodes, with macro-node execution times of 32 and 16 in the two configurations shown. Total execution time on 2 SMs = MII = 64/2 = 32.]
More threads for exploiting data-level parallelism.
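The MII arithmetic in the figure is just steady-state work divided across the SMs. A one-line sketch (simplified: it captures only the resource bound and ignores DMA and dependence constraints):

```python
from math import ceil

def resource_mii(total_work, num_sms):
    """Resource-constrained minimum initiation interval: total steady-state
    work divided across the SMs, rounded up to a whole time step."""
    return ceil(total_work / num_sms)
```

This reproduces the figure's 64/2 = 32.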
Coalesced Memory Access
- GPUs have a banked memory architecture with a very wide memory channel.
- Accesses by threads in an SM have to be coalesced.
[Figure: with the natural layout d0 d1 d2 d3 d4 d5 d6 d7, each of threads 0-3 reads its own contiguous pair, so simultaneous accesses hit scattered addresses. After interleaving the buffer to d0 d2 d4 d6 d1 d3 d5 d7, at each step threads 0-3 read consecutive addresses across banks B0-B3, so the accesses coalesce.]
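The buffer transform implied by the figure can be sketched directly. Assuming each of T threads originally owns a contiguous chunk of k elements, moving element j of thread t to position j*T + t makes each simultaneous access step touch consecutive addresses (helper name is hypothetical):

```python
def coalesced_layout(buf, num_threads):
    """Interleave a buffer so that at pop step j, thread t reads position
    j*num_threads + t, i.e., threads access consecutive addresses.
    Original layout: thread t owns the contiguous chunk buf[t*k:(t+1)*k]."""
    k = len(buf) // num_threads
    out = [None] * len(buf)
    for t in range(num_threads):
        for j in range(k):
            out[j * num_threads + t] = buf[t * k + j]
    return out
```

Applied to the figure's eight elements and four threads, this yields exactly the d0 d2 d4 d6 d1 d3 d5 d7 ordering.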
Execution on CPU and GPU
- Problem: partition work across the CPU and GPU; data transfer between GPU and host memory is required based on the partition.
- Coalesced access is efficient for the GPU but harmful for the CPU, so data must be transformed before moving from/to GPU memory.
- Objective: reduce the overall execution time, taking memory transfer and transform delays into account.
Scheduling and Mapping
[Figure: an initial StreamIt graph with nodes A-E, each annotated with its CPU and GPU execution costs (e.g., GPU:20, CPU:20, CPU:15, CPU:10) and edge transfer costs (20, 10, 10), and the partitioned graph in which each node is assigned to the CPU or the GPU. Resulting loads: CPU 45, GPU 40, DMA 40; MII = 45.]
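The actual system uses ILP and heuristic partitioners (see the compiler framework slide). Purely to illustrate the objective, here is a naive greedy sketch that assigns each node to whichever device runs it cheaper and reports the resulting loads; unlike the real partitioners it ignores DMA cost on cut edges, and the node costs in the test are illustrative, not the figure's:

```python
def greedy_partition(costs):
    """costs: {node: (cpu_time, gpu_time)}. Assign each node to its cheaper
    device; return (assignment, cpu_load, gpu_load). A toy heuristic that
    ignores DMA transfer cost between devices."""
    assign, cpu_load, gpu_load = {}, 0, 0
    for node, (c, g) in costs.items():
        if c <= g:
            assign[node] = "CPU"
            cpu_load += c
        else:
            assign[node] = "GPU"
            gpu_load += g
    return assign, cpu_load, gpu_load
```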
Scheduling and Mapping (contd.)
[Figure: the software-pipelined steady state of the partitioned graph. Instances from different steady-state iterations (e.g., An, Bn-1, Cn-3, Dn-5, En-7) execute concurrently in the CPU, DMA channel, and GPU columns of the schedule.]
Compiler Framework
1. StreamIt program
2. Generate code for profiling; execute profile runs
3. Configuration selection
4. Task partitioning (ILP partitioner or heuristic partitioner)
5. Instance partitioning
6. Modulo scheduling
7. Code generation: CUDA code + C code
Experimental Results on Tesla
Significant speedup for synergistic execution.
[Figure: bar chart of speedups on Tesla.]
What is needed?
Recap: source programming models (streaming languages, MPI/OpenMP, CUDA/OpenCL, array languages such as Matlab, other parallel languages) are lowered to PLASMA, the high-level IR, which the compiler and runtime system map onto the heterogeneous targets (Cell BE, other accelerators, multicores, GPUs, SSE) for synergistic execution.
PLASMA IR: What Should a Solution Provide?
- Rich abstractions for functionality.
- Independence from any single architecture; portability without compromises on efficiency.
- Scale-up and scale-down: from a single-core embedded processor to a multi-core workstation.
- The ability to take advantage of accelerators (GPU, Cell, ...).
- Transparent distributed memory.
PLASMA: Portable Programming for PLASTIC SIMD Accelerators.
PLASMA IR
Matrix-vector multiply (matrix M, vector V) expressed as a parallel multiply over a slice, followed by a reduction:

par mul, temp, A[i*n : i*n+n : 1], X
reduce add, Y[i : i+1 : 1], temp
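The two IR operations can be read as an elementwise parallel multiply followed by an additive reduction, applied once per row. A Python rendering of those semantics (hypothetical helper mirroring the IR, not the real compiler; A is the matrix flattened row-major):

```python
def matvec(A, X, n):
    """Matrix-vector multiply in the style of the PLASMA IR above:
    for each row i, 'par mul' forms temp = A[i*n : i*n+n] * X elementwise,
    then 'reduce add' accumulates temp into Y[i]."""
    m = len(A) // n
    Y = [0] * m
    for i in range(m):
        temp = [A[i * n + j] * X[j] for j in range(n)]  # par mul
        Y[i] = sum(temp)                                # reduce add
    return Y
```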
Our Framework
- "CPLASM", a prototype high-level assembly language.
- A prototype PLASMA IR compiler.
- Currently supported targets: C (scalar), SSE3, CUDA (NVIDIA GPUs).
- Future targets: Cell, ATI, ARM Neon, ...
- Compiler optimizations for this "vector" IR.
Our Framework (contd.)
Plenty of optimization opportunities!
PLASMA IR Performance
Normalized execution time is comparable to that of hand-tuned libraries!
Ongoing Work
The same framework picture: streaming languages, MPI/OpenMP, CUDA/OpenCL, array languages (Matlab), and other parallel languages, lowered through the PLASMA high-level IR, compiler, and runtime system onto Cell BE, other accelerators, multicores, GPUs, and SSE for synergistic execution.
- Look at other high-level languages!
- Target other accelerators.
Compiling OpenMP/MPI/X10
- Mapping the semantics; exploiting data parallelism and task parallelism.
- Communication and synchronization across CPU/GPU/multiple nodes.
- Accelerator-specific optimization: memory layout, memory transfer, ...
- Performance and scaling.
Acknowledgements
My students! IISc and SERC, Microsoft and NVIDIA, ATIP, NSF, ONR, and all sponsors.
Thank You!!