Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC:...
Transcript of Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC:...
spcl.inf.ethz.ch
@spcl_eth
Polly-ACC: Transparent Compilation to Heterogeneous Hardware
Tobias Grosser, Torsten Hoefler
1
International Conference of Supercomputing (ICS) June 1 2016, Istanbul
spcl.inf.ethz.ch
@spcl_eth
2
row = 0;
output_image_ptr = output_image;
output_image_ptr += (NN * dead_rows);
for (r = 0; r < NN - KK + 1; r++) {
output_image_offset = output_image_ptr;
output_image_offset += dead_cols;
col = 0;
for (c = 0; c < NN - KK + 1; c++) {
input_image_ptr= input_image;
input_image_ptr+= (NN * row);
kernel_ptr = kernel;
S0: *output_image_offset = 0;
for (i = 0; i < KK; i++) {
input_image_offset = input_image_ptr;
input_image_offset += col;
kernel_offset = kernel_ptr;
for (j = 0; j < KK; j++) {
S1: temp1 = *input_image_offset++;
S1: temp2 = *kernel_offset++;
S1: *output_image_offset += temp1 * temp2;
}
kernel_ptr += KK;
input_image_ptr += NN;
}
S2: *output_image_offset = ((*output_image_offset)/
normal_factor);
output_image_offset++ ;
col++;
}
output_image_ptr += NN;
row++;
}
}
Fortran
C/C++CPU
CPU
CPU
CPU
CPU
CPU
CPU
CPU
Multi-Core CPU
GP
UG
PU
GP
UG
PU
GP
UG
PU
GP
UG
PU
GP
UG
PU
GP
UG
PU
GP
UG
PU
GP
UG
PU
GP
UG
PU
GP
UG
PU
GP
UG
PU
GP
UG
PU
GP
UG
PU
GP
UG
PU
Accelerator
Sequential Software Parallel Hardware
spcl.inf.ethz.ch
@spcl_eth
Non-Goal:
Algorithmic Changes
3
Design Goals
Automatic
“Regression Free” High Performance
spcl.inf.ethz.ch
@spcl_eth
Tool: Polyhedral Modeling
Iteration Space
0 1 2 3 4 5
j
i
5
4
3
2
1
0
N = 4
j ≤ i
i ≤ N = 4
0 ≤ j
0 ≤ i
D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }
(i, j) = (0,0)(1,0)(1,1)(2,0)(2,1)
Program Code
(2,2)(3,0)(3,1)(3,2)(3,3)(4,0)(4,1)(4,2)(4,3)(4,4)
for (i = 0; i <= N; i++)
for (j = 0; j <= i; j++)
S(i,j);
4
Polly -- Performing Polyhedral
Optimizations on a Low-Level
Intermediate Representation
Tobias Grosser et al,
Parallel Processing Letter, 2012
spcl.inf.ethz.ch
@spcl_eth
5
Mapping Computation to Device
0
1
2
0
1
2
0 1 2 3 0 1 2 3
0
0 1
1
Device Blocks & ThreadsIteration Space
𝐵𝐼𝐷 = { 𝑖, 𝑗 →𝑖
4% 2,
𝑗
3% 2 }
0
1
10
i
j
𝑇𝐼𝐷 = { 𝑖, 𝑗 → 𝑖 % 4, 𝑗 % 3 }
spcl.inf.ethz.ch
@spcl_eth
6
Memory Hierarchy of a Heterogeneous System
spcl.inf.ethz.ch
@spcl_eth
7
Host-device date transfers
spcl.inf.ethz.ch
@spcl_eth
8
Host-device date transfers
spcl.inf.ethz.ch
@spcl_eth
9
Mapping onto fast memory
spcl.inf.ethz.ch
@spcl_eth
10
Mapping onto fast memory
Polyhedral parallel code generation for CUDA, Verdoolaege, Sven et. al, ACM
Transactions on Architecture and Code Optimization, 2013
spcl.inf.ethz.ch
@spcl_eth
Profitability Heuristic
Trivial
Unsuitable
Insufficient Compute
static dynamic
Modeling Execution
All
Loop
Nests
GPU
spcl.inf.ethz.ch
@spcl_eth
12
From kernels to program – data transfers
void heat(int n, float A[n], float hot, float cold) {
float B[n] = {0};
initialize(n, A, cold);
setCenter(n, A, hot, n/4);
for (int t = 0; t < T; t++) {
average(n, A, B);
average(n, B, A);
printf("Iteration %d done", t);
}
}
spcl.inf.ethz.ch
@spcl_eth
13
Data Transfer – Per Kernel
Host Memory
initialize()
setCenter()
average()
average()
average()
D → 𝐻
D → 𝐻
𝐻 → 𝐷 𝐷 → 𝐻
tim
e
𝐻 → 𝐷 𝐷 → 𝐻
𝐻 → 𝐷 𝐷 → 𝐻
Device Memory
spcl.inf.ethz.ch
@spcl_eth
14
Data Transfer – Inter Kernel Caching
Host Memory
𝐷 → 𝐻
Host Memory
initialize()
setCenter()
average()
average()
average()
tim
e
𝐻 → 𝐷
Device Memory
spcl.inf.ethz.ch
@spcl_eth
15
EvaluationEvaluation
Workstation: 10 core SandyBridge NVIDIA Titan Black (Kepler)
Mobile: 4 core Haswell NVIDIA GT730M (Kepler)
spcl.inf.ethz.ch
@spcl_eth
1
10
100
1000
10000
SCoPs 0-dim 1-dim 2-dim 3-dim
No Heuristics Heuristics
16
LLVM Nightly Test Suite
# C
om
pute
Regio
ns /
Kern
els
spcl.inf.ethz.ch
@spcl_eth
17
Polybench 3.2
Baseline: icc –O3 (sequential), 10 core CPU + NVIDIA Titan Black (workstation)
spcl.inf.ethz.ch
@spcl_eth
00:00
01:12
02:24
03:36
04:48
06:00
07:12
08:24
Mobile Workstation
icc icc -openmp clang Polly ACC
18
Lattice Boltzmann (SPEC 2006)
spcl.inf.ethz.ch
@spcl_eth
19
Cactus ADM (SPEC 2006)
Work
sta
tion
Mobile
spcl.inf.ethz.ch
@spcl_eth
20
Cactus ADM (SPEC 2006) - Data Transfer
Work
sta
tion
Mobile
spcl.inf.ethz.ch
@spcl_eth
Polly-ACC
21
Automatic
“Regression Free” High Performance
http://spcl.inf.ethz.ch/Polly-ACC