Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC:...

21
spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler 1 International Conference of Supercomputing (ICS) June 1 2016, Istanbul

Transcript of Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC:...

Page 1: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

Polly-ACC: Transparent Compilation to Heterogeneous Hardware

Tobias Grosser, Torsten Hoefler

1

International Conference of Supercomputing (ICS) June 1 2016, Istanbul

Page 2: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

2

row = 0;

output_image_ptr = output_image;

output_image_ptr += (NN * dead_rows);

for (r = 0; r < NN - KK + 1; r++) {

output_image_offset = output_image_ptr;

output_image_offset += dead_cols;

col = 0;

for (c = 0; c < NN - KK + 1; c++) {

input_image_ptr= input_image;

input_image_ptr+= (NN * row);

kernel_ptr = kernel;

S0: *output_image_offset = 0;

for (i = 0; i < KK; i++) {

input_image_offset = input_image_ptr;

input_image_offset += col;

kernel_offset = kernel_ptr;

for (j = 0; j < KK; j++) {

S1: temp1 = *input_image_offset++;

S1: temp2 = *kernel_offset++;

S1: *output_image_offset += temp1 * temp2;

}

kernel_ptr += KK;

input_image_ptr += NN;

}

S2: *output_image_offset = ((*output_image_offset)/

normal_factor);

output_image_offset++ ;

col++;

}

output_image_ptr += NN;

row++;

}

}

Fortran

C/C++CPU

CPU

CPU

CPU

CPU

CPU

CPU

CPU

Multi-Core CPU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

Accelerator

Sequential Software Parallel Hardware

Page 3: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

Non-Goal:

Algorithmic Changes

3

Design Goals

Automatic

“Regression Free” High Performance

Page 4: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

Tool: Polyhedral Modeling

Iteration Space

0 1 2 3 4 5

j

i

5

4

3

2

1

0

N = 4

j ≤ i

i ≤ N = 4

0 ≤ j

0 ≤ i

D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }

(i, j) = (0,0)(1,0)(1,1)(2,0)(2,1)

Program Code

(2,2)(3,0)(3,1)(3,2)(3,3)(4,0)(4,1)(4,2)(4,3)(4,4)

for (i = 0; i <= N; i++)

for (j = 0; j <= i; j++)

S(i,j);

4

Polly -- Performing Polyhedral

Optimizations on a Low-Level

Intermediate Representation

Tobias Grosser et al,

Parallel Processing Letter, 2012

Page 5: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

5

Mapping Computation to Device

0

1

2

0

1

2

0 1 2 3 0 1 2 3

0

0 1

1

Device Blocks & ThreadsIteration Space

𝐵𝐼𝐷 = { 𝑖, 𝑗 →𝑖

4% 2,

𝑗

3% 2 }

0

1

10

i

j

𝑇𝐼𝐷 = { 𝑖, 𝑗 → 𝑖 % 4, 𝑗 % 3 }

Page 6: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

6

Memory Hierarchy of a Heterogeneous System

Page 7: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

7

Host-device date transfers

Page 8: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

8

Host-device date transfers

Page 9: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

9

Mapping onto fast memory

Page 10: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

10

Mapping onto fast memory

Polyhedral parallel code generation for CUDA, Verdoolaege, Sven et. al, ACM

Transactions on Architecture and Code Optimization, 2013

Page 11: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

Profitability Heuristic

Trivial

Unsuitable

Insufficient Compute

static dynamic

Modeling Execution

All

Loop

Nests

GPU

Page 12: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

12

From kernels to program – data transfers

void heat(int n, float A[n], float hot, float cold) {

float B[n] = {0};

initialize(n, A, cold);

setCenter(n, A, hot, n/4);

for (int t = 0; t < T; t++) {

average(n, A, B);

average(n, B, A);

printf("Iteration %d done", t);

}

}

Page 13: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

13

Data Transfer – Per Kernel

Host Memory

initialize()

setCenter()

average()

average()

average()

D → 𝐻

D → 𝐻

𝐻 → 𝐷 𝐷 → 𝐻

tim

e

𝐻 → 𝐷 𝐷 → 𝐻

𝐻 → 𝐷 𝐷 → 𝐻

Device Memory

Page 14: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

14

Data Transfer – Inter Kernel Caching

Host Memory

𝐷 → 𝐻

Host Memory

initialize()

setCenter()

average()

average()

average()

tim

e

𝐻 → 𝐷

Device Memory

Page 15: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

15

EvaluationEvaluation

Workstation: 10 core SandyBridge NVIDIA Titan Black (Kepler)

Mobile: 4 core Haswell NVIDIA GT730M (Kepler)

Page 16: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

1

10

100

1000

10000

SCoPs 0-dim 1-dim 2-dim 3-dim

No Heuristics Heuristics

16

LLVM Nightly Test Suite

# C

om

pute

Regio

ns /

Kern

els

Page 17: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

17

Polybench 3.2

Baseline: icc –O3 (sequential), 10 core CPU + NVIDIA Titan Black (workstation)

Page 18: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

00:00

01:12

02:24

03:36

04:48

06:00

07:12

08:24

Mobile Workstation

icc icc -openmp clang Polly ACC

18

Lattice Boltzmann (SPEC 2006)

Page 19: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

19

Cactus ADM (SPEC 2006)

Work

sta

tion

Mobile

Page 20: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

20

Cactus ADM (SPEC 2006) - Data Transfer

Work

sta

tion

Mobile

Page 21: Polly-ACC: Transparent Compilation to Heterogeneous ... · spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler

spcl.inf.ethz.ch

@spcl_eth

Polly-ACC

21

Automatic

“Regression Free” High Performance

http://spcl.inf.ethz.ch/Polly-ACC