
Automatic Data Placement Into GPU On-Chip Memory Resources

Chao Li (North Carolina State University), Yi Yang (NEC Labs America, www.nec-labs.com), Zhen Lin (North Carolina State University), Huiyang Zhou (North Carolina State University)

GPUs rely on thread-level parallelism (TLP) to hide off-chip memory latency, but judicious utilization of on-chip memory resources remains critical to performance. Off-chip memory bandwidth is still a bottleneck, e.g., for big-data applications and deep learning on GPUs.

Two key challenges: explicitly managing the intricate on-chip resources, and performance portability across different GPU generations.

Our solution: automatic data placement into GPU on-chip memory resources. It is compiler-driven, focuses on programs that have already been reasonably optimized, and revises data placement to achieve both performance enhancement and performance portability.

Introduction


There are three types of on-chip memory resources: registers, shared memory, and L1 D-caches.

Different capacity and allocation restrictions: large register file, small caches; 64 registers per thread, 48KB of shared memory per thread block (TB), no allocation limit on the D-cache.

Different accessibility: register file: within a thread (within a warp on Kepler); shared memory: within a thread block; D-cache: shared by the TBs on the same SM.

Different performance characteristics: register file: highest bandwidth; shared memory: high bandwidth with fixed access latency; D-cache: high bandwidth with variable access latency.
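To make the three on-chip options concrete, the following minimal CUDA sketch (a hypothetical kernel, not one of the paper's benchmarks) declares data that ends up in each resource: a scalar local variable kept in a register, a __shared__ array placed in shared memory, and global-memory accesses that may be serviced by the L1 D-cache.

__global__ void placement_demo(const float *g_in, float *g_out, int n)
{
    // Assumes the kernel is launched with blockDim.x == 256.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ float tile[256];          // shared memory: visible to the whole thread block
    float acc = 0.0f;                    // register: private to this thread

    tile[threadIdx.x] = (tid < n) ? g_in[tid] : 0.0f;  // global load, may be cached in L1
    __syncthreads();                     // make the tile visible to the whole block

    acc = 2.0f * tile[threadIdx.x];      // shared-memory read feeding a register operand
    if (tid < n)
        g_out[tid] = acc;                // global store
}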

Explicit Resource Management


GPUs evolve at a very fast pace, and computation throughput has grown faster than off-chip bandwidth: the GFLOPS/GB ratio rose from roughly 10X to 15X (GTX 8800 -> GTX 680). The register file and D-cache/shared memory sizes have also been changing across generations.

                                G80 (GTX 8800)  GT200 (GTX 280)  FERMI (GTX 480)  KEPLER (GTX 680)  KEPLER (K20c)
Arithmetic throughput (GFLOPS)  504             933              1345             3090              3950
Memory bandwidth (GB/s)         57              141              177              192               250
Shared memory size (KB)         16              16               48               48                48
Register file size (KB)         32              64               128              256               256


Performance Portability

Our Solution

A compiler algorithm for automatic data placement:
Analyze possible data placement patterns;
Construct compiler algorithms to use the profitable patterns.

Data (re)placement: moving data from one on-chip resource to another to achieve optimal resource utilization.

Data (re)placement patterns (see the figure summary below):
Direction 6 is compiler determined, e.g., A[B[0]], B[12];
Directions 4 and 5 are covered by previous work on specific optimizations that requires significant code changes, and the trend of GPU evolution is toward larger register files;
Directions 1, 2, and 3 are our focus.

[Figure: Data (re)Placement. Six numbered directions of data movement among three on-chip locations: register variables, shared memory variables, and local/global variables in L1 D-caches.]

Pattern 1: Shared Memory to Registers

Three reasons:
Shared memory usage may limit the TLP;
Shared memory has longer access latency and lower bandwidth than registers;
Accessing shared memory incurs instruction overhead for address computation.

Promotion strategy when multiple shared memory variables are promotable: reference-count-based priority.

Baseline:

__global__ void dynproc_kernel(…){
  __shared__ float prev[256];
  __shared__ float result[256];
  int tx = threadIdx.x;
  for (int i = 0; i < iteration; i++) {
    …
    shortest = minum(prev[tx-1], prev[tx], prev[tx+1]);
    result[tx] = shortest + gpuWall[index];
    __syncthreads();
    prev[tx] = result[tx];
    __syncthreads();
  }
  gpuResults[xidx] = result[tx];
}

Optimized code:

__global__ void dynproc_kernel(…){
  __shared__ float prev[256];
  float result;                 // result[] promoted from shared memory to a register
  int tx = threadIdx.x;
  for (int i = 0; i < iteration; i++) {
    …
    shortest = minum(prev[tx-1], prev[tx], prev[tx+1]);
    result = shortest + gpuWall[index];
    __syncthreads();
    prev[tx] = result;
    __syncthreads();
  }
  gpuResults[xidx] = result;
}

Pattern 2: Shared Memory to L1 D-caches

Three reasons:
Shared memory usage may limit the TLP, but the variable cannot be promoted to registers;
Local/global memory implicitly utilizes the L1 D-cache to achieve high performance;
Communication among threads can also be ensured through global memory variables.

To balance the tradeoff between TLP and memory pressure, auto-tuning is employed to determine:
Which variables should be promoted;
Whether to promote them into global or local memory.

Baseline:

__global__ void generateTriangles(…) {
  __shared__ float3 vertlist[12*NTHREADS];   // 12*32
  __shared__ float3 normlist[12*NTHREADS];
  // defines (writes) to the shared memory arrays
  vertexInterp2(…, vertlist[threadIdx.x], normlist[threadIdx.x]);
  vertexInterp2(…, vertlist[threadIdx.x+NTHREADS], normlist[threadIdx.x+NTHREADS]);
  …
  edge = tex1Dfetch(triTex, …);
  // uses of the shared memory arrays
  pos[index] = make_float4(vertlist[(edge*NTHREADS)+threadIdx.x], 1.0f);
}

Optimized code:

__global__ void generateTriangles(…) {
  float3 vertlist[12];                       // promoted to local memory, cached in L1
  float3 normlist[12];
  // defines (writes) to the local memory arrays
  vertexInterp2(…, vertlist[0], normlist[0]);
  vertexInterp2(…, vertlist[1], normlist[1]);
  …
  edge = tex1Dfetch(triTex, …);
  // uses of the local memory array
  pos[index] = make_float4(vertlist[edge], 1.0f);
  …
}

Pattern 3: Shared Memory/D-cache to Registers to Achieve Register Tiling

Two reasons:
A common side effect of SPMD code is redundant computations and memory accesses;
Redundant shared/global memory usage can be converted into register usage.

Three ways of saving bandwidth:
Implicitly through the L1 D-cache: accesses may hit in the cache, but the data may be evicted by other accesses;
Shared memory: select only one warp for the loading task, at the cost of additional control flow and __syncthreads();
Register file: registers are not shared among warps, so warps of threads must be compacted first. We introduce a compaction factor (C_Factor) to find the best register tiling.

Baseline:

__global__ void srad_kernel(int *c_cuda, …) {
  int index_s = cols * BLOCK_SIZE * by + BLOCK_SIZE * bx
              + cols * BLOCK_SIZE + tx;          // BLOCK_SIZE = 16
  __shared__ float south_c[BLOCK_SIZE][BLOCK_SIZE];
  …
  south_c[ty][tx] = c_cuda[index_s];
  if (by == gridDim.y - 1) south_c[ty][tx] = …
  __syncthreads();
  …
}

Optimized code:

__global__ void srad_kernel(int *c_cuda, …) {
  int index_s = cols * BLOCK_SIZE * by + BLOCK_SIZE * bx
              + cols * BLOCK_SIZE + tx;          // BLOCK_SIZE = 16
  __shared__ float south_c[BLOCK_SIZE][BLOCK_SIZE];
  …
  int tmp_1 = c_cuda[index_s];                   // load once into a register
  #pragma unroll
  for (int m = 0; m < C_Factor; m++)
    south_c[ty + m*blockDim.y/C_Factor][tx] = tmp_1;
  if (by == gridDim.y - 1) south_c[ty][tx] = …
  __syncthreads();
  …
}

Our Solution

Analyze possible data placement patterns;
Compiler algorithms to utilize the profitable patterns.

Compiler Algorithms

Compiler pass 1 implements patterns 1 & 2; compiler pass 2 implements pattern 3. Each pass consists of three stages:
Identification stage: scan the kernel and generate a list of candidate variables by collecting the architecture features and analyzing memory access behavior;
Processing stage: implement the placement patterns;
Auto-tuning stage: construct the search space, decide which variables are processed, and achieve the best code generation.

Compiler Pass 1 (flow): input kernel -> identify and collect shared memory variables -> analyze memory access behavior, recording memory reference counts and allocation size: (a) is the access across threads? (b) is the access index decided at runtime? -> build the candidate variable list. Then, on each candidate variable in priority order: if !(a) && !(b), promote to the register file; if !(a) && (b), promote to local memory; if (a), promote to global memory. Generate a new kernel and auto-tune for the optimal kernel.

Compiler Pass 2 (flow): input kernel -> identify and analyze the access behavior of global and shared memory variables -> check for redundancy along the x or y dimension and generate the redundancy type -> collect the expressions whose indices exhibit redundancy and dump out the expression list. Then, for each C_Factor: adjust the thread block dimension and construct an unrollable loop for thread compaction (coarsening/merge) so that the expressions in the list are performed only once (i.e., no redundancy). Generate a new kernel and auto-tune for the optimal kernel.

Auto-Tuning

Auto-tuning steps: construct a search space based on tunable parameters; measure the execution time of each variant; select the best performing code variant for the target architecture.

Three search spaces are constructed for data placement: how many and which shared memory variables should be promoted into the register file; which shared memory variables should be promoted into local/global memory; and the compaction factor.

Search space pruning strategies: memory reference-count-based priority; allocation size-based priority; and limiting the compaction factor to powers of 2.
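As a rough illustration of the auto-tuning stage, the sketch below builds a pruned search space (compaction factors limited to powers of 2, shared memory variables taken in reference-count order) and times each candidate with CUDA events. The Variant struct, the launch_variant helper, and the exhaustive sweep are assumptions for illustration only; the compiler passes shown in the backup slides instead use a greedy search that stops when a new kernel no longer improves.

#include <cuda_runtime.h>
#include <vector>
#include <limits>

// Hypothetical descriptor of one point in the pruned search space:
// how many of the highest-reference-count shared memory arrays to promote,
// and which compaction factor (power of 2) to use.
struct Variant { int num_promoted; int c_factor; };

// Assumed helper standing in for the code generated by the compiler passes;
// in practice it would launch the kernel variant corresponding to v.
static void launch_variant(const Variant &v) { (void)v; /* launch generated kernel for v */ }

// Time each candidate with CUDA events and return the fastest one.
Variant autotune(int max_promotable)
{
    std::vector<Variant> candidates;
    for (int n = 0; n <= max_promotable; n++)   // reference-count-ordered prefix of candidates
        for (int c = 2; c <= 16; c *= 2)        // compaction factor restricted to powers of 2
            candidates.push_back({n, c});

    Variant best{};
    float best_ms = std::numeric_limits<float>::max();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (const Variant &v : candidates) {
        cudaEventRecord(start);
        launch_variant(v);                      // run the generated kernel variant
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_ms) { best_ms = ms; best = v; }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return best;                                // best-performing data placement choice
}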

Preprocessor

Memory access index regulation: the index must be an affine function of the thread index; the scaling factor can be a macro/constant variable, a kernel launch parameter, or a run-time parameter.

Dynamic loop bounds: let the user provide the information through profiling, or use a simple heuristic (a default loop count of 4).

Collect data structure declarations and annotate data types: int2/float4 vector types are processed the same way as int/float; user-defined struct types are identified separately.
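As an assumed example of the index form the preprocessor accepts, the hypothetical kernel below uses an access index that is an affine function of the thread indices, with scaling factors taken from a macro constant and a run-time kernel parameter.

#define TILE 16                      // compile-time (macro) scaling factor

// Hypothetical kernel used only to illustrate affine index regulation.
__global__ void affine_index_demo(const float *in, float *out,
                                  int cols /* run-time scaling factor */)
{
    int tx = threadIdx.x, ty = threadIdx.y;
    int bx = blockIdx.x,  by = blockIdx.y;

    // Affine in (tx, ty): index = cols*(TILE*by + ty) + TILE*bx + tx.
    // The coefficients come from a run-time parameter (cols) and a macro (TILE).
    int index = cols * (TILE * by + ty) + TILE * bx + tx;

    out[index] = in[index];

    // A data-dependent index such as in[B[tx]] is not an affine function of the
    // thread index and is not regulated this way (cf. the compiler-determined
    // direction 6 in the data placement figure).
}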

Experimental Methodology

Implementation: implemented in Cetus, a source-to-source compiler framework, with basic CUDA syntax support from MCUDA.

Evaluation environment: three GPU generations with all possible L1 D-cache and shared memory capacity configurations.

Parameter: GTX480 / GTX680 / K20c
<Shared memory size, L1 D-cache size>: <16KB,48KB>, <48KB,16KB> / <16KB,48KB>, <32KB,32KB>, <48KB,16KB> / <16KB,48KB>, <32KB,32KB>, <48KB,16KB>
Register file size: 128KB / 256KB / 256KB
Max number of threads per SM: 512 / 1024 / 1536
Max number of registers per thread: 64 / 64 / 256
Compaction factor: 2,4,8,16 / 2,4,8,16 / 2,4,8,16

Benchmarks

The shared memory allocation size is defined by the programmer; the initial register allocation is controlled statically by the compiler and architecture parameters.

Benchmark, Input, then (reg, smem) on GTX480 / GTX680 / K20C:
HotSpot (HS), height 2: (35, 3072) / (36, 3072) / (39, 3072)
Back Prop 1 (BP1), 65536 layer: (13, 1088) / (11, 1088) / (12, 1088)
Back Prop 2 (BP2), 65536 layer: (22, 0) / (20, 0) / (21, 0)
SRAD1 (SR1), 2048*2048: (20, 0) / (20, 0) / (26, 0)
SRAD2 (SR2), 2048*2048: (19, 0) / (20, 0) / (20, 0)
Matrix Multiply (MM), 2048*2048: (23, 8192) / (26, 8192) / (25, 8192)
Path Finder (PF), 409600 steps: (16, 2048) / (18, 2048) / (17, 2048)
N-Queue (NQU), N=8: (15, 15744) / (19, 15744) / (16, 15744)
Marching Cubes (MC), 32768 voxels: (63, 9216) / (63, 9216) / (76, 9216)
B+tree1 (BT1), qrSize=6000: (18, 0) / (19, 0) / (21, 0)
B+tree2 (BT2), qrSize=6000: (23, 0) / (28, 0) / (30, 0)
LU-Decompose (LUD), 2048.dat: (15, 2048) / (17, 2048) / (17, 2048)

Performance Gains from Automatic Data Placement

[Chart: speedup over the baseline for HS, BP1, BP2, SR1, SR2, MM, PF, NQU, MC, BT1, BT2, LUD, and the geometric mean (GMEAN) on GTX480, GTX680, and K20c.]

Measurement: the baseline is the best-performing original kernel, selected by trying all shared memory/L1 D-cache size configurations. For each device, our compiler generates the kernel with the optimal data placement choices.

Result: GTX480: up to 4.14X, average of 1.76X; GTX680: up to 3.30X, average of 1.61X; K20c: up to 2.44X, average of 1.48X.

Optimal Parameters (the number of shared memory arrays to be promoted, or the C_Factor) for Different GPUs

[Chart: optimal search-space parameter for each benchmark (HS, BP1, BP2, SR1, SR2, MM, PF, NQU, MC, BT1, BT2, LUD) on GTX480, GTX680, and K20c.]

Performance portability: our compiler intelligently generates the optimized kernel for the specific architecture; the different architectural features of these GPUs lead to different optimal parameters.

Auto-Tuning

Effective pruning: the search space has been reduced significantly, and the performance of the optimized kernel is not impacted.

Original vs. pruned search space size:
HS: 48 -> 8; BP1: 16 -> 3; BP2: 16 -> 4; SR1: 16 -> 5; SR2: 16 -> 5; MM: 32 -> 5; PF: 1 -> 1; NQU: 45 -> 12; MC: 9 -> 6; BT1: 3 -> 3; BT2: 3 -> 3; LUD: 16 -> 4.


Auto-tuning time (ms):
HS: 42.873; BP1: 11.361; BP2: 15.755; SR1: 24.133; SR2: 21.941; MM: 210.876; PF: 8.884; NQU: 8.124; MC: 23.986; BT1: 12.183; BT2: 14.343; LUD: 129.531.

The resulting auto-tuning time is small.

Conclusions

GPUs have been widely used for general-purpose computation, but achieving high performance is not easy; one reason is the intricate set of on-chip memory resources. In addition, code manually tuned for one device may not perform well on a new device.

We propose compiler-driven automatic data placement as our solution: our compiler algorithm refines GPU programs by altering data placement to achieve both performance enhancement and performance portability. We show that on different GPU devices, the kernels optimized with our compiler algorithm achieve significant performance improvements.

Backup

Effectiveness Breakdown

[Charts: performance speedup for HS, PF, NQU, and MC when promoting 1, 2, 3, or 4 shared memory arrays; and performance speedup for HS, BP1, BP2, SR1, SR2, MM, BT1, BT2, and LUD with C_Factor = 2, 4, 8, and 16.]

Impact of Input Sizes (Marching-Cube)

[Chart: speedup vs. input size (8k to 512k voxels) for Marching Cubes.]

Problem input size impact: the optimized code generation for on-chip data placement is generally input agnostic; larger inputs tend to show higher benefit.

Compiler Pass 1

Kernel shared_to_register_or_local_or_global(Kernel kernel) {
  Kernel best_kernel = kernel;
  float exe_time = eval(kernel); // collect the execution time of kernel
  /** Identification Stage **/
  List arrays;
  for (each shared memory array sma in kernel) {
    sma.is_overlap = false; sma.is_index = false;
    sma.access_count = 0; sma.size = allocation_size;
    for (each access acc of array sma) {
      sma.access_count += (acc in loop) ? loop_count : 1;
      if (acc is overlapped across threads)
        sma.is_overlap = true;
      else if (the address of acc is calculated at runtime)
        sma.is_index = true;
    }
    if (sma.access_count > 0) { arrays.add(sma); }
  } // end for
  // The Processing and Auto-tuning stages below repeat (while loop)
  // until a new kernel no longer improves over the best one.
  /** Processing Stage **/
  sma = the array with the largest access_count in arrays; pop it out;
  if (!sma.is_index and !sma.is_overlap)
    replace sma with the register file;
  else if (sma.is_index and !sma.is_overlap)
    replace sma with local memory;
  else
    replace sma with global memory;
  /** Auto-tuning Stage **/
  generate a new kernel nkernel;
  exe_time1 = eval(nkernel); // the execution time of nkernel
  if (exe_time1 < exe_time) { // the new kernel is better
    best_kernel = nkernel;
    exe_time = exe_time1;
  } else
    return best_kernel; // found the best kernel
  // end while
}

Compiler Pass 2

Kernel shared_to_register_or_local_or_global(Kernel kernel) {
  Kernel best_kernel = kernel;
  float exe_time = eval(kernel); // collect the execution time of kernel
  /** Identification Stage **/
  List exprs;
  bool is_redundant_1d = false, is_redundant_2d = false;
  for (each shared/global memory array sma in kernel) {
    for (each access acc of array sma in expression expr) {
      if (acc is independent of one thread dimension)
        { is_redundant_1d = true; exprs.add(expr); }
      if (is_redundant_1d && acc is independent of the other
          thread dimension in expression expr)
        { is_redundant_2d = true; exprs.add(expr); }
    }
  } // end for
  // The Processing and Auto-tuning stages below repeat (for loop)
  // over the candidate C_Factor values.
  /** Processing Stage **/
  adjust the thread block dimension;
  if (is_redundant_1d) {
    construct a one-level loop with loop bound C_Factor to perform
      the workload of the compacted threads;
    convert each expr in exprs from inter-thread memory usage into
      register array usage;
  } else if (is_redundant_2d) {
    construct a two-level loop with loop bounds C_Factor.x and C_Factor.y
      to perform the workload of the compacted threads;
    convert each expr in exprs from inter-thread memory usage into
      register array usage;
  }
  /** Auto-tuning Stage **/
  generate a new kernel nkernel;
  exe_time1 = eval(nkernel); // the execution time of nkernel
  if (exe_time1 < exe_time) { // the new kernel is better
    best_kernel = nkernel;
    exe_time = exe_time1;
  } else
    return best_kernel; // found the best kernel
  // end for
}

Our compiler algorithm focuses on code that has been reasonably optimized, either manually or automatically by some compiler tool: such code already employs classical loop optimizations such as tiling, and already allocates important data in shared memory, either for communication among threads or for data reuse.

The thread compaction used here can also be referred to as thread merge or coarsening. Compared to generic thread merge/coarsening/fusion, our approach specifically uses this technique for register tiling, i.e., to exploit register reuse for eliminating the redundant shared/global memory usage that exists in GPU programs. We further address how many threads should be compacted so as to maximize register tiling while restricting the impact of register pressure on TLP, and thereby determine the most profitable data placement version.
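To make thread compaction for register tiling concrete, here is a minimal, hypothetical CUDA sketch (not taken from the paper's benchmarks): with a compaction factor C_FACTOR, each remaining thread performs the work of C_FACTOR original threads, so a value that each of those threads would have loaded redundantly is instead loaded once into a register and reused.

#define C_FACTOR 4   // compaction factor (assumed to be a power of 2)

// Assumed launch configuration: one block per row, blockDim.x = ceil(row_len / C_FACTOR).
// Before compaction, every thread of a row's block would load row_scale[blockIdx.x]
// redundantly; after compaction, each thread loads it once into a register and reuses
// it for C_FACTOR output elements.
__global__ void scale_rows_compacted(const float *in, const float *row_scale,
                                     float *out, int row_len)
{
    int row = blockIdx.x;
    float s = row_scale[row];              // loaded once per thread, kept in a register

    int base = threadIdx.x * C_FACTOR;     // each thread covers C_FACTOR columns
    #pragma unroll
    for (int m = 0; m < C_FACTOR; m++) {
        int col = base + m;
        if (col < row_len)
            out[row * row_len + col] = s * in[row * row_len + col];
    }
}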