Page 1

Nessie: A NESL to CUDA Compiler

John Reppy

University of Chicago

October 2018

Page 2

GPU background

GPUs

- GPU architectures are optimized for arithmetically intensive computation.
- GPUs provide super-computer levels of parallelism at commodity prices.
- For example, the Tesla V100 provides 15.7 TFlops of peak single-precision performance and 7.8 TFlops of peak double-precision performance.

NVidia GPUs have two (or three) levels of parallelism (see the CUDA sketch below):
- A multicore processor that supports Single-Instruction Multiple-Thread (SIMT) parallelism.
- Multiple multicore processors on a single chip.
- Multiple GPGPU boards per system.
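As a concrete illustration, here is a minimal CUDA sketch (not from the slides; the kernel scale, its launch configuration, and scaleOnAllGPUs are hypothetical) showing where each level appears in source code:

    __global__ void scale (int n, float s, float *xs)
    {
        // Level 1: SIMT threads within a block, executed in 32-lane warps.
        // Level 2: many blocks spread across the chip's multiprocessors.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            xs[i] *= s;
    }

    void scaleOnAllGPUs (int nDevices, int n, float s, float **xs)
    {
        // Level 3: multiple GPU boards in one system.
        for (int d = 0; d < nDevices; d++) {
            cudaSetDevice (d);
            scale<<<(n + 255) / 256, 256>>> (n, s, xs[d]);  // xs[d] is device-resident on GPU d
        }
    }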

Page 3

GPU background

GPUs

For example, Nvidia's Kepler GK110 Streaming Multiprocessor (SMX):
- 192 single-precision cores
- 64 double-precision cores
- 32 load/store units
- 32 special function units
- 4 × 32-lane warps in parallel

Lots of parallel compute, but not very much memory

[Figure: GK110 SMX block diagram — instruction cache; warp schedulers with dispatch units; a 64K × 32-bit register file; interconnect network; 64 KB L1 cache/shared memory; 48 KB read-only data cache.]

Page 4

GPU background

GPUs (continued ...)

NVIDIA's Tesla K40 architecture has 15 GK110 SMXs (2880 CUDA cores).

[Figure: K40 chip diagram — 15 SMXs sharing a 1536 KB L2 cache, with memory controllers and the thread engine.]

Optimized for processing data in bulk!

Page 5

GPU background

GPU programming model

The design of GPU hardware is manifest in the widely used GPU programming languages (e.g., CUDA and OpenCL).

Thread hierarchy
- Threads (grouped into warps for SIMT execution)
- Blocks (mapped to the same SMX)
- Grid (multiple blocks running the same kernel)

Synchronization
- Block-level barriers
- Atomic memory operations
- Task synchronization

Explicit memory hierarchy (see the sketch below)
- Disjoint memory spaces
- Per-thread memory maps to registers
- Per-block shared memory
- Global memory
- Host memory
- Also texture and constant memory
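The explicit memory hierarchy is visible in even trivial CUDA host code; the following minimal sketch (hypothetical, not from the slides) moves data between the disjoint host and global (device) memory spaces:

    #include <cuda_runtime.h>

    void example ()
    {
        float hostA[1024] = {0};                    // host memory
        float *devA = nullptr;
        cudaMalloc ((void **)&devA, sizeof hostA);  // global (device) memory
        cudaMemcpy (devA, hostA, sizeof hostA, cudaMemcpyHostToDevice);  // explicit copy between spaces
        // ... kernel launches go here; per-block __shared__ arrays are declared
        // inside kernels, and per-thread locals are mapped to registers ...
        cudaFree (devA);
    }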

Page 6

GPU background

Programming becomes harder!

C code for dot product (map-reduce):

    float dotp (int n, const float *a, const float *b)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

Also need CPU-side code:

    cudaMalloc ((void **)&V1_D, N*sizeof(float));
    cudaMalloc ((void **)&V2_D, N*sizeof(float));
    cudaMalloc ((void **)&V3_D, blockPerGrid*sizeof(float));

    cudaMemcpy (V1_D, V1_H, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy (V2_D, V2_H, N*sizeof(float), cudaMemcpyHostToDevice);

    dotp<<<blockPerGrid, ThreadPerBlock>>> (N, V1_D, V2_D, V3_D);

    V3_H = new float[blockPerGrid];
    // copy back one partial sum per block (blockPerGrid elements, not N)
    cudaMemcpy (V3_H, V3_D, blockPerGrid*sizeof(float), cudaMemcpyDeviceToHost);

    float sum = 0;
    for (int i = 0; i < blockPerGrid; i++)
        sum += V3_H[i];

    delete[] V3_H;

CUDA device code for dot product:

    __global__ void dotp (int n, const float *a, const float *b, float *results)
    {
        __shared__ float cache[ThreadsPerBlock];
        float temp = 0.0f;
        unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x;  // advanced by the grid stride below
        const unsigned int idx = threadIdx.x;

        while (tid < n) {
            temp += a[tid] * b[tid];
            tid += blockDim.x * gridDim.x;
        }
        cache[idx] = temp;

        __syncthreads ();

        // tree reduction in shared memory (assumes blockDim.x is a power of two)
        int i = blockDim.x / 2;
        while (i != 0) {
            if (idx < i)
                cache[idx] += cache[idx + i];
            __syncthreads ();
            i /= 2;
        }
        if (idx == 0)
            results[blockIdx.x] = cache[0];
    }

Page 7

Nested Data Parallelism

NESL

- NESL is a first-order functional language for parallel programming over sequences, designed by Guy Blelloch [Blelloch '96].
- Provides a parallel for-each operation (with an optional filter):

      { x + y : x in xs; y in ys }
      { x / y : x in xs; y in ys | (y /= 0) }

- Provides other parallel operations on sequences, such as reductions, prefix-scans, and permutations:

      function dot (xs, ys) = sum ({ x * y : x in xs; y in ys })

- Supports Nested Data Parallelism (NDP): components of a parallel computation may themselves be parallel.

Page 8

Nested Data Parallelism

NDP example: sparse matrix times dense vector

    [ 1 0 4 0 0 ]   [ 1 ]
    [ 0 3 0 0 2 ]   [ 2 ]
    [ 0 0 0 5 0 ] × [ 3 ]
    [ 6 7 0 0 8 ]   [ 4 ]
    [ 0 0 9 0 0 ]   [ 5 ]

We want to avoid computing products where matrix entries are 0.

A sparse representation tracks the non-zero entries using a sequence of sequences of index-value pairs:

    [(0, 1), (2, 4)]
    [(1, 3), (4, 2)]
    [(3, 5)]
    [(0, 6), (1, 7), (4, 8)]
    [(2, 9)]

Page 9

Nested Data Parallelism

NDP example: sparse-matrix times vector

In NESL, this algorithm has a compact expression:

function svxv (sv, v) = sum ({ x * v[i] : (i, x) in sv })

function smxv (sm, v) = { svxv (sv, v) : sv in sm }

Notice that the smxv function is a map of map-reduce subcomputations; i.e., nested data parallelism.
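To make the data layout concrete, here is a sequential C++ sketch of the same two functions over the index-value representation (the type Row is ours; this is illustrative, not Nessie output):

    #include <utility>
    #include <vector>

    using Row = std::vector<std::pair<int,float>>;  // one sparse matrix row: index-value pairs

    float svxv (const Row &sv, const std::vector<float> &v)
    {
        float sum = 0.0f;
        for (auto [i, x] : sv)                      // sum { x * v[i] : (i, x) in sv }
            sum += x * v[i];
        return sum;
    }

    std::vector<float> smxv (const std::vector<Row> &sm, const std::vector<float> &v)
    {
        std::vector<float> res;
        for (const Row &sv : sm)                    // { svxv (sv, v) : sv in sm }
            res.push_back (svxv (sv, v));
        return res;
    }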

Page 10

Nested Data Parallelism

NDP example: sparse-matrix times vector

A naive parallel decomposition will be unbalanced because of irregularity in sub-problem sizes.

    [(0, 1), (2, 4)]   [(1, 3), (4, 2)]   [(3, 5)]   [(0, 6), (1, 7), (4, 8)]   [(2, 9)]

Flattened (structure-of-arrays) representation:

    indices:            0 2 | 1 4 | 3 | 0 1 4 | 2
    values:             1 4 | 3 2 | 5 | 6 7 8 | 9
    segment descriptor:  2     2    1     3     1

The flattening transformation converts NDP to flat DP (including the AoS-to-SoA conversion above); the sketch below spells out the flattened computation.
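A sequential C++ sketch of the flattened computation (a hypothetical helper, not Nessie output): with the index array, value array, and segment descriptor above, smxv becomes a segmented sum over flat vectors instead of a loop over nested sequences:

    #include <vector>

    std::vector<float> smxvFlat (
        const std::vector<int> &idx,     // 0 2  1 4  3  0 1 4  2
        const std::vector<float> &val,   // 1 4  3 2  5  6 7 8  9
        const std::vector<int> &segdes,  // 2    2    1  3      1
        const std::vector<float> &v)
    {
        std::vector<float> res;
        std::size_t k = 0;
        for (int len : segdes) {         // one segment per matrix row
            float sum = 0.0f;
            for (int j = 0; j < len; j++, k++)
                sum += val[k] * v[idx[k]];
            res.push_back (sum);
        }
        return res;
    }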

Page 11

Nested Data Parallelism

Flattening

Flattening (a.k.a. vectorization) is a global program transformation that converts irregular nested data parallel code into regular flat data parallel code.

- Lifts scalar operations to work on sequences of values (see the kernel sketch below)
- Flattens nested sequences, pairing them with segment descriptors
- Conditionals are encoded as data
- The residual program contains vector operations plus sequential control flow and recursion/iteration.
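For instance, once everything is flat, a lifted scalar operation such as +↑ is just an elementwise GPU kernel over the data vector, independent of the segment structure (an illustrative CUDA sketch, not Nessie's actual generated code):

    __global__ void addLifted (int n, const float *xs, const float *ys, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = xs[i] + ys[i];  // (+)↑ applied at every index
    }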

Page 12

Nested Data Parallelism

Flattening function calls

{ f (e) : x in xs }

⟹

if #xs = 0
then []
else let es = { e : x in xs }
     in f↑(es)

Page 13

Nested Data Parallelism

Lifting functions

If we have

function f (x) = e

then f↑ is defined to be

    function f↑ (xs) = { e : x in xs }

Page 14

Nested Data Parallelism

Flattening conditionals

{ if b then e1 else e2 : x in xs }

⟹

let fs = { b : x in xs }
let (xs1, xs2) = PARTITION(xs, fs)
let vs1 = { e1 : x in xs1 }
let vs2 = { e2 : x in xs2 }
in COMBINE(vs1, vs2, fs)
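A sequential C++ sketch of the two helpers (hypothetical implementations that match the rule above, not Nessie's runtime): PARTITION splits by the flag vector, and COMBINE merges the two results back in the original order:

    #include <utility>
    #include <vector>

    template <typename T>
    std::pair<std::vector<T>, std::vector<T>>
    partition (const std::vector<T> &xs, const std::vector<bool> &fs)
    {
        std::vector<T> yes, no;
        for (std::size_t i = 0; i < xs.size (); i++)
            (fs[i] ? yes : no).push_back (xs[i]);  // route each element by its flag
        return {yes, no};
    }

    template <typename T>
    std::vector<T> combine (const std::vector<T> &vs1, const std::vector<T> &vs2,
                            const std::vector<bool> &fs)
    {
        std::vector<T> res;
        std::size_t i1 = 0, i2 = 0;
        for (bool f : fs)                          // replay the flags to restore the order
            res.push_back (f ? vs1[i1++] : vs2[i2++]);
        return res;
    }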

Page 15

Nested Data Parallelism

Flattening example: factorial

function fact (n) = if (n <= 0) then 1 else n * fact (n - 1)

function fact↑ (ns) = { fact(n) : n in ns }

⟹

function fact↑ (ns) =
  let fs = (ns <=↑ dist(0, #ns));
      (ns1, ns2) = PARTITION(ns, fs);
      vs1 = dist(1, #ns1);
      vs2 = if (#ns2 = 0)
              then []
              else let
                es = (ns2 -↑ dist(1, #ns2));
                rs = fact↑ (es);
              in (ns2 *↑ rs);
  in COMBINE(vs1, vs2, fs)
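As a worked example (reading PARTITION as routing the elements whose flag is true into ns1, which COMBINE then restores in place): fact↑([2, 0]) computes fs = [false, true], splits ns into ns1 = [0] (the base case) and ns2 = [2] (the recursive case), takes vs1 = [1] and vs2 = ns2 *↑ fact↑([1]) = [2], and COMBINE([1], [2], fs) reassembles the answer [2, 1].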

Page 16

NESL on GPUs

NESL on GPUs

- NESL was designed for bulk-data processing on wide-vector machines (SIMD).
- Potentially a good fit for GPU computation.
- A first try [Bergstrom & Reppy '12] demonstrated the feasibility of NESL on GPUs, but was significantly slower than hand-tuned CUDA code for some benchmarks (worst case: over 50 times slower on Barnes-Hut [Burtscher & Pingali '11]).

Page 17

NESL on GPUs

Areas for improvement

We identified a number of areas for improvement:
- Better fusion:
  - Fuse generators, scans, and reductions with maps.
  - "Horizontal fusion" (fuse independent maps over the same index space).
- Better segment descriptor management.
- Better memory management.

It proved difficult/impossible to support these improvements in the context of the VCODE interpreter.

Page 18

Nessie

Nessie

A new NESL compiler built from scratch.
- The front-end produces a monomorphic, direct-style IR.
- Flattening eliminates NDP and produces Flan, a flat-vector language with VCODE-like operators.
- Shape analysis is used to tag vectors with size information (symbolic in some cases).
- Flan is converted to λcu, which is where fusion and other optimizations occur.

Page 19

Nessie

λcu — An IR for GPU programs

λcu is a three-level language:
- CPU expressions: a direct-style extended λ-calculus with kernel dispatch
- Kernels: sequences of second-order array combinators (SOACs)
- GPU anonymous functions: first-order functions that are the arguments to the SOACs.

Page 20

Nessie

λcu — An IR for GPU programs (continued ...)

CPU expressions

    prog ::= kern dcl
    dcl  ::= function f ( params ) blk dcl
           | let params = exp dcl
           | exp
    params ::= xi : τi

    blk  ::= { bind exp }
    bind ::= let params = exp

    exp  ::= blk
           | run K args
           | f args
           | if exp then blk else blk
           | exp ⊕ exp
           | ...

Kernel expressions

    kern ::= kernel K xs { bind return ys }
    bind ::= let xs = SOAC arg

    arg  ::= Λ
           | rop
           | shape

GPU expressions

    Λ    ::= { xs => exp using ys }
    exp  ::= if exp then exp else exp
           | let x = exp in exp
           | exp ⊕ exp
           | xs[i]
           | x
           | ...

Page 21

Nessie

Second-Order Array Combinators

Like Futhark [Henriksen et al. '14], Nova [Collins et al. '14], and other systems, we use Second-Order Array Combinators (SOACs) to represent the iteration structure of our operations on sequences.

    ONCE          : (unit ⇒ τ) → τ
    MAP           : (int ⇒ τ) → int → [τ]
    PERMUTE       : (int ⇒ τ) → (int ⇒ int) → int → [τ]
    REDUCE        : (int ⇒ τ) → rop_τ → int → τ
    SCAN          : (int ⇒ τ) → rop_τ → int → [τ]
    FILTER        : (int ⇒ τ) → (τ ⇒ bool) → int → [τ]
    PARTITION     : (int ⇒ τ) → (τ ⇒ bool) → int → [τ] × [τ]
    SEG_PERMUTE   : (int ⇒ τ) → (int ⇒ int) → sd → [τ]
    SEG_REDUCE    : (int ⇒ τ) → rop_τ → sd → [τ]
    SEG_SCAN      : (int ⇒ τ) → rop_τ → sd → [τ]
    SEG_FILTER    : (int ⇒ τ) → (τ ⇒ bool) → sd → [τ] × sd
    SEG_PARTITION : (int ⇒ τ) → (τ ⇒ bool) → sd → [τ] × sd × [τ] × sd

(Here [τ] is a sequence of τ, rop_τ is a reduction operator on τ, and sd is a segment descriptor.)

Page 22

Nessie

Second-Order Array Combinators (continued ...)

There is a key difference between our combinators and previous work: combinators use a pull thunk, which is parameterized by the array indices, to get their inputs.

Example: iota

{ i : i in [0 : n-1] }

⟹

kernel K (n : int) -> [int] {
  let res = MAP { i => i } n
  return res
}

Page 23

Nessie

Second-Order Array Combinators (continued ...)

There is a key difference between our combinators and previous work: combinators use a pull thunk, which is parameterized by the array indices, to get their inputs.

Example: the element-wise product of two sequences

{ x * y : x in xs; y in ys }

⟹

kernel K (xs : [float], ys : [float]) -> [float] {
  let res = MAP { i => xs[i] * ys[i] using xs, ys } (#xs)
  return res
}
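A plausible CUDA rendering of this kernel (an illustrative sketch only; Nessie's generated code may differ) shows the point of the pull thunk: the body i => xs[i] * ys[i] is simply inlined at each index of the MAP, so the inputs are never materialized separately:

    __global__ void K (int n, const float *xs, const float *ys, float *res)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            res[i] = xs[i] * ys[i];  // the pull thunk, applied at index i
    }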

Page 24

Nessie

Second-Order Array Combinators (continued ...)

There is a key difference between our combinators and previous work: combinators use a pull thunk, which is parameterized by the array indices, to get their inputs.

Example: summing a sequence

sum (xs)

⟹

kernel K (xs : [float]) -> float {
  let res = REDUCE { i => xs[i] using xs } (FADD) (#xs)
  return res
}

Page 25

Nessie

Nessie backend

    Flan --translate--> λcu --fusion--> λcu --memory analysis--> λcu --code generation--> CUDA
                          (the fusion step uses lp_solve)

- Designed to support better fusion, etc.
- The backend transforms flattened code to CUDA in several steps:
  - ILP-based fusion [Megiddo and Sarkar '99; Robinson et al. '14].
  - Memory analysis based on uniqueness types [de Vries et al. '07].
  - Explicit memory management added based on the analysis.

Page 26

Nessie

Simple map-reduce fusion

The λcu code for the dotp example is:

kernel prod (xs : [float], ys : [float]) -> [float] {
  let res = MAP { i => xs[i] * ys[i] using xs, ys } (#xs)
  return res
}
kernel sum (xs : [float]) -> float {
  let res = REDUCE { i => xs[i] using xs } (FADD) (#xs)
  return res
}

function dots (xs : [float], ys : [float]) -> float {
  let t1 : [float] = run prod (xs, ys)
  let t2 : float = run sum (t1)
  return t2
}

Page 27

Nessie

Simple map-reduce fusion

Step 1: Fuse the two kernels into a combined kernel.

Page 28

Nessie

Simple map-reduce fusion

Step 1: Fuse the two kernels into a combined kernel.

kernel F (xs : [float], ys : [float]) -> float {
  let ts = MAP { i => xs[i] * ys[i] using xs, ys } (#xs)
  let res = REDUCE { i => ts[i] using ts } (FADD) (#ts)
  return res
}

function dots (xs : [float], ys : [float]) -> float {
  let t2 : float = run F (xs, ys)
  return t2
}

Page 29

Nessie

Simple map-reduce fusion

Step 2: Fuse the MAP operation into the REDUCE’s pull operation

Page 30

Nessie

Simple map-reduce fusion

Step 2: Fuse the MAP operation into the REDUCE’s pull operation

kernel F (xs : [float], ys : [float]) -> float {
  let res = REDUCE { i => xs[i] * ys[i] using xs, ys } (FADD) (#xs)
  return res
}

function dots (xs : [float], ys : [float]) -> float {
  let t2 : float = run F (xs, ys)
  return t2
}
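After this step the intermediate vector ts never materializes in GPU memory: the product xs[i] * ys[i] is computed inside the reduction's pull thunk, which is essentially the structure of the hand-written CUDA dotp kernel shown earlier.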

Page 31

Nessie

Fancier fusion

Consider the following NESL function (adapted from [Robinson et al. '14]):

function norm2 (xs) : [float] -> ([float], [float]) =
  let sum1 = sum(xs);
      gts = { x : x in xs | (x > 0) };
      sum2 = sum(gts);
  in
    ({ x / sum1 : x in xs }, { x / sum2 : x in xs })

Page 32

Nessie

Fancier fusion (continued ...)

Translating to λcu produces the following code:

kernel K1 (xs : [float]) -> float {
  let res = REDUCE { i => xs[i] using xs } (FADD) (#xs)
  return res
}
kernel K2 (xs : [float]) -> [float] {
  let res = FILTER { i => xs[i] using xs } { x => x > 0 } (#xs)
  return res
}
kernel K3 (xs : [float], s : float) -> [float] {
  let res = MAP { i => xs[i] / s using xs } (#xs)
  return res
}

function norm2 (xs : [float]) -> ([float], [float]) {
  let sum1 : float = run K1 (xs)
  let gts : [float] = run K2 (xs)
  let sum2 : float = run K1 (gts)
  let res1 : [float] = run K3 (xs, sum1)
  let res2 : [float] = run K3 (xs, sum2)
  return (res1, res2)
}

Page 33

Nessie

Fancier fusion (continued ...)

[Figure: program dependence graph (PDG) with a control region — xs feeds run K1 (xs) (producing sum1), run K2 (xs) (producing gts), and the two MAP launches; run K1 (gts) produces sum2; run K3 (xs, sum1) and run K3 (xs, sum2) produce res1 and res2, which form the result (res1, res2).]

Page 34

Nessie

Fancier fusion (continued ...)

One possible schedule

[Figure: the PDG above with one grouping of the kernel launches into fused kernels.]

Page 35

Nessie

Fancier fusion (continued ...)

Another possible schedule

[Figure: the PDG above with a different grouping of the kernel launches into fused kernels.]

Page 36

Nessie

Fancier fusion (continued ...)

kernel F1 (xs : [float]) -> (float, float) {
  let (sum1, sum2) =
    REDUCE { i => let x = xs[i] in (x, if x > 0 then x else 0) using xs }
           (FADD, FADD)
           (#xs)
  return (sum1, sum2)
}
kernel F2 (xs : [float], sum1 : float, sum2 : float) -> ([float], [float]) {
  let (res1, res2) =
    MAP { i => let x = xs[i] in (x / sum1, x / sum2) using xs, sum1, sum2 }
        (#xs)
  return (res1, res2)
}

function norm2 (xs : [float]) -> ([float], [float]) {
  let (sum1 : float, sum2 : float) = run F1 (xs)
  let (res1 : [float], res2 : [float]) = run F2 (xs, sum1, sum2)
  return (res1, res2)
}

Using ILP produces the following schedule:

[Figure: the PDG with run K1 (xs), run K2 (xs), and run K1 (gts) fused into kernel KF1, and run K3 (xs, sum1) and run K3 (xs, sum2) fused into kernel KF2.]

Page 37

Nessie

Fancier fusion (continued ...)

Notice how we fused the FILTER into the REDUCE!

[Figure: the same ILP schedule as on the previous slide — KF1 covers the two sums, KF2 the two MAPs.]

Page 38

Streaming and piecewise execution

Streaming and piecewise execution

- λcu processes vectors as atomic objects, which can exceed the memory resources of a GPU.
- We could partition kernel execution into smaller pieces (either statically or dynamically) to improve scalability and enable multi-GPU parallelism (see the sketch below).
- Palmer et al. describe a post-flattening piecewise execution strategy, and there was some follow-on work by Pfannenstiel on scheduling piecewise execution for threaded execution.
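A minimal host-side sketch of the idea for a reduction (a hypothetical driver; sumOnDevice stands in for a dotp-style block reduction plus a final combine): stream fixed-size pieces of a too-large vector through a device buffer, carrying the partial result on the host:

    #include <cstddef>
    #include <cuda_runtime.h>

    float sumOnDevice (const float *devData, std::size_t m);  // assumed kernel wrapper

    float sumPiecewise (const float *hostData, std::size_t n,
                        std::size_t pieceSize, float *devBuf)
    {
        float total = 0.0f;
        for (std::size_t off = 0; off < n; off += pieceSize) {
            std::size_t m = (n - off < pieceSize) ? n - off : pieceSize;
            cudaMemcpy (devBuf, hostData + off, m * sizeof(float),
                        cudaMemcpyHostToDevice);
            total += sumOnDevice (devBuf, m);  // combine each piece's partial result
        }
        return total;
    }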

Page 39

Streaming and piecewise execution

Streaming and piecewise execution

- λcu processes vectors as atomic objects, which can exceed the memory resources of a GPU.
- We could partition kernel execution into smaller pieces (either statically or dynamically) to improve scalability and enable multi-GPU parallelism.
- Palmer et al. describe a post-flattening piecewise execution strategy, and there was some follow-on work by Pfannenstiel on scheduling piecewise execution for threaded execution.

[Figure: the sparse-matrix rows from the earlier example, contrasting flat execution (the whole flattened vector processed at once) with piecewise execution (the vector streamed through in fixed-size pieces).]

Page 40

Streaming and piecewise execution

Streaming and piecewise execution (continued ...)

- Connections to Keller and Chakravarty's Distributed Types and Palmer et al.'s piecewise execution of NDP programs.
- Not all operations can be executed in piecewise fashion (e.g., permutations).
- The execution model for Madsen and Filinski's Streaming NESL also requires piecewise execution of kernels.
