Making Hardware Accelerator Easier to Use

Invited talk at 14th Asian Symposium on Programming Languages and Systems (APLAS 2016)

Kazuaki Ishizaki

IBM Research Tokyo

Making Hardware Accelerator Easier to Use

1

Hanoi in 1996

My first visit to Hanoi

I joined our research project Java just-in-time compiler on 1996

I worked for Parallel Fortran compiler by 1995Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki2

Hanoi in 1996 and 2016

Drastically changed over twenty years

Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki3

1996 2016

Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki

What has Happened in Computation-Intensive Area for 20 Years

Program is becoming simpler

Hardware is becoming complicated1996 2016

Hardware Fast scalar processors Commodity processors with

hardware accelerators

Applications Weather, wind, fluid, and physics

simulations, chemical synthesis

Machine learning and

deep learning with big data

Program Complicated and

hardware-dependent code

Simple and clean code

(e.g. mapreduce)

Users Limited to programmers

who are well-educated for HPC

Data scientists

who are non-familiar with hardware

Hardware

Examples4

GPUPowerPC

What has Happened in Computation-Intensive Area for 20 Years

Program is becoming simpler

Hardware is becoming complicated

Making Hardware Accelerator Easier to Use5

1996 2016

Hardware Fast scalar processors Commodity processors with

hardware accelerators

Applications Weather, wind, fluid, and physics

simulations, chemical synthesis

Machine learning and

deep learning with big data

Program Complicated and

hardware- dependent code

Simple and clean code

(MapReduce)

Users Limited to programmers

who are well-educated for HPC

Data scientists

who are non-familiar with hardware

Hardware

Examples

Bad news:Gap between hardware and software is bigger

Good news:Program can be easily analyzed

My Recent Interest

How system generates hardware accelerator code from

program with high-level abstractionExpected (practical) result

People execute program without knowing usage of hardware accelerator

Challenge How to optimize code for a certain hardware accelerator without specific

information

On-going research GPU exploitation from Java program

GPU exploitation in Apache Spark

work with Akihiro Hayashi *, Alon Shalev Housfater -, Hiroshi Inoue +,Madhusudanan Kandasamy , Gita Koblents -, Moriyoshi Ohara +,

Vivek Sarkar *, and Jan Wroblewski (intern) +

+ IBM Research Tokyo, - IBM Canada, IBM India, * Rice University


GPU Exploitation from Java Program

Why Java for GPU Programming?High productivity

Safety and flexibility (compare to C/C++)

Good program portability among different machines write once, run anywhere

One of the most popular programming languages Hard to use CUDA and OpenCL for non-expert programmers

Many computation-intensive applications in non-HPC area Data analytics and data science (Hadoop, Spark, etc.)

Security analysis

Natural language processing

8 Making Hardware Accelerator Easier to Use / Kazuaki IshizakiFrom https://www.flickr.com/photos/dlato/5530553658

CUDA is a programming language for GPU offered by NVIDIA


How We Write GPU Program Five steps

1. Allocate GPU device memory

2. Copy data on CPU main memory

to GPU device memory

3. Launch a GPU kernel to be executed

in parallel on cores

4. Copy back data on GPU device

memory to CPU main memory

5. Free GPU device memory

device memory

(up to 16GB)main memory

(up to 1TB/socket)

CPU GPU

Data copy over

PCIe or NVLink

dozen cores/socket thousands cores

9


How We Optimize GPU Program

device memory

(up to 16GB)main memory

(up to 1TB/socket)

CPU GPUdozen cores/socket thousands cores

10

Exploit faster memory Read-only cache (Read only)

Shared memory (SMEM)

Data copy over

PCIe or NVLink

From GTC presentation by NVIDIA

Reduce data copy

Five steps 1. Allocate GPU device memory

2. Copy data on CPU main memory

to GPU device memory

3. Launch a GPU kernel to be executed

in parallel on cores

4. Copy back data on GPU device

memory to CPU main memory

5. Free GPU device memory

Fewer Code Makes GPU Programming Easy

Current programming model requires programmers to

explicitly write operations for managing device memories

copying data

between CPU and GPU

expressing parallelism

exploiting faster memory

Java 8 enables programmers

to just focus on expressing parallelism

11 Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki

void fooCUDA(N, float *A, float *B, int N) {int sizeN = N * sizeof(float);cudaMalloc(&d_A, sizeN); cudaMalloc(&d_B, sizeN);cudaMemcpy(d_A, A, sizeN, Host2Device);GPU(d_A, d_B, N);cudaMemcpy(B, d_B, sizeN, Device2Host);cudaFree(d_B); cudaFree(d_A);

}// code for GPU__global__ void GPU(float* d_A, float* d_B, int N) {

int i = threadIdx.x;if (N {

B[i] = A[i] * 2.0;});

}

GoalBuild a Java just-in-time (JIT) compiler to generate high

performance GPU code from a parallel loop construct

Implementing four performance optimizations

Offering performance evaluations on POWER8 with a GPU

Supporting Java language feature (See [PACT2015])

Predicting Performance on CPU and GPU [PPPJ2015]

Available in IBM Java 8 ppc64le and x86_64 https://www.ibm.com/developerworks/java/jdk/java8/


Accomplishments

Parallel Programming in Java 8 Express parallelism by using Parallel Stream API among

iterations of a lambda expression (index variable: i)


class Example {void foo(float[] a, float[] b, float[] c, int n) {java.util.Stream.IntStream.range(0, n).parallel().forEach(i -> {b[i] = a[i] * 2.0;c[i] = a[i] * 3.0;

});}}

Note: Current version supports one-dimensional arrays with primitive types in a lambda expression

Overview of Our JIT Compiler Java bytecode

sequence is divided

into two intermediate

presentation (IR) parts Lambda expression:

generate GPU code

using NVIDIA tool chain

(right hand side)

Others:

generate CPU code

using conventional JIT

compiler (left hand side)


NVIDIA GPU binaryfor lambda expression

CPU binary for- managing device memory- copying data- launching GPU binary

Conventional Java JIT compiler

Parallel stream APIs detection

// Parallel stream codeIntStream.range(0, n).parallel()

.forEach(i -> { ...c[i] = a[i]...});

IR for GPUs...

c[i] = a[i]...

IR for CPUs

Java bytecode

CPU native code generator GPU native code

generator (by NVIDIA)

Additional modules for GPU

GPUs optimizations

Optimizations for GPU in Our JIT CompilerOptimizing alignment of Java arrays on GPUs

Reduce # of memory transactions to a GPU global memory

Using read-only cache Reduce # of memory transactions to a GPU global memory

Optimizing data copy between CPU and GPU Reduce amount of data copy

Eliminating redundant exception checks Reduce # of instructions in GPU binary


Reducing # of Memory Transactions to GPU Global Memory

Aligning the starting address of an array body in GPU global

memory with memory transaction boundary


0 128

a[0]-a[31]

Object header

Memory address

a[32]-a[63]Naivealignmentstrategy

a[0]-a[31] a[32]-a[63]

256 384

Ouralignmentstrategy

One memory transaction for a[0:31]

Two memory transactions for a[0:31]

IntStream.range(0,n).parallel().forEach(i->{

...= a[i]...; // a[] : float

...;});

a[64]-a[95]

a[64]-a[95]

A 128-byte memorytransaction boundary

Reducing # of Memory Transactions to GPU Global Memory

Must keep only a read-only array in a read-only cache Lexically different variables (e.g. a[] and b[]) may point to the same array

that may be updated

Perform alias analysis to identify a read-only array by Static analysis in JIT compiler identifies lexically read-only arrays and lexically written arrays

Dynamic alias analysis in generated code checks a lexically read-only array that may alias with any lexically written arrays

executes code with a read-only cache if not aliased

17 Compiling and Optimizing Java 8 Programs for GPU Execution

if (!(a[] aliases with b[]) && !(a[] aliases with c[])) {IntStream.range(0, n).parallel().forEach( i -> { b[i] = ROa[i] * 2.0; // use RO cache for a[]c[i] = ROa[i] * 3.0; // use RO cache for a[]

});} else {

// execute code w/o a read-only cache}

IntStream.range(0,n).parallel().forEach(i->{b[i] = a[i] * 2.0;c[i] = a[i] * 3.0;

});

Reducing Amount of Data Copy between CPU and GPU

Eliminate data copy from GPU if an array (e.g. a[]) is not

updated in GPU binary [Jablin11][Pai12]

Copy only a read or write set if an array index form is

i + constant (the set is contiguous)


sz = (n 0) * sizeof(float)cudaMemCopy(&a[0], d_a, sz, H2D); // copy only a read setcudaMemCopy(&b[0], d_b, sz, H2D);cudaMemCopy(&c[0], d_c, sz, H2D);IntStream.range(0, n).parallel().forEach( i -> {

b[i] = a[i]...;c[i] = a[i]...;

});cudaMemcpy(a, d_a, sz, D2H);cudaMemcpy(&b[0], d_b, sz, D2H); // copy only a write setcudaMemcpy(&c[0], c_b, sz, D2H); // copy only a write set

Eliminating Redundant Exception ChecksGenerate GPU code without exception checks by using

loop versioning [Artigas00] that guarantees safe region by using pre-

condition checks on CPU


if (// check cond. for NullPointerExceptiona != null && b != null && c != null &&// check cond. for ArrayIndexOutOfBoundsExceptiona.length

Automatically Optimized for CPU and GPUCPU code

handles GPU device memory management and data copying

checks whether optimized CPU and GPU code can be executed

GPU code

is optimized Using

read-only

cache

Eliminating

exception

checks


if (a != null && b != null && c != null &&a.length < n && b.length < n && c.length < n &&!(a[] aliases with b[]) && !(a[] aliases with c[])) {

cudaMalloc(d_a, a.length*sizeof(float)+128);if (b!=a) cudaMalloc(d_b, b.length*sizeof(float)+128);if (c!=a && c!=b) cudaMalloc(d_c, c.length*sizeof(float)+128);

int sz = (n 0) * sizeof(float), szh = sz + Jhdrsz;cudaMemCopy(a, d_a + align - Jhdrsz, szh, H2D);

GPU(d_a, d_b, d_c, n) // launch GPU

cudaMemcpy(b + Jhdrsz, d_b + align, sz, D2H);cudaMemcpy(c + Jhdrsz, d_c + align, sz, D2H);cudaFree(d_a); if (b!=a) cudaFree(d_b); if (c=!a && c!=b) cudaFree(d_c);

} else {// execute CPU binary

}

__global__ void GPU(float *a,float *b, float *c, int n) {// no exception checksi = ...b[i] = ROa[i] * 2.0;c[i] = ROa[i] * 3.0;

}

CPU

GPU

Benchmark ProgramsPrepare sequential and parallel stream API versions in Java


Name Summary Data size Type

Blackscholes Financial application that calculates the price of put and call

options

4,194,304 virtual

options

double

MM A standard dense matrix multiplication: C = A.B 1,024 x 1,024 double

Crypt Cryptographic application [Java Grande Benchmarks] N = 50,000,000 byte

Series the first N fourier coefficients of the function [Java Grande

Benchamark]

N = 1,000,000 double

SpMM Sparse matrix multiplication [Java Grande Benchmarks] N = 500,000 double

MRIQ 3D image benchmark for MRI [Parboil benchmarks] 64x64x64 float

Gemm Matrix multiplication: C = .A.B + .C [PolyBench] 1,024 x 1,024 int

Gesummv Scalar, vector, and Matrix multiplication [PolyBench] 4,096 x 4,096 int

Performance Improvements of GPU Version Over Sequential and Parallel CPU Versions

Achieve 127.9x on geomean and 2067.7x for Series over 1 CPU thread

Achieve 3.3x on geomean and 32.8x for Series over 160 CPU threads

Degrade performance for SpMM and Gesummv against 160 CPU threads


Two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256GB memorywith one NVIDIA Kepler K40m GPU at 876 MHz with 12-GB global memory (ECC off)Ubuntu 14.10, CUDA 5.5Modified IBM Java 8 runtime for PowerPC

0.85

0.45

1.51

0.920.74

0.11

1.19

3.47

0

0.5

1

1.5

2

2.5

3

3.5

4

BlackScholes MM Crypt Series SpMM MRIQ Gemm Gesummv

Sp

eed

up

rela

tive

to

CU

DA

Performance Comparison with Hand-Coded CUDA Achieve 0.83x on geomean over CUDA

Crypt, Gemm, and Gesummv: usage of a read-only cache

BlackScholes: usage of larger CUDA threads per block (1024 vs. 128)

SpMM: overhead of exception checks

MRIQ: miss of -use-fast-math compile option

MM: lack of usage of shared memory with loop tiling


Higher is better

GPU Version is slower Than Parallel CPU Version


Can we choose an appropriate device (CPU or GPU) to avoid

performance degradation? Want to make sure to achieve equal or better performance

Machine-learning-based Performance HeuristicsConstruct a binary prediction model offline by supervised

machine learning with support vector machines (SVMs) Features Loop range

Dynamic number of instructions (memory access, arithmetic operation, )

Dynamic number of array accesses (a[i], a[i + c], a[c * i], a[idx[i]])

Data transfer size (CPU to GPU, GPU to CPU)


data1Bytecode

App A feature 1Features

extraction

LIBSVM JavaRuntime

PredictionModel

data2Bytecode

App A feature 2Features

extraction

data3Bytecode

App B feature 3Features

extraction

CPU GPU

Most Predictions are CorrectUse 291 cases to build model

Succeeded in predicting cases of performance degradations on GPU

Failed to predict BlackScholes


Prediction

1.8->1.0 0.8->1.0 0.4->1.0

Related WorkOur research enables memory and communication

optimizations with machine-learning-based device selection


Work Language Exception

support

JIT

compiler

How to write GPU kernel Data copy

optimization

GPU memory

optimization

Device selection

JCUDA Java CUDA Manual Manual GPU only

JaBEE Java Override run method GPU only

Aparapi Java Override run

method/Lambda Static

Hadoop-CL Java Override map/reduce

method Static

Rootbeer Java Override run method Not described Not described

[PPPJ09] Java Java for-loop Not described Dynamic with

regression

HJ-OpenCLHabanero-

Java Forall constructs Static

Our work Java Standard parallel

stream API

ROCache /

alignment

Dynamic with

machine learning

Future work Exploiting shared memory

Like private memory shared by 64 - 192 cores

Supporting additional Java operations


GPU Exploitation in Apache Spark

What is Apache Spark? Framework that processes distributed computing by transforming

distributed immutable memory structure using set of parallel operations e.g. map(), filter(), reduce(),

Distributed immutable in-memory structures RDD (Resilient Distributed Dataset), DataFrame, Dataset

Scala is primary language for programming on Spark

Provide domain specific libraries


Spark Runtime (written in Java and Scala)

Spark

Streaming

(real-time)

GraphX

(graph)

SparkSQL

(SQL)

MLlib

(machine

learning)

Java Virtual Machine

tasks Executor

Driver

Executor

results

ExecutorData

Data

Data

Open source: http://spark.apache.org/

Data Source (HDFS, DB, File, etc.)

Latest version is 2.0.3 released in 2016/11

30

How Program Works on Apache Spark Parallel operations can be executed among partitions

In a partition, data can be processed sequentially


case class Pt(x: Int, y: Int)val ds1: Dataset[Pt] = sc.parallelize(Seq(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDSval ds2: Dataset[Pt] = ds1.map(p => Pt(p.x+1, p.y*2))val cnt: Int = ds2.reduce((p1, p2) => p1.x + p2.x)

ds1 ds2

p.x+1p.y*2

p1.x + p2.x

9

5

14partition

partition

cnt

54

32

+ =

+ =1 5

2 6

partition

pt

partition

31

2 10

3 12

3 7

4 8

4 14

5 16

How We Can Run Program Faster on GPU Assign many parallel computations into cores

Make memory accesses coalesce

Column-oriented layout results in better performance [Che2011] reports on about 3x performance improvement of GPU kernel execution of

kmeans with column-oriented layout over row-oriented layout

1 52 61 5 3 7

Assumption: 4 consecutive data elements

can be coalesced using GPU hardware

2 v.s. 4memory accesses to

GPU device memoryRow-oriented layoutColumn-oriented layout

Pt(x: Int, y: Int)Pt(1,5), Pt(2,6), Pt(3,7), Pt(4,8)Load four Pt.xLoad four Pt.y

2 6 4 843 87

coresx1 x2 x3 x4cores

Load Pt.x Load Pt.y Load Pt.x Load Pt.y

1 2 31 2 4

y1 y2 y3 y4 x1 x2 x3 x4 y1 y2 y3 y4


Idea to Transparently Exploit GPUs on Apache Spark

Generate GPU code from a set of parallel operations Made it in another research already

Physically put distributed immutable in-memory structures

(e.g. Dataset) in column-oriented representation Dataset is statically typed, but physical layout is not specified in program



Overview of GPU Exploitation on Apache Spark

Users Spark Program

case class Pt(x: Int, y: Int)ds1 = sc.parallelize(Seq(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS

ds2 = ds1.map(p => Pt(p.x+1, p.y*2))cnt = ds2.reduce((p1, p2) => p1.x + p2.x)

Nativ

e c

od

e

GPU

10

12

14

+ 1

=

* 2 =

ds1

Datatransfer

x y x y

ds2

partitionGPU

kernel

CPU

16

2

3

4

5

10

12

14

16

2

3

4

5

5

6

1

2

7

8

3

4

5

6

1

2

7

8

3

4

34


Overview of GPU Exploitation on Apache Spark Efficient

Reduce data copy overhead between CPU and GPU

Make memory accesses efficient on GPU

Transparent Map parallelism in program

into GPU native code

Users Spark Program

case class Pt(x: Int, y: Int)ds1 = sc.parallelize(Seq(Pt(1, 4), Pt(2, 5), Pt(3, 6), Pt(4, 7)), 2).toDS

ds2 = ds1.map(p => Pt(p.x+1, p.y*2))cnt = ds2.reduce((p1, p2) => p1.x + p2.x)

Drive

GPU native

code

Nativ

e c

od

e

GPU

+ 1

=

* 2 =

ds1

Datatransfer

x y

GPU manager

Columnar storage

x y

GPU can exploit parallelism bothamong partitions in Dataset andwithin a partition of Dataset

ds2

partitionGPU

kernel

CPU

Mem

ory

ad

dre

ss

35

10

12

14

16

2

3

4

5

10

12

14

16

2

3

4

5

5

6

1

2

7

8

3

4

5

6

1

2

7

8

3

4


How We Write Program And What is Executed Write an operation using a lambda expression for RDD. Then, the corresponding Java

bytecode for the expression is executed.

Write a program using a relational operation for DataFrame or a lambda expression for

Dataset. Catalyst performs optimization and code generation for the program. Then, the

corresponding Java bytecode for the generated Java code is executed.

ds1 = data.toDS()ds2 = ds2.map(p => p.x+1)ds2.reduce((a,b) => a+b)

rdd1 = sc.parallelize(data)rdd2 = rdd1.map(p => p.x+1)rdd2.reduce((a,b) => a+b)

df1 = data.toDF()df2 = df2.selectExpr("x+1")df2.agg(sum())

Frontend

API

DataFrame (v1.3-) Dataset (v1.6-) RDD (v0.5-)

Backend

computationCatalyst-generated Java bytecodeScalac-generated Java bytecode

Java code

Catalyst

1 5 2 6

Java heap

2 61 5

Java heap

Row-oriented Row-oriented Data

data =Seq(Pt(1, 5),Pt(2, 6))

36

Our Two Implementations for GPU Exploitation GPUEnabler is designed for writing domain specific libraries by a Ninja

programmer Transparent exploitation by calling a method in the library

Enhanced Catalyst is designed for writing application by general

programmer Transparent exploitation by automatic code generation

Code / Columnar Storage DataFrame/Dataset RDD

Hand-written code GPUEnabler

Automatic code generation Enhanced Catalyst


GPU manager

Generated

GPU/SIMD

Pre-compiled

GPU/SIMD code

Spark user program

Columnar storage

Spark runtime

37


How Program is Executed on GPU For RDD, a programmer provides a pre-compiled GPU code. GPUEnabler handles

data transfer between GPU and CPU and launches the GPU code on GPU

For DataFrame and Dataset, enhanced Catalyst generates Java code optimized for

GPU. A just-in-time compiler in Java virtual machine can generate GPU code.

ds1 = data.toDS()ds2 = ds2.map(p => p.x+1)ds2.reduce((a,b) => a+b)

rdd1 = sc.parallelize(data)rdd2 = rdd1.map(p => p.x+1, gpu)rdd2.reduce((a,b) => a+b, gpu)

df1 = data.toDF()df2 = df2.selectExpr("x+1")df2.agg(sum())

Frontend

API

DataFrame (v1.3-) Dataset (v1.6-) RDD (v0.5-)

Backend

computationAutomatically generated GPU codePre-compiled GPU code

Optimized Java code

Enhanced Catalyst

Data2 61 5

GPU device memory

Column-oriented 2 61 5

GPU device memory

Column-oriented

data =Seq(Pt(1, 5),Pt(2, 6))

GPU Enabler

38

GPUEnabler (https://github.com/IBMSparkGPU/GPUEnabler) Use columnar storage for RDD

Support map & reduce operations to drive GPU code

Pass GPU code provided by programmer

to an argument of map()/reduce()

Implemented as a plug-in

# bin/spark-shell --class your.gpu.application yours.jar --packages com.ibm:gpu-enabler_2.10:1.0.0

// Load a kernel function from the GPU kernel binary val ptxURL = SparkGPULR.getClass.getResource("/GpuEnablerExamples.ptx")

val mapFunction = new CUDAFunction("multiply2", Array("this"), Array("this"),ptxURL)val reduceFunction = new CUDAFunction(sum, Array(this), Array(this), ptxURL)val rdd = sc.parallelize(1 to n)val output = rdd

.mapExtFunc((x: Int) => x * 2, mapFunction)

.reduceExtFunc((x: Int, y: Int) => x + y, reduceFunction)


// GPU code__global__ void multiply2(int *inX, int *outX, long *size) {

long ix = threadIdx.x + blockIdx.x * blockDim.x;if (*size

Pseudo Java Code by Current Catalyst Perform optimization that merges multiple parallel operations

(selectExpr() and agg(sum()) into one loop

int sum = 0while (rowIterator.hasNext()) {

Row row = rowIterator.next(); // for df1int x = row.getInteger(0);// selectExpr(x + 1)

int x_new = x + 1; // for df2sum += x_new;

}

val df1 = (-1 to 1).toDF("x")val df2 = df1.selectExpr("x + 1")df2.agg(sum())

Generated code corresponds to selectExpr() and local sum()

1

3

1

0

-1

-1 0

DataFrame program for Spark


20 1

Read sequentially

40

df1

x

x_new

sum

Row-oriented

Catalyst

Generated pseudo Java code

Pseudo Java Code by Enhanced Catalyst Get column0 from column-oriented storage

For-loop can be executed in a reduction manner

Column column0 = df1.getColumn(0); // df1int sum = 0;for (int i = 0; i < column0.numRows; i++) {

int x = column0.getInteger(i);// selectExpr(x + 1)

int x_new = x + 1; // for df2sum += x_new;

}

1

10-1

-1 0

Generated pseudo Java code


3

20 1

41

df1

x

x_new

sum

Column-orientedEnhanced Catalyst

Generate GPU Code Transparently from Spark Program

Copy column-oriented storage into GPU

Execute add and reduction in one GPU kernel

Column column0 = df1.getColumn(0);int nRows = column0.numRows;cudaMalloc(&d_c0, nRows*4);cudaMemcpy(d_c0, column0, nRows, H2D);int sum = 0;cudaMalloc(&d_sum, 4);cudaMemcpy(d_c0, &sum, 4, H2D); GPU(d_c0, d_sum, nRows) // launch GPUcudaMemcpy(d_c0, &sum, 4, D2H);cudaFree(d_sum); cudaFree(d_c0);


val df1 = (-1 to 1).toDF("x")val df2 = df1.selectExpr("x + 1")df2.agg(sum())

// GPU code__global__ void GPU(int *d_c0, int *d_sum, long size) {long ix = // 0, 1, 2if (size

Many Engineering Efforts are Required Make DataFrame and Dataset column-oriented storage

Generate simpler optimized code in the while-loop



Very Complicated Java Code by Current Catalyst

Overhead exists in Java code Data representation

Data conversions

Complicated code

// source programval x = Array(1.0, 2.0), y = Array(3.0, 4.0)val ds = sparkContext.parallelize(Seq(x, y), 1).toDSds.map(a => a)

44

a. sparse array to

java.lang.Double[]

b. java.lang.Double[] to double[]

c. double[] to java.lang.Double[]

d. java.lang.Double[] to sparse array

Pretty Simple Java Code by Enhanced Catalyst Eliminated most of data conversions are eliminated

Use data representations suitable for GPU

Dense array to double[]

double[] to dense array

// source programval x = Array(1.0, 2.0), y = Array(3.0, 4.0)val ds = sparkContext.parallelize(Seq(x, y), 1).toDSds.map(a => a)


Related Work SparkJNI (https://github.com/tudorv91/SparkJNI)

Call native method from map() or reduce()

Very similar to GPUEnabler. However, use no columnar storage

Spark With Accelerated Tasks [Grossman2016] Generate GPU code from lambda function in map() in RDD

Very similar to enhanced Catalyst using columnar storage to transparently

exploit GPUs. However, work for RDD with map()

GPU Columnar (proposed by Kiran Lonikar) Generate GPU code from program using select() method in DataFrame

Very similar to enhanced Catalyst using columnar storage to transparently

exploit GPUsMaking Hardware Accelerator Easier to Use / Kazuaki Ishizaki

val inputRDD = cl(sc.objectFile[Int]( hdfsPath ))val doubledRDD = inputRDD.map(i => 2 * i)

JavaRDD vectorsRdd = getSparkContext().parallelize(generateVectors(2, 4));JavaRDD mulResults = vectorsRdd.map(new VectorMulJni(libPath, "mapVectorMul"));VectorBean results = mulResults.reduce(new VectorAddJni(libPath, "reduceVectorAdd"));

46

https://github.com/tudorv91/SparkJNI

ConclusionWe generated hardware accelerator code from program with

high-level abstraction

It is not easy to make them in systematic wayHow can we easily generate optimized code from different types of

domain specific languages? Program is cleaner and simpler than twenty-years ago.

How can we integrate good results in theory into practical system?

What can we do similar things for deep learning?Current deep learning frameworks use GPU by calling libraries (e.g.

cnDNN/cuRNN)

What are future programming models for deep learning?


Making Hardware Accelerator Easier to Use

Software

Transcript of Making Hardware Accelerator Easier to Use