Gpus graal

EnablingHeterogeneousComputing in

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

Runtime CodeGeneration

DataManagement

Results

Conclusion

Enabling Heterogeneous Computing in Javawith Graal

Juan Fumero, Michel Steuwer, Christophe Dubach

The University of Edinburgh

7 July 2015Truffle Workshop

1 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

1 Introduction

3 Runtime Code Generation

4 Data Management

5 Results

6 Conclusion

2 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

Heterogeneous Computing

NBody App (NVIDIA SDK) ˜105x speedup over seqLU Decomposition (Rodinia Benchmark) ˜10x over 32OpenMP threads

3 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

Cool, but how to program?

4 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

Example in OpenCL1 // create host buffers2 i n t ∗A, . . . .3 //Initialization4 . . .5 // platform6 c l u i n t numPlatforms = 0 ;7 c l p l a t f o r m i d ∗p l a t f o r m s ;8 s t a t u s = c l G e t P l a t f o r m I D s ( 0 , NULL , &numPlatforms ) ;9 p l a t f o r m s = ( c l p l a t f o r m i d ∗) m a l l o c ( numPlatforms∗ s i z e o f ( c l p l a t f o r m i d ) ) ;

10 s t a t u s = c l G e t P l a t f o r m I D s ( numPlatforms , p l a t f o r m s , NULL) ;11 c l u i n t numDevices = 0 ;12 c l d e v i c e i d ∗ d e v i c e s ;13 s t a t u s = c l G e t D e v i c e I D s ( p l a t f o r m s [ 0 ] , CL DEVICE TYPE ALL , 0 , NULL , &

numDevices ) ;14 // Allocate space for each device15 d e v i c e s = ( c l d e v i c e i d ∗) m a l l o c ( numDevices∗ s i z e o f ( c l d e v i c e i d ) ) ;16 // Fill in devices17 s t a t u s = c l G e t D e v i c e I D s ( p l a t f o r m s [ 0 ] , CL DEVICE TYPE ALL , numDevices ,

d e v i c e s , NULL) ;18 c l c o n t e x t c o n t e x t ;19 c o n t e x t = c l C r e a t e C o n t e x t (NULL , numDevices , d e v i c e s , NULL , NULL , &s t a t u s ) ;20 cl command queue cmdQ ;21 cmdQ = clCreateCommandQueue ( c o n t e x t , d e v i c e s [ 0 ] , 0 , &s t a t u s ) ;22 cl mem d A , d B , d C ;23 d A = c l C r e a t e B u f f e r ( c o n t e x t , CL MEM READ ONLY|CL MEM COPY HOST PTR ,

d a t a s i z e , A, &s t a t u s ) ;24 d B = c l C r e a t e B u f f e r ( c o n t e x t , CL MEM READ ONLY|CL MEM COPY HOST PTR ,

d a t a s i z e , B, &s t a t u s ) ;25 d C = c l C r e a t e B u f f e r ( c o n t e x t , CL MEM WRITE ONLY , d a t a s i z e , NULL , &s t a t u s ) ;26 . . .27 // Check errors28 . . .

5 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

Example in OpenCL

1 const char ∗ s o u r c e F i l e = ” k e r n e l . c l ” ;2 s o u r c e = r e a d s o u r c e ( s o u r c e F i l e ) ;3 program = c l C r e a t e P r o g r a m W i t h S o u r c e ( c o n t e x t , 1 , ( const char∗∗)&s o u r c e , NULL ,

&s t a t u s ) ;4 c l i n t b u i l d E r r ;5 b u i l d E r r = c l B u i l d P r o g r a m ( program , numDevices , d e v i c e s , NULL , NULL , NULL) ;6 // Create a kernel7 k e r n e l = c l C r e a t e K e r n e l ( program , ” vecadd ” , &s t a t u s ) ;89 s t a t u s = c l S e t K e r n e l A r g ( k e r n e l , 0 , s i z e o f ( cl mem ) , &d A ) ;

10 s t a t u s |= c l S e t K e r n e l A r g ( k e r n e l , 1 , s i z e o f ( cl mem ) , &d B ) ;11 s t a t u s |= c l S e t K e r n e l A r g ( k e r n e l , 2 , s i z e o f ( cl mem ) , &d C ) ;1213 s i z e t g l o b a l W o r k S i z e [ 1 ] = { ELEMENTS} ;14 s i z e t l o c a l i t e m s i z e [ 1 ] = {5} ;1516 clEnqueueNDRangeKernel (cmdQ , k e r n e l , 1 , NULL , g l o b a l W o r k S i z e , NULL , 0 , NULL ,

NULL) ;1718 c l E n q u e u e R e a d B u f f e r (cmdQ , d C , CL TRUE , 0 , d a t a s i z e , C , 0 , NULL , NULL) ;1920 // Free memory

6 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

OpenCL example

1 k e r n e l vo idvecadd (

2 g l o b a l i n t ∗a ,3 g l o b a l i n t ∗b ,4 g l o b a l i n t ∗c ) {5

6 i n t i d x =7 g e t g l o b a l i d ( 0 ) ;8 c [ i d x ] = a [ i d x ] ∗

b [ i d x ] ;9 }

• Hello world App ˜ 250 lines ofcode (including errorchecking)

• Low-level and specific code

• Knowledge about targetarchitecture

• If GPU/accelerator changes,tuning is required

7 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

OpenCL programming is hard and error-prone!!

8 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

Higher levels of abstraction

9 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

Higher levels of abstraction

10 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

Similar works

• Sumatra API (discontinued): Stream API for HSAIL

• AMD Aparapi: Java API for OpenCL

• NVIDIA Nova: functional programming language forCPU/GPU

• Cooperhead: subset of python than can be executed onheterogeneous platforms.

11 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

Our Approach

Three levels of abstraction:

• Parallel Skeletons: API based on functional programmingstyle (map/reduce)

• High-level optimising library which rewrites operations totarget specific hardware

• OpenCL code generation and runtime with datamanagement for heterogeneous architecture

12 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

Our approachOverview

Application +ArrayFunction API

Java Bytecode

(using Graal API)

OpenCL Kernel Generation

OpenCL Execution

Java source compilation

Java executiondotP.apply(input)

Accelerator

OpenCL Kernel

output

13 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

Example: Saxpy

1 // Computation function2 ArrayFunc<Tuple2<F l o a t , F l o a t >, F l o a t> mult = new

MapFunction<>(t −> 2 . 5 f ∗ t . 1 ( ) + t . 2 ( ) ) ;3

4 // Prepare the input5 Tuple2<F l o a t , F l o a t > [ ] i n p u t = new Tuple2 [ s i z e ] ;6 f o r ( i n t i = 0 ; i < i n p u t . l e n g t h ; ++i ) {7 i n p u t [ i ] . 1 = ( f l o a t ) ( i ∗ 0 . 3 2 3 ) ;8 i n p u t [ i ] . 2 = ( f l o a t ) ( i + 2 . 0 ) ;9 }

11 // Computation12 F l o a t [ ] output = mult . a p p l y ( i n p u t ) ;

If accelerator enabled, the map expression is rewritten in lowerlevel operations automatically.map(λ) = MapAccelerator(λ) =CopyIn().computeOCL(λ).CopyOut()

14 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

Our ApproachOverview

Ar r ayFunc

MapThr eads

MapOpenCL

Reduce. . .appl y( ) { f or ( i = 0; i < s i ze; ++i ) out [ i ] = f . appl y( i n[ i ] ) ) ;}

appl y( ) { f or ( t hr ead : t hr eads) t hr ead. per f or mMapSeq( ) ;}

appl y( ) { copyToDevi ce( ) ; execut e( ) ; copyToHost ( ) ;}

Funct i on

15 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

Runtime Code GenerationWorkflow

...10: aload_211: iload_312: aload_013: getfield16: aaload18: invokeinterface#apply23: aastore24: iinc27: iload_3...

Java sourceMap.apply(f)

Java bytecode

Graal VM

CFG + Dataflow(Graal IR)

void kernel ( global float* input, global float* output) { ...; ...;} OpenCL Kernel

3. optimizations

2. IR generation

4. kernel generation

1. Type inference

16 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

OpenCL code generated1 double lambda0 ( f l o a t p0 ) {2 double c a s t 1 = ( double ) p0 ;3 double r e s u l t 2 = c a s t 1 ∗ 2 . 0 ;4 r e t u r n r e s u l t 2 ;5 }6 k e r n e l vo id l ambdaComputat ionKerne l (7 g l o b a l f l o a t ∗ p0 ,8 g l o b a l i n t ∗ p 0 i n d e x d a t a ,9 g l o b a l double ∗p1 ,

10 g l o b a l i n t ∗ p 1 i n d e x d a t a ) {11 i n t p0 d im 1 = 0 ; i n t p1 d im 1 = 0 ;12 i n t gs = g e t g l o b a l s i z e ( 0 ) ;13 i n t l o o p 1 = g e t g l o b a l i d ( 0 ) ;14 f o r ( ; ; l o o p 1 += gs ) {15 i n t p 0 l e n d i m 1 = p 0 i n d e x d a t a [ p0 d im 1 ] ;16 b o o l cond 2 = l o o p 1 < p 0 l e n d i m 1 ;17 i f ( cond 2 ) {18 f l o a t auxVar0 = p0 [ l o o p 1 ] ;19 double r e s = lambd0 ( auxVar0 ) ;20 p1 [ p 1 i n d e x d a t a [ p1 d im 1 + 1 ] + l o o p 1 ]21 = r e s ;22 } e l s e { break ; }23 }24 }

17 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

Investigation of runtime for BS

Black-scholes benchmark.Float[] =⇒ Tuple2 < Float,Float > []

ount of to

tal ru

e in %

Unmarshaling

CopyToCPU

GPU Execution

CopyToGPU

Marshaling

Java overhead

• Un/marshal data takesup to 90% of the time

• Computation stepshould be dominant

This is not acceptable. Can we do better?

18 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

Custom Array Type

Programmer's View

Tuple2

Graal-OCL VM

float float float float...

double double double double...

FloatBuffer

DoubleBuffer

0 1 2 n-1

double

Tuple2

double

Tuple2

double

Tuple2

double

PArray<Tuple2<Float,Double>>

With this layout, un/marshal operations are not necessary

19 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

Example of JPAI

1 ArrayFunc<Tuple2<F l o a t , Double >, Double> f = newMapFunction<>(t −> 2 . 5 f ∗ t . 1 ( ) + t . 2 ( ) ) ;

3 PArray<Tuple2<F l o a t , Double>> i n p u t = new PArray<>( s i z e ) ;4

5 f o r ( i n t i = 0 ; i < s i z e ; ++i ) {6 i n p u t . put ( i , new Tuple2 <>(( f l o a t ) i , ( double ) i + 2) ) ;7 }8

9 PArray<Double> output = f . a p p l y ( i n p u t ) ;

20 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

• 5 Applications

• Comparison with:• Java Sequential - Graal

compiled code• AMD and Nvidia GPUs• Java Array vs. Custom

PArray• Java threads

21 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

Java Threads Execution

small large

Saxpysmall large

K−Means

small large

Black−Scholes

small large

N−Bodysmall large

Monte Carlo

Speedup v

s. Java

sequential

Number of Java Threads

#1 #2 #4 #8 #16

CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

22 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

OpenCL GPU Execution

small large

0.004 0.004small large

K−Meanssmall large

Black−Scholessmall large

N−Bodysmall large

Monte Carlo

Speedup v

s. Java

sequential

Nvidia Marshalling Nvidia Optimized AMD Marshalling AMD Optimized

AMD Radeon R9 295NVIDIA Geforce GTX Titan Black

23 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

OpenCL GPU Execution

small largeSaxpy

0.004 0.004small large

K−Meanssmall large

Black−Scholessmall large

N−Bodysmall large

Monte Carlo

Nvidia Marshalling Nvidia Optimized AMD Marshalling AMD Optimized

10x12x 70x

AMD Radeon R9 295NVIDIA Geforce GTX Titan Black

24 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

.zip(Conclusions).map(Future)

Present

• We have presented an API to enable heterogeneouscomputing in Java

• Custom array type to reduce overheads when transfer thedata

• Runtime system to run heterogeneous applications withinJava

Future

• Runtime data type specialization

• Code generation for multiple devices

• Runtime scheduling (Where is the best place to run thecode?)

25 / 26

Java withGraal

Juan Fumero,Michel

Steuwer,Christophe

Dubach

Introduction

DataManagement

Results

Conclusion

Thanks so much for your attention

This work was supported bya grant from:

Juan Jose Fumerojuan.fumero@ed.ac.uk

26 / 26

Gpus graal

Engineering

Transcript of Gpus graal

Visualizing Graal

AOF Wave front sensor modules GALACSI and GRAAL by Stefan Ströbele in behalf of the GALACSI and GRAAL Team members: R.Arsenault, R.Conzelmann, B.Delabre,

Polyglot on the JVM with Graal (English)

Message du Graal

Graal - JKUlafo.ssw.uni-linz.ac.at/papers/2017_PLDI_GraalTutorial.pdf · – Using Graal for static analysis – Custom compilations with Graal: integration of the compiler with an

Revista O Graal nº 14

Uwe Schwiegelshohn EPIT 2007, June 5 Ordonnancement - GRAAL

Truffle/Graal - cs.cmu.edu

Graal : going deeper in the JVM using Javawiki.jvmlangsummit.com/images/4/4e/2012-07-30_GraalJVMSummit2012... · Graal APIs meta –Lazy Reflection (read only) –Profiling code –native

1 - A Conspiracao Do Graal - Lynn Sholes

Polyglot on the JVM with Graal - JUG · Key Features of Graal • Designed for speculative optimizations and deoptimization – Metadata for deoptimizationis propagated through all

Revista O Graal nº 07

In Quest of the Holy Graal

Revista O Graal nº 03

Revista O Graal nº 15

Fastparallelevaluationofexactgeometricpredicateson GPUs

Revista O Graal nº 08

Revista O Graal nº 16

General Review of GRAAL physics : Achievements and Future

Revista O Graal nº 04