Post on 05-Apr-2018
7/31/2019 2010 Javagpu Diss
http://slidepdf.com/reader/full/2010-javagpu-diss 1/103
Peter Calvert
Parallelisation of Java for
Graphics Processors
Computer Science Tripos, Part II
Trinity College
May 11, 2010
Proforma
Name: Peter Calvert
College: Trinity College
Project Title: Parallelisation of Java for Graphics Processors
Examination: Computer Science Tripos, Part II, June 2010
Word Count: 11983 words
Project Originator: Peter Calvert
Supervisors: Dr Andrew Rice and Dominic Orchard
Original Aims of the Project
The aim of the project was to allow extraction and compilation of Java vir-
tual machine bytecode for parallel execution on graphics cards, specifically the
NVIDIA CUDA framework, by both explicit and automatic means.
Work Completed
The compiler produced successfully extracts and compiles code from
class files into CUDA C++ code, and outputs transformed classes that make use
of this native code. Developers can indicate loops that should be parallelised by
use of Java annotations. Loops can also be automatically detected as ‘safe’ using
a dependency checking algorithm.
On benchmarks, speedups of up to a factor of 187 were measured. Evaluation
of the automatic dependency analysis showed 85% accuracy over a range of sample code.
Special Difficulties
None.
Declaration
I, Peter Calvert of Trinity College, being a candidate for Part II of the Computer
Science Tripos, hereby declare that this dissertation and the work described in it
are my own work, unaided except as may be specified below, and that the dissertation does not contain material that has already been used to any substantial extent for a comparable purpose.
Signed
Date
Contents

1 Introduction
  1.1 Motivation
  1.2 Project Description
  1.3 Related Work
    1.3.1 JavaB [4]
    1.3.2 Within JikesRVM [16]
    1.3.3 JCUDA [25]
2 Preparation
  2.1 Requirements Analysis
    2.1.1 Extensions
  2.2 Development Process
  2.3 Methods of Testing
  2.4 Development Environment
  2.5 The Java Platform
    2.5.1 State
    2.5.2 Performance
  2.6 NVIDIA CUDA Architecture
    2.6.1 Thread Model
    2.6.2 Memory Model
  2.7 Common Compiler Analysis Techniques
    2.7.1 General Dataflow Analysis
    2.7.2 Loop Detection
    2.7.3 Live Variable Analysis
    2.7.4 Constant Propagation
    2.7.5 Data Dependencies
  2.8 Summary
3 Implementation
  3.1 Overall Implementation Structure
  3.2 Internal Code Representation (ICR)
    3.2.1 Code Graph
    3.2.2 Visitor Pattern
    3.2.3 Bytecode to ICR Translation
    3.2.4 Type Inference
  3.3 Dataflow Analysis
    3.3.1 Support for Arrays and Objects
    3.3.2 Increment Variables
    3.3.3 May-Alias
    3.3.4 Usage Information
  3.4 Loop Detection
    3.4.1 Loop Trivialisation
  3.5 Kernel Extraction
    3.5.1 Copy In
    3.5.2 Copy Out
  3.6 Dependency Analysis
    3.6.1 Annotation Based
    3.6.2 Automatic
  3.7 Code Generation
    3.7.1 C++
    3.7.2 Kernel Invocation
    3.7.3 Data Copying
  3.8 Compiler Tool
    3.8.1 Feedback to the User
  3.9 Summary
4 Evaluation
  4.1 Correctness
  4.2 Performance
    4.2.1 Model of Overheads
    4.2.2 Component Benchmarks
    4.2.3 Java Grande Benchmark Suite [7]
    4.2.4 Mandelbrot Set Computation
    4.2.5 Conway's Game of Life
    4.2.6 Summary
  4.3 Accuracy of Dependency Analysis
  4.4 Comparison with Existing Work
  4.5 Summary
5 Conclusions
  5.1 Comparison with Requirements
  5.2 Future Work
    5.2.1 Further Hardware Support
    5.2.2 Further Optimisations
    5.2.3 Further Automatic Detection
  5.3 Final Conclusions
Bibliography
A Dataflow Convergence Proofs
  A.1 General Dataflow Analysis
  A.2 Live Variable Analysis
  A.3 Constant Propagation
B Code Generation Details
C Command Line Interface
D Sample Code Used
  D.1 Java Grande Benchmark Suite
  D.2 Mandelbrot Computation
  D.3 Conway's Game of Life
E Testing Gold Standards
F Class Index
G Source Code Extract
H Project Proposal
List of Figures

1.1 Build process.
2.1 Iterative development process.
2.2 Development environment.
2.3 Software model of threads under CUDA.
2.4 CUDA hardware architecture.
2.5 Various examples of loops.
3.1 Outline call graph for main classes.
3.2 Garbage collection of unreachable blocks.
3.3 Unification algorithm.
3.4 Outline of kernel extraction algorithm.
3.5 Form of multiple dimension kernels.
3.6 Array and object type templates for on-GPU execution.
4.1 Effect on copy performance (host-to-device) of single vs. multiple allocations.
4.2 Comparison of measured performance with model (using CUDA SDK).
4.3 Values of td and th for measurements (using CUDA SDK).
4.4 Fit of model (green) to component benchmarks.
4.5 Fit of model to Fourier Series benchmark, using previously calculated parameters.
4.6 Fit of model to Mandelbrot benchmark, using previously calculated parameters.
4.7 Speedups and overhead for Mandelbrot benchmark with fixed iteration limit (250).
4.8 Speedups and overhead for Mandelbrot benchmark with fixed grid size (8000 × 8000).
4.9 Overall times for simulation of Conway's Game of Life.
5.1 Minimum finding algorithms.
D.1 3 generations of the Game of Life.
List of Tables

2.1 CUDA memory spaces.
3.1 Summary of JVM Instructions and their internal representation.
3.2 Unification Details.
4.1 Tests made for each compiler state.
4.2 Expected timings for overhead stages according to model.
4.3 Model of overheads for component benchmark versions.
4.4 Model parameters, as measured using component benchmarks.
4.5 Speedup factors for the component benchmarks.
4.6 Summary of speedup factors.
4.7 Comparison of Java Grande benchmark timings with JCUDA.
D.1 Summary of Section 2 of the Java Grande Benchmark Suite.
List of Examples

1.1 Mandelbrot Set computation (kernel highlighted)
2.1 Example of thread divergence.
3.1 Graph for Mandelbrot computation.
3.2 UML sequence diagram for Visitor pattern operation.
3.3 Basic block that causes difficulties when exporting.
3.4 Reuse of local variable locations.
3.5 Results from increment variable analysis computation.
3.6 Example inter-procedural may-alias computation.
3.7 Non-termination of may-alias analysis.
3.8 Mandelbrot control flow graph after various stages of loop detection.
3.9 Examples of the automatic dependency check.
3.10 C++ code generation for float Cr = (x * spacing - 1.5f);.
Acknowledgements
Many thanks are owed to everyone who has given me guidance, feedback and
encouragement throughout this project. Specifically, my two supervisors, Dr
Andrew Rice and Dominic Orchard, have been invaluable in advising me at
tricky points. I owe particular thanks to Andy, who stopped me from naïvely attempting an even more ambitious project!
CHAPTER 1
Introduction
This chapter explains the motivation for using parallel architectures, before de-
scribing the scope of this project. I also provide a short overview of other relevant
work, and highlight the differences between these and the approach taken here.
1.1 Motivation
In the past, improvements in processor performance have taken the form of in-
creased clock speeds. However, since 2002, developments have come from the use
of multiple processors to solve independent parts of a problem in parallel [24].
Commodity parallel processing is now available not only as multi-core CPUs, but
also graphics processors (GPUs) that allow many more threads to be executed
in parallel with the restriction that they share a program counter—i.e. single
instruction multiple data (SIMD).
Unfortunately, most existing code is sequential, so the performance gains from
executing it on parallel architectures are limited. Often, it must be rewritten to
benefit. Automatic parallelisation aims to address this by analysing sequential
code during compilation, and identifying regions that can be executed in parallel.
However, determining whether dependencies exist between two regions of code is undecidable in the general case [15]. Therefore any analysis must be approximate to some extent, and developers may find that small changes in code result
in disproportionate changes in performance. This suggests that a mix between
explicit and automatic parallelism might be desirable, with detailed feedback in
the automatic case being an important feature.
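The distinction at stake can be illustrated with two small loops (an illustrative Java sketch, not code from this project). In the first, every iteration writes a distinct array element, so the iterations are independent and could run in parallel; in the second, each iteration reads a value produced by the previous one, a loop-carried dependency that forces sequential execution.

```java
public class DependenceDemo {
    // Independent iterations: each writes a distinct element, so the
    // iterations could safely run in parallel.
    static int[] squares(int n) {
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = i * i;
        }
        return out;
    }

    // Loop-carried dependency: iteration i reads the value accumulated by
    // iterations 0..i-1, so the iterations must run sequentially.
    static int[] prefixSums(int[] in) {
        int[] out = new int[in.length];
        int running = 0;
        for (int i = 0; i < in.length; i++) {
            running += in[i];   // depends on all earlier iterations
            out[i] = running;
        }
        return out;
    }
}
```

An automatic analysis must conservatively distinguish the two cases; small syntactic changes can move a loop from one category to the other, which is why feedback to the developer matters.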
Figure 1.1: Build process. (Java, Scala and other sources are compiled by javac, scalac, etc. to bytecode; the parallelising compiler transforms this bytecode, plus libraries, into modified bytecode and native code for execution on the JVM.)
1.2 Project Description
This project focuses on the data parallel, or SIMD, pattern used on graphics pro-
cessors. For NVIDIA graphics devices, this is provided by extensions to C++ in
their CUDA framework [20]. However, this framework and similar cross-platform
APIs, such as OpenCL, operate at a low level, with developers manually handling
data transfers and ‘kernel’ invocations. Ports of CUDA to other languages gen-
erally still require kernels to be written in C++ (e.g. Py-CUDA [14] for Python,
and JCUDA [25] for Java).
This project allows these graphics processors to be used from a high level
language through both explicit annotations (parallel for loops) and automatic
analysis. For reasons of familiarity, I consider the Java Virtual Machine (JVM),
although similar techniques could be applied to other virtual machines such as
Microsoft’s Common Language Runtime. By operating at the bytecode level, no
modifications are made to the syntax of Java, and the compiler should work with
languages other than Java that compile onto the JVM. The compiler fits in as an additional step in the build process (Figure 1.1), taking a class file (compiled
bytecode) as input and producing a replacement along with any required supple-
mentary files. For clarity, this report gives examples in Java rather than bytecode
whenever possible.
One example used throughout this report is the computation of the Man-
delbrot Set (Example 1.1). The parallelising compiler extracts lines 4 to 16
(highlighted) as a two dimensional kernel that can be executed in parallel on the
GPU.
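As a sketch of how source-level annotations can remain transparent to a standard Java compiler, consider the following hypothetical example (the project's actual annotation name and placement are not specified here; note also that Java annotations attach to declarations such as methods rather than to individual statements, so this sketch marks the enclosing method):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

public class AnnotationSketch {
    // Hypothetical marker annotation. CLASS retention keeps it visible in
    // the compiled bytecode, where a bytecode-level tool could read it.
    @Retention(RetentionPolicy.CLASS)
    @Target(ElementType.METHOD)
    @interface Parallel { }

    static int size = 8;
    static float[][] data = new float[size][size];

    // A standard Java compiler simply ignores the annotation, so the
    // source remains valid for an ordinary build.
    @Parallel
    public static void fill(float spacing) {
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                data[y][x] = x * spacing - 1.5f;
            }
        }
    }
}
```

Because the annotation survives into the class file, the parallelising compiler can act on it after compilation without any change to the Java syntax.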
1.3 Related Work
Parallel computation is currently a huge research field. There have been many
attempts at both intuitive frameworks and effective automatic analysis. Ap-
proaches for parallelising Java have included both static analyses similar to my
work, and also direct ports of CUDA.
 1  public void compute() {
 2    for (int y = 0; y < size; y++) {
 3      for (int x = 0; x < size; x++) {
 4        float Zr = 0.0f, Zi = 0.0f;
 5        float Cr = (x * spacing - 1.5f), Ci = (y * spacing - 1.0f);
 6        float ZrN = 0, ZiN = 0;
 7        int i;
 8
 9        for (i = 0; (i < ITERATIONS) && (ZiN + ZrN <= LIMIT); i++) {
10          Zi = 2.0f * Zr * Zi + Ci;
11          Zr = ZrN - ZiN + Cr;
12          ZiN = Zi * Zi;
13          ZrN = Zr * Zr;
14        }
15
16        data[y][x] = (short) ((i * 255) / ITERATIONS);
17      }
18    }
19  }

Example 1.1: Mandelbrot Set computation (kernel: lines 4 to 16)
1.3.1 JavaB [4]
Developed by Aart Bik in 1997, this work adopts a similar transformation ap-
proach to that of my project, although it targets multiple CPUs rather than
GPUs. It detects regions of code that can be executed in parallel, and produces a modified class file that uses Java threads to exploit the parallelism. The detection is partially automatic, with user input to make the analysis more accurate.
However, this input is at the level of ‘do variables x and y alias?’, not ‘should this
loop be run in parallel?’, and is specified to the compiler rather than in source
code.
1.3.2 Within JikesRVM [16]
This recent work (2009) implements automatic analysis within the JikesRVM virtual machine (originally Jalapeño [2]), operating on intermediate code in a similar manner to this project. It has the advantage over a compile-time approach that
all applications are modified, but requires that users install a specific virtual ma-
chine. Unlike static approaches, it has access to runtime information. However,
it cannot provide compile-time feedback, possibly resulting in unpredictable per-
formance. The benchmarks were all written by the author, and therefore it is
hard to know how effective the analysis might be on more typical code and full
applications.
1.3.3 JCUDA [25]
This paper, also published in 2009, details a partial port of CUDA to an extended
Java syntax, providing the same low level interface to invoke kernels and copy
data. Kernels must still be written in C++. This gives an unusual mix of Java’s
high level approach with low level exposure to hardware. This project’s approach
of using annotations is preferable, since the source code may still be compiled with
a standard Java compiler, which simply ignores the parallel annotations.
Their performance results, based on hand written CUDA versions of the Java
Grande benchmarks [7], give a reference point for possible speedups (assuming
similar hardware). This work should not be confused with a library of the same name. The
jCUDA library provides access to a number of numerical routines, written using
CUDA, from within Java.
CHAPTER 2
Preparation
In order to complete this project successfully and develop a compiler that pro-
duced correct results with real benefits, it was crucial to be clear from the outset
what was required and to have a sensible plan for achieving this. This chapter
documents this process, and introduces the key concepts and theory on which
the compiler implementation is based.
2.1 Requirements Analysis
Given the large array of possible directions that this project could have taken,
there was a real need to set clear goals and requirements. For any inherently
technical software, such as a compiler, it is difficult to separate what it must
achieve from how this might be done. Requirements analysis aims to concentrate
on the first of these, setting out goals that can be verified objectively.
The following core requirements (C1 – C6) are derived from the success criteria
set out in the project proposal. They are made from the perspective of a developer
with no knowledge of compiler internals, and should capture their expectations.
This gives the first property to which all compilers must adhere:
C1. Correctness: Application of the compiler to JVM bytecode should not
affect the results of the code in any significant way.
Moving to the specific requirements on this project, the user should be able
to gain tangible benefits from using the compiler relatively easily. As it is an
optional step in the build process, this is required to warrant its inclusion.
C2. Performance: It must be possible to achieve improvements in execution
time by using the compiler.
C3. Usage: Any code modifications required to achieve speedups must be minimal and transparent to standard compilers. These modifications must
make it possible to specify that a for loop is run in parallel. Furthermore,
if multiple tightly-nested loops are specified, the inner body should be run
in parallel across each of the dimensions.
For these benefits to be observed consistently, they should apply as universally
as possible:
C4. Scope: Ideally, it should be possible for any JVM instructions to be executed on the graphics processor. However, GPU architectures place some restrictions on what is possible, and for the core of the project, support is
restricted to use of basic arithmetic on primitive local variables and arrays.
This notably excludes support for exceptions, monitors and objects.
For various reasons, code specified for parallel execution may not be exe-
cutable as such. In this case, it is important that sufficient feedback is given:
C5. Feedback: There must be varying levels of output available that indicate
reasons if certain regions of code were not appropriate for parallel execution.
The final requirement on the core of the project ensures that the above can
be verified objectively by the developer:
C6. Verifiable: Supplementary tools and pools of example code must be made
available so that developers can evaluate the compiler objectively.
2.1.1 Extensions
The project proposal also outlines several areas where the project might be extended. These are formally set out below so that they can be assessed in the evaluation of the project.
E1. Automatic Detection of Loop Bounds: The number of iterations of a
loop should be inferred from the bytecode, without any user input.
E2. Automatic Dependency Checking: The compiler should detect, with
little help from annotations, regions of code for parallel execution.
Evaluation / Design
Refactoring
Implementation
Testing
Prototyping
Figure 2.1: Iterative development process.
E3. Runtime Checks: Some annotations (for example, any introduced by
E2) should be replaced with runtime checks (as in [16]) that can determine
whether to execute the kernel in parallel, or to use the original CPU code.
E4. Support for Objects on GPU: It would be useful to include object-
oriented code in parallel regions, within the scope of what the graphics
processor supports.
E5. Further Code Optimisation: Some optimisations that neither the vir-
tual machine nor the GPU compiler can make (due to splitting the code
between the CPU and GPU) should be reimplemented (e.g. loop invariant
code motion).
E6. Code Transformations: In cases where code is not suitable for parallel
execution, it may be possible to modify the code—perhaps by splitting loops into parallelisable and non-parallelisable chunks (loop fission) or by
matching common patterns (e.g. minimum finding).
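The loop fission mentioned in E6 can be sketched as follows (an illustrative Java example, not code from the project): a loop that mixes an independent element-wise update with a sequential reduction is split so that the first loop becomes a candidate for parallel execution while the second stays sequential.

```java
public class LoopFissionDemo {
    // Original: one loop doing both jobs, so the whole loop is sequential.
    static long fused(int[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            a[i] = a[i] * 2;   // independent per iteration: parallelisable
            sum += a[i];       // loop-carried reduction: sequential
        }
        return sum;
    }

    // After fission: the first loop could be extracted as a GPU kernel,
    // the second remains on the CPU. The overall result is unchanged.
    static long fissioned(int[] a) {
        for (int i = 0; i < a.length; i++) {
            a[i] = a[i] * 2;
        }
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }
}
```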
2.2 Development Process
The development process adopted for this project was based on an iterative style
similar to the Spiral Model [5]. This enabled compiler stages to be tested on
real class files as early in the timetable as possible. From this position, iterations
consisted of the following main stages (Figure 2.1):
1. Evaluation of which stage or feature should be implemented next, based on
measurements and observations indicating which was most applicable.
2. Refactoring of existing code to allow the new feature to be integrated nat-
urally.
3. Implementation of the new stage or feature.
4. Testing on an increasing pool of sample code, and fixing compiler bugs in
order that more code could be compiled correctly.
In this way, feedback from each iteration informed future development, avoiding wasted time on unnecessary features. The process also suited the integrated
testing strategy described in the next section.
A slight deviation compared with Boehm's original Spiral model is the omission of a separate prototyping stage between 1 and 2. This was primarily due
sion of a separate prototyping stage between 1 and 2. This was primarily due
to time constraints. However, some prototype code was written prior to starting
the main implementation in order to experiment with suitable internal represen-
tations (Section 3.2).
In comparison, the classical Waterfall method would have delayed integrated
tests until the later stages of the project, preventing benchmarks and measurements from directing design decisions such as selection of extensions.
2.3 Methods of Testing
Given the importance of maintaining correctness (Requirement C1), it seemed
natural that testing should include full integration tests over the whole compiler.
The first development iteration allowed a subset of JVM bytecode to be imported
into an internal representation, and re-exported to a new class file. As more
stages were added, these tests were rerun (i.e. regression testing) to ensure that
correctness was maintained, and new samples were added to test new features.
The integration tests consisted of a range of self-testing Java code to test the
compiler from both a black box and white box perspective. The first of these could
only be done by using code written by other developers, such as the Java Grande
Benchmark Suite [7]. The white box testing was done by specific examples written
to cover different features of the compiler.
At a finer granularity, analysis stages of the compiler were unit tested by
comparing their results, for the same pool of sample code, to a gold standard
produced manually.
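The shape of such a self-testing sample can be sketched as follows (a hypothetical example, not one of the project's actual tests): the computation that the compiler may transform is checked against an independently computed reference value, so the sample passes only if the transformation preserved the result.

```java
public class SelfTestingSample {
    // Candidate loop: the body a parallelising compiler might extract.
    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += (long) i * i;
        }
        return total;
    }

    // Reference result via the closed form for sum of squares of 0..n-1,
    // i.e. (n-1)n(2n-1)/6, used as the gold standard that the transformed
    // code must still reproduce.
    static long reference(int n) {
        long m = n - 1;
        return m * (m + 1) * (2 * m + 1) / 6;
    }

    // Returns true iff the (possibly transformed) loop still agrees with
    // the independently derived answer.
    public static boolean selfTest(int n) {
        return sumOfSquares(n) == reference(n);
    }
}
```

Running the same sample before and after compilation gives a regression test requiring no manual inspection of the output.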
2.4 Development Environment
The overall development environment is presented in Figure 2.2. Here I highlight
some key aspects of this and the decisions made.
Language. The Java language was the natural choice for implementing the com-
piler due to familiarity, along with some use of C++ as required by CUDA.
Figure 2.2: Development environment. (Working copy on the development machine; master SVN repository on the Public Workstation Facility, replicated on every commit to a duplicate repository on the SRCF via SSH+SVN with key authentication, and covered by UCS backups; the CL file server and test machines reached via NFS and scp. Test machines: earlybird, a shared workstation with a Core 2 Quad (2.66GHz), 3MB cache, 8GB RAM and a GeForce 9600 GT (512MB global memory, 6 multiprocessors, 1.6GHz); and bing, a dedicated machine with 2× Pentium 4 (3.20GHz), 2MB cache, 1GB RAM and a GeForce GTX 260 (896MB global memory, 27 multiprocessors, 1.24GHz, double precision support). Coding with Sun JDK 1.6.0.18 and NetBeans; dissertation in LaTeX with TikZ; evaluation with SQLite, gnuplot and Matlab.)
The availability of the ASM [6] library for reading and writing class files
also influenced this decision. Note that GCC 4.3.3 was used rather than
the more recent GCC 4.4 due to compatibility issues with CUDA 2.3.
Version Control. Subversion was used for storing all project files. This allowed
changes to be rolled back, and code to be transferred between machines in a coherent manner. Since this dissertation was written in LaTeX, with
diagrams written in TikZ and graphs produced using gnuplot and shell
scripts, most binary files could be reproduced and did not need to be stored.
Backups. These were predominantly provided by the regular PWF backups
made by the University Computing Service. The copy replicated on the
SRCF¹ was intended to guard against accidental deletion of the master
repository and to reduce downtime if the PWF became unavailable. The
Computer Laboratory filespace was only used during testing, with results
being transferred to the working copy immediately, and therefore did not need backing up.
Testing Hardware. Two machines with compatible graphics cards were avail-
able (earlybird and bing). Since the resources on earlybird were shared
with other users and an X server, bing was generally preferred.
¹ Student Run Computing Facility (http://www.srcf.ucam.org/)
2.5 The Java Platform
The Java language and corresponding virtual machine were developed in the
1990s, and made up the first mainstream platform of their type. More recent
alternatives, such as the Common Language Runtime, have used hindsight to
improve the design in some areas. However, Java remains commonly used and
compilers that target the JVM are still developed by third parties for other lan-
guages.
The virtual machine is stack-based, and its instruction set can be considered
RISC-like² and mostly orthogonal (i.e. each instruction is available for each type).
The features below are key for this project. Java also supports garbage-collection,
objects, synchronisation monitors and exceptions.
Annotations. These have been available since Java 1.5 and are maintained in
the compiled bytecode. They have been used widely to allow tools to modify
and instrument bytecode after compilation. Source code utilising annota-
tions also remains compatible with a standard compiler.
Native Interface (JNI). This offers the facility for using code written in other
languages, which may make use of system calls not abstracted by the Java
libraries, at the cost of portability.
JNI specifies [18] the format of shared libraries that implement ‘native’
methods and the functions that allow interaction with Java objects and code.
2.5.1 State
Data within the JVM can exist in four locations (from the perspective of byte-
code): the operand stack, the local variable stack, static variables and the heap.
All instruction operands are taken from the operand stack, and results are pushed
onto this. Local variables and statics can be used to store any of the ‘primitive’
datatypes³ [19]. Objects and arrays reside in the heap, and are identified by references. Monitor synchronisation support and exception handling also introduce state associated with control flow.
² Reduced instruction set computers (RISC) provide only common instructions, choosing to optimise these rather than offering more complex instructions.
³ boolean, byte, short, int, long, float, double, char and references.
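As a toy illustration of this stack discipline (not the JVM itself), consider the bytecode a Java compiler emits for int c = a + b; in an instance method, namely iload_1; iload_2; iadd; istore_3, simulated here in plain Java:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the JVM's two storage areas for this fragment: an array of
// local variable slots and an operand stack that all instructions work on.
public class OperandStackDemo {
    public static int[] run(int a, int b) {
        int[] locals = new int[4];          // local variable slots 0..3
        locals[1] = a;                      // parameter a in slot 1
        locals[2] = b;                      // parameter b in slot 2
        Deque<Integer> stack = new ArrayDeque<>();

        stack.push(locals[1]);              // iload_1: push local 1
        stack.push(locals[2]);              // iload_2: push local 2
        int rhs = stack.pop();              // iadd: pop two operands...
        int lhs = stack.pop();
        stack.push(lhs + rhs);              // ...and push their sum
        locals[3] = stack.pop();            // istore_3: pop into local 3 (c)

        return locals;
    }
}
```

Every arithmetic instruction follows this pattern: operands are popped from the stack and the result pushed back, with loads and stores moving values to and from the local variable slots.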
2.5.2 Performance
Originally, JVMs interpreted bytecode at runtime, causing very poor perfor-
mance. Whilst this is still a common belief, since the introduction of Just-In-Time (JIT) compilation, there have been studies suggesting that performance is com-
parable to that of C and C++ [17]. In some cases, the studies even show that
Java can take advantage of runtime information to outperform C.
It was suggested in the late 1990s that Java might be an appropriate language
for future high performance computing (HPC) applications [23]. Whilst this has
never materialised, a recent study concludes that in most cases there is no reason
why Java shouldn’t be used for computationally expensive applications, although
they do note that there are significant overheads in communication intensive
applications [3].
2.6 NVIDIA CUDA Architecture
When released in 2007, CUDA was one of the only general purpose frameworks for
graphics processors. Previously, general purpose computation had to be formulated as graphics operations [22]. CUDA supports many programming constructs, including conditional and looping control flow, although it lacks support for recursion and virtual function lookups [20, Appendix B.1.4].
Operations that are invoked on the GPU are executed asynchronously from
the perspective of the CPU code. There are therefore some useful constructs provided in the CUDA API that allow accurate timing of operations.
The framework is based on C++ with keywords for specifying whether func-
tions should be compiled for the GPU or host, and in which memory space vari-
ables should be stored. It also adds a syntax for invoking kernels. With each
new version of CUDA, the provided compiler (nvcc) moves closer to supporting
all the features of C++.
2.6.1 Thread Model
The threading model exposed to software by CUDA is illustrated in Figure 2.3.
Each thread must execute the same code, but is given coordinates so that it
may operate on different data. The two level approach is due to the hardware
architecture which also places limits on the dimensions of both grids and blocks.
The CUDA hardware architecture is shown in Figure 2.4. Each block is
assigned to a multiprocessor which contains 8 processors each executing 4 threads.
As such, 32 threads can be executed concurrently in each block, in what is called
Figure 2.3: Software model of threads under CUDA. (Diagram: a grid composed of blocks, each block composed of threads.)
a warp. There are therefore advantages to ensuring that the number of threads in
a block is a multiple of this. It is also worth noting that there is only one double
precision unit per multiprocessor, and as such there is a significant penalty for
performing double precision arithmetic.
Since each processor within a multiprocessor must execute the same instruc-
tions, there is a performance hit whenever threads within a single warp diverge.
This occurs when two or more threads take different paths through the control
flow graph, as in Example 2.1. In this case, the hardware must execute the
different paths sequentially.
CUDA also provides primitives for synchronisation between threads; however, these are not used in this project. Without these, the thread model can be
considered as a parallel for loop over a number of dimensions.
2.6.2 Memory Model
The hardware model also has implications for the software memory model. As
shown in Figure 2.4, there are a variety of memory areas, each with different
properties as summarised in Table 2.1.
When memory accesses are consecutive within a warp (i.e. thread i is reading
arr[n + i]), then the hardware can coalesce these into fewer memory accesses that utilise the full width of the memory bus.
It is worth noting that often there is less memory available on the GPU than
the host. Therefore computations offloaded to the GPU may fail ‘early’, or be
forced to revert to CPU execution, giving a ‘wall’ in performance.
Figure 2.4: CUDA hardware architecture. (Based on a figure used in various NVIDIA presentations. Diagram: host memory connects over the PCI-e bus to device (global) memory; up to about 30 multiprocessors each contain 8 processors with registers, together with shared memory, an instruction unit, and constant and texture caches.)
1 if (index & 1) s[index >> 1] = sin(in[index >> 1]);
2 else           c[index >> 1] = cos(in[index >> 1]);

1 if (index < W) s[index] = sin(in[index]);
2 else           c[index - W] = cos(in[index - W]);

index ranges between 0 and 2W − 1, where W is a multiple of the warp size. The second case runs roughly twice as fast, since there is no thread divergence (for W = 51200, the timings are 0.102ms and 0.043ms respectively).

Example 2.1: Example of thread divergence.
Memory    | Location | Cached | Access     | Scope                  | Size⁵
Registers | On-chip  | N/A⁶   | Read/write | One thread             | 16384
Shared    | On-chip  | N/A⁶   | Read/write | All threads in a block | 16KB
Local     | Off-chip | No     | Read/write | One thread             | up to global
Global    | Off-chip | No     | Read/write | All threads and host   | 896MB
Texture   | Off-chip | Yes    | Read       | All threads and host   | up to global
Constant  | Off-chip | Yes    | Read       | All threads and host   | 64KB

Table 2.1: CUDA memory spaces.
2.7 Common Compiler Analysis Techniques
In this section, I introduce some common methods used within compilers [1], and
indicate why each is applicable.
2.7.1 General Dataflow Analysis
Dataflow analysis describes a common framework used for determining properties
of programs [13], such as which variables must be transferred to the graphics
processor (Section 2.7.3) and the behaviour of writes to variables (Sections 2.7.4
and 3.3.2).
The result of an analysis for an instruction or block of code, R(b) ∈ X, is given by Equation 2.1, where (X, ⊑) is a complete lattice.

Definition 1. A complete lattice is a partially ordered set in which every subset has a unique least upper bound (its join or lub) and a unique greatest lower bound (its meet or glb). We denote the join of the whole set as ⊤ and the meet as ⊥.

The function children(n) is usually defined to be either the predecessor set (forward analysis) or the successor set (backward analysis), with F_init giving the value at entry points or exits respectively. ⊔ can be chosen either as the join (lub) or meet (glb) operator. F_b : X → X is the transfer function that alters a result in accordance with the instruction or block b.

R(b) = F_b(⊔_{c ∈ children(b)} R(c))   if children(b) ≠ ∅
     = F_b(F_init)                     if children(b) = ∅        (2.1)
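Equation 2.1 amounts to an iterative fixed-point computation. A minimal sketch of such a solver is shown below (an illustration only, not the project's actual Dataflow framework; here the lattice is sets of strings under union, in a forward analysis):

```java
import java.util.*;
import java.util.function.BiFunction;

// Minimal iterative dataflow analysis: results start at the bottom element
// (the empty set) and are recomputed until a fixed point is reached.
public class Dataflow {
    // transfer(b, in) applies F_b; preds gives children(b) for a forward
    // analysis; init plays the role of F_init's argument at entry blocks.
    static Map<Integer, Set<String>> solve(
            int numBlocks,
            Map<Integer, List<Integer>> preds,
            Set<String> init,
            BiFunction<Integer, Set<String>, Set<String>> transfer) {
        Map<Integer, Set<String>> result = new HashMap<>();
        for (int b = 0; b < numBlocks; b++) result.put(b, new HashSet<>());
        boolean changed = true;
        while (changed) {                      // iterate to a fixed point
            changed = false;
            for (int b = 0; b < numBlocks; b++) {
                Set<String> in = new HashSet<>();
                List<Integer> ps = preds.getOrDefault(b, List.of());
                if (ps.isEmpty()) in.addAll(init);              // entry block
                else for (int p : ps) in.addAll(result.get(p)); // join (union)
                Set<String> out = transfer.apply(b, in);
                if (!out.equals(result.get(b))) { result.put(b, out); changed = true; }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two blocks: 0 -> 1; block 0 generates "x", block 1 generates "y".
        Map<Integer, Set<String>> r = solve(2, Map.of(1, List.of(0)), Set.of(),
            (b, in) -> { Set<String> out = new HashSet<>(in);
                         out.add(b == 0 ? "x" : "y"); return out; });
        System.out.println(r.get(1)); // block 1's result contains both x and y
    }
}
```

Iterating blocks in a suitable order (as the surrounding text notes) merely reduces the number of passes; the fixed point reached is the same.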
⁵ Sizes for a GeForce GTX 260 card.
⁶ Neither registers nor shared memory need a cache, since both are accessed within a single clock cycle.
Figure 2.5: Various examples of loops: (a) single entry/single exit; (b) multiple entries; (c) multiple exits.
Since the control flow graph may be cyclic (due to loops), R(b) must be
computed iteratively until a fixed point solution is reached. Initially, each R(b)
is set to the least element ⊥ ∈ X. The number of iterations until convergence depends on the order in which instructions and blocks are considered. For forward analyses, they should be considered from start to end, and for backward analyses the converse. For each specific dataflow analysis, it is necessary to prove that the analysis will converge. This can be shown to be a consequence of (X, ⊑) having finite height, and F_b being monotone. The proof of this, and also convergence of the specific analyses that follow, is given in Appendix A.
2.7.2 Loop Detection
JVM instructions provide only unstructured control flow with branches to arbi-
trary labels, and all structured information regarding loops and conditionals is
discarded at compile-time. Therefore, in order to extract loop bodies for parallel
execution, some of this structure must be reconstructed using loop detection. Ide-
ally, this should be done without needing user annotations (as per Requirement
C3).
Definition 2. A natural loop is defined as a loop with only a single entry point.
In general, detection is made difficult by the possibility of loops with multiple
entries and exits as in Figure 2.5. However, by restricting detection to the case
of natural loops, a simple algorithm can be used [1, p655]. This case still includes all loops either expressible using standard for and while constructs in high-level
languages, or suitable for GPU execution (see Section 2.6.1).
Definition 3. In a control flow graph of basic blocks, we define a block m to be
a dominator of another block n if all execution paths to n contain m.
Definition 4. A back edge is defined to be an edge whose end dominates its
start.
Since a single entry point must dominate every block in the loop body, the
edge to the entry from the end of the body must be a back edge. Therefore, each
natural loop corresponds to a back edge in the control flow graph, and can be
detected by the following simple algorithm.
Step 1 Calculate the set of dominators D(b) of each block b.
Step 2 Find any edge m → n such that n ∈ D(m). This gives a natural loop
with body from n to m.
Since each dominator of a block b must also be a dominator of all of b’s immediate predecessors, the dominator set of b, D(b) ∈ ℘(Blocks), can be described by:

D(b) = {b} ∪ ⋂_{p ∈ pred(b)} D(p)        (2.2)
This is a form of forward dataflow analysis over the lattice (℘(Blocks), ⊆) using the meet operator (i.e. set intersection) and the transfer function in Equation 2.3. This can therefore be computed iteratively, initialising each D(b) to the empty set ∅. Since set union is monotone, the analysis is also guaranteed to converge.

F_b(x) = x ∪ {b}        (2.3)
Step 2 can be performed trivially to find all loops. The set of blocks in the
loop body from n to m is given by S_n(m), defined recursively as follows:

S_n(b) = {n}                              if b = n
       = {b} ∪ ⋃_{p ∈ pred(b)} S_n(p)     otherwise        (2.4)
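The two steps above can be sketched as follows (an illustrative implementation over a small adjacency-list CFG, not the project's actual code; note that this sketch initialises dominator sets to the full block set, the usual formulation for an intersection-based analysis):

```java
import java.util.*;

// Illustrative natural-loop detection: compute dominator sets iteratively,
// then report back edges m -> n where n dominates m.
public class Loops {
    // doms.get(b) = set of blocks that dominate b; block 0 is the entry.
    static List<Set<Integer>> dominators(List<List<Integer>> preds) {
        int n = preds.size();
        Set<Integer> all = new HashSet<>();
        for (int b = 0; b < n; b++) all.add(b);
        List<Set<Integer>> doms = new ArrayList<>();
        for (int b = 0; b < n; b++)
            doms.add(b == 0 ? new HashSet<>(Set.of(0)) : new HashSet<>(all));
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int b = 1; b < n; b++) {
                Set<Integer> meet = new HashSet<>(all);
                for (int p : preds.get(b)) meet.retainAll(doms.get(p)); // intersection
                meet.add(b);                                           // F_b(x) = x ∪ {b}
                if (!meet.equals(doms.get(b))) { doms.set(b, meet); changed = true; }
            }
        }
        return doms;
    }

    // Step 2: back edges m -> n with n ∈ D(m); each gives a natural loop
    // whose header is n.
    static List<int[]> backEdges(List<List<Integer>> succs, List<Set<Integer>> doms) {
        List<int[]> result = new ArrayList<>();
        for (int m = 0; m < succs.size(); m++)
            for (int n : succs.get(m))
                if (doms.get(m).contains(n)) result.add(new int[]{m, n});
        return result;
    }

    public static void main(String[] args) {
        // CFG: 0 -> 1, 1 -> 2, 2 -> 1 (loop back), 1 -> 3 (exit).
        List<List<Integer>> succs = List.of(
            List.of(1), List.of(2, 3), List.of(1), List.of());
        List<List<Integer>> preds = List.of(
            List.of(), List.of(0, 2), List.of(1), List.of(1));
        List<int[]> loops = backEdges(succs, dominators(preds));
        System.out.println(loops.get(0)[0] + " -> " + loops.get(0)[1]); // 2 -> 1
    }
}
```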
2.7.3 Live Variable Analysis
Definition 5. A variable is live at a given point if, on some execution path
starting from that point, the variable is read before it is written to.
Live variables [1, p608] can be calculated using backward dataflow analysis on the lattice (℘(Vars), ⊆) using the join operator (i.e. set union) and the transfer function given in Equation 2.5, where Write(n) and Read(n) indicate the sets of writes and reads made by an instruction n.

F_n(x) = (x \ Write(n)) ∪ Read(n)        (2.5)
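Over a straight-line block, applying this transfer function backwards gives the live-in set directly. A small sketch (illustrative, not the project's LiveVariable class):

```java
import java.util.*;

// Illustrative liveness computation over a single basic block, applying the
// transfer function F_n(x) = (x \ Write(n)) ∪ Read(n) backwards.
public class Liveness {
    record Instr(Set<String> reads, Set<String> writes) {}

    // liveOut: variables live after the last instruction; returns the
    // variables live on entry to the block.
    static Set<String> liveIn(List<Instr> block, Set<String> liveOut) {
        Set<String> live = new HashSet<>(liveOut);
        for (int i = block.size() - 1; i >= 0; i--) {   // backward analysis
            Instr n = block.get(i);
            live.removeAll(n.writes());                 // x \ Write(n)
            live.addAll(n.reads());                     // ... ∪ Read(n)
        }
        return live;
    }

    public static void main(String[] args) {
        // a = b + c; d = a;   with d live afterwards
        List<Instr> block = List.of(
            new Instr(Set.of("b", "c"), Set.of("a")),
            new Instr(Set.of("a"), Set.of("d")));
        System.out.println(liveIn(block, Set.of("d"))); // b and c live on entry
    }
}
```

Across a whole CFG the same transfer function is iterated to a fixed point, with joins (set unions) taken over successor blocks.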
2.7.4 Constant Propagation

Forward dataflow analysis can be used to determine the value of a variable at a point in code, if it is a constant [1, p632]. For each variable v and block b, we maintain a result R_v(b) taken from the ‘flat’ lattice over constants—i.e. ({⊥, ⊤} ∪ Constants, ⊑) where:

x ⊑ y ⟺ (x = ⊥) ∨ (x = y) ∨ (y = ⊤)        (2.6)

R_v(b) = c ∈ Constants   if constant c is the value of v at the end of b
       = ⊤               if the value of v is not constant at the end of b
       = ⊥               if no writes are made to v before the end of b        (2.7)

This can be computed using the join operator with transfer function F_{n,v} for variable v as follows:

F_{n,v}(x) = c   if n assigns c to v
           = ⊤   if n writes a non-constant to v
           = x   otherwise        (2.8)
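The flat-lattice join and transfer function can be sketched as follows (illustrative, not the project's ReachingConstants class; ⊥ and ⊤ are modelled as sentinel objects):

```java
// Illustrative flat-lattice operations for constant propagation. BOTTOM means
// "no write seen yet"; TOP means "not a constant"; any other value is the
// constant itself.
public class ConstProp {
    static final Object BOTTOM = new Object();
    static final Object TOP = new Object();

    // Join (least upper bound) of two flat-lattice elements.
    static Object join(Object x, Object y) {
        if (x == BOTTOM) return y;
        if (y == BOTTOM) return x;
        if (x.equals(y)) return x;
        return TOP;                 // two different constants join to TOP
    }

    // Transfer function F_{n,v}: a constant assignment overwrites the value;
    // a non-constant write yields TOP; other instructions leave it alone.
    static Object transfer(Object before, boolean assigns, Object constant) {
        if (!assigns) return before;
        return constant == null ? TOP : constant;
    }

    public static void main(String[] args) {
        Object a = transfer(BOTTOM, true, 3);   // v = 3
        Object b = transfer(BOTTOM, true, 3);   // v = 3 on another path
        System.out.println(join(a, b));         // still the constant 3
        Object c = transfer(BOTTOM, true, 4);   // v = 4 on a third path
        System.out.println(join(a, c) == TOP);  // true: not constant
    }
}
```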
2.7.5 Data Dependencies

When considering whether two instructions or regions of code can be run in
parallel, the data dependencies between them must be considered. There are three
types: true dependencies (read-after-write), anti-dependencies (write-after-read)
and output dependencies (write-after-write). We can then determine whether
there are any loop-carried dependencies that prevent the loop from being executed
in parallel. The core requirements of the project require that the programmer
will consider this before marking a loop as parallel.
Determining dependencies automatically is desirable, but becomes hard as
soon as memory references are introduced, which Java does through objects and
arrays. Difficulty arises because writes to two distinct references can affect the
same state. Alias analysis aims to determine statically whether this may occur
at a given point in code. There are two variations of this problem, may-alias
and must-alias. For this project, may-alias is required, since by overestimating
conflicts to a memory address, it will always be safe (see Section 3.3.3 for this
analysis).
In languages such as Java, “with if statements, loops, dynamic storage, and
recursive data structures”, alias analysis can be shown to be undecidable by
reduction to the Halting Problem [15].
2.8 Summary
This chapter has given objective and verifiable requirements for the project. The
development and testing strategy that was employed to ensure these were met has
also been outlined. Finally, brief introductions to the Java Platform, NVIDIA’s
CUDA framework and some common compiler analysis techniques have been
given. It was from this base of knowledge and planning that the project was
started.
Chapter 3

Implementation
This chapter first outlines the overall implementation structure as well as the
central data structure. Some new analysis techniques are then introduced and
developed, before descriptions of individual compiler stages are given. Finally,
the overarching compiler tool is briefly explained.
The size of the implementation¹, despite containing a significant proportion
of boilerplate for supporting the JVM instruction set, is too large to describe in
detail. As such, this chapter gives a high-level view, identifying specific details
only when necessary. Further information is given in the appendices.
3.1 Overall Implementation Structure
The compilation process can be divided into five main stages: importing classes;
loop detection; kernel extraction including dependency checks; code generation;
and finally exporting new class files.
The final structure of the project implementation is shown by Figure 3.1. This
gives a high level view of class interactions with time running (roughly) down the
page. To keep the diagram relatively simple, commonly used classes have been
left out, notably graph.* (see Section 3.2), analysis.dataflow.SimpleUsed (see Section 3.3.4) and analysis.BlockCollector. Colour coding is used to
indicate when each class was added to the compiler.
¹ SLOCCount gives a total of 7686 lines.
Development cycles annotated in the figure:

1. Translation between bytecode and an internal representation.
2. Detection of loops for representation as structured control flow.
3. Detection of loop bounds and increments to give trivial loops.
4. Generation of C++ from bytecode.
5. Extraction of 1D kernels based on annotations, and generation of GPU wrappers.
6. Support of multiple dimension kernels.
7. Basic automatic dependency analysis.

Figure 3.1: Outline call graph for main classes, covering the bytecode.*, analysis and cuda.* classes together with the external libraries JOpt Simple, ASM and NVCC. (Colouring denotes the development cycle on which the code was written.)
Figure 3.2: Garbage collection of unreachable blocks. (Diagram: strong references keep reachable blocks alive, whilst blocks referenced only weakly become unreachable and can be collected.)
3.2 Internal Code Representation (ICR)
The internal representation of classes, methods and fields under transformation
is central to the compiler. This provides similar capabilities to the Java reflection classes but with added support for modification. Therefore, the graph
package contains ClassNode, Method, state.Field, Annotation and Modifier
as ‘replacements’ for the corresponding reflection classes. The Method class in
turn references a graph giving the implementation. It is on this graph that the
compiler analyses and transformations act.
3.2.1 Code Graph
The implementation graphs are made up of two main types of block: basic blocks and loops. For a block b, the notation pred(b) is used to denote its immediate
predecessors in the graph, and succ(b) its successors.
Definition 6. A basic block is a sequence of instructions i_1, . . . , i_n where only the first instruction may have multiple predecessors (|pred(i_k)| = 1 for 1 < k ≤ n), and only the last multiple successors (|succ(i_k)| = 1 for 1 ≤ k < n).
In my implementation, successors are represented by a standard set. However,
in order to minimise the housekeeping required when modifying the graph, the
predecessor set is stored internally as a weakly referenced list (util.WeakList).
Then whenever it is accessed, it is returned as a standard set. By using weak
references, any code that becomes unreachable can be garbage-collected as shown
in Figure 3.2. A list is used internally to count how many links exist from each
predecessor, making it easy to update. For example, a switch instruction might
branch to the same block for multiple cases. If one of these were changed, it
would be necessary to determine whether to modify the predecessor set of the
destination block.
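A minimal sketch of such a weakly-referenced predecessor list is shown below (illustrative only, not the project's actual util.WeakList; the internal list tolerates duplicates, and collapsing to a set happens on read):

```java
import java.lang.ref.WeakReference;
import java.util.*;

// Illustrative weakly-referenced predecessor list: entries whose referents
// have been garbage-collected are silently dropped when the list is read
// back as a set, so unreachable blocks impose no housekeeping.
public class WeakList<T> {
    private final List<WeakReference<T>> items = new ArrayList<>();

    public void add(T item) { items.add(new WeakReference<>(item)); }

    // Collapse the (possibly duplicated) weak list into a set of live elements.
    public Set<T> asSet() {
        Set<T> live = new HashSet<>();
        Iterator<WeakReference<T>> it = items.iterator();
        while (it.hasNext()) {
            T value = it.next().get();
            if (value == null) it.remove();   // referent collected: tidy up
            else live.add(value);
        }
        return live;
    }

    public static void main(String[] args) {
        WeakList<String> preds = new WeakList<>();
        String block = new String("blockA");
        preds.add(block);
        preds.add(block);   // e.g. a switch branching to the same block twice
        System.out.println(preds.asSet().size()); // duplicates collapse to 1
    }
}
```

Keeping the duplicate count in the list (rather than a set) is what makes the switch-case update described above cheap: removing one edge just removes one list entry.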
Result producing instructions (Producer):
  Arithmetic      *ADD, *SUB, *MUL, *DIV, *REM, *AND, *OR, *SHL, *SHR
  Negate          *NEG
  Convert         *2*
  Constant        *CONST_*, LDC, *PUSH
  Compare         *CMP*
  ArrayLength     ARRAYLENGTH
  NewArray        ANEWARRAY
  NewMultiArray   MULTIANEWARRAY
  NewObject       NEW
  CheckCast       CHECKCAST
  InstanceOf      INSTANCEOF
  Read            *LOAD, GETSTATIC, GETFIELD
  Call            INVOKE*

Stateful instructions (Stateful):
  Write           *STORE, PUTSTATIC, PUTFIELD
  Read            *LOAD, GETSTATIC, GETFIELD
  Call            INVOKE*
  Increment       IINC

Branching instructions (Branch):
  Return          RETURN
  ValueReturn     *RETURN
  Condition       IF*
  Switch          TABLESWITCH, LOOKUPSWITCH
  Throw           ATHROW
  TryCatch        N/A

Other instructions:
  Unsupported     RET, JSR (these are used for finally blocks), MONITOR*
  StackOperation  SWAP, POP, POP2, DUP, DUP2, DUP_X1, DUP_X2, DUP2_X1, DUP2_X2

Table 3.1: Summary of JVM instructions and their internal representation.
The instructions within a basic block are connected in a directed acyclic graph
that gives the dataflow representation of the code. In general, this forms a graph
rather than a tree since each instruction can be used as an argument to multiple
other instructions. Whilst the original bytecode will have an order for all instruc-
tions within a basic block, this ordering is only important for stateful instructions.
Therefore, each basic block also holds a timeline of stateful instructions, and a
final branch instruction. This approach sits between the common techniques:
linear lists of instructions; and complete dataflow graphs. A full summary of
instructions and their internal groupings is given in Table 3.1.
Definition 7. An instruction is stateful if the time at which it is executed may
affect its result or effect.
Loops are represented by the start and end blocks for the body.
As an example, the ICR for the Mandelbrot computation (Example 1.1) is
shown in Example 3.1.
Example 3.1: Graph for Mandelbrot computation. (The diagram shows the ICR built from Example 1.1: an entry block leading to nested loops over y and x, bounded by ->height and ->width; each iteration reads ->spacing to compute Cr and Ci from the loop indices, iterates Zr/Zi (via ZrN/ZiN) up to ->iterations times, and writes the scaled iteration count through ->data into the output array.)
Example 3.2: UML sequence diagram for Visitor pattern operation. (A BlockExporter calls accept(ie) on a Write node w, which calls visit(w) back on the InstructionExporter ie; the exporter queries getState(), then recursively accepts the Arithmetic operand a, which calls visit(a) and is queried for getOperandA() and getOperandB().)
3.2.2 Visitor Pattern

In order for other classes to traverse this structure easily, the visitor pattern [10, p331] is utilised for both the control and dataflow graphs. The abstract classes
graph.BlockVisitor and graph.CodeVisitor emulate multiple dispatch which
is not supported natively by the JVM. With multiple dispatch, the choice of
method to invoke is based on the runtime type of all arguments. The JVM does
support single dispatch , where the runtime type of the object, but not the argu-
ments, is considered. The visitor pattern makes use of this in its implementation,
as shown in Example 3.2.
In addition to the above, a decorator (analysis.CodeTraverser) is provided for the code graph that causes a child visitor to do a depth-first traversal of a
given dataflow graph.
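The double-dispatch trick can be sketched as follows (illustrative class and method names, not the project's actual graph.CodeVisitor hierarchy): accept() dispatches on the node's runtime type, and the statically chosen visit() overload inside each accept() supplies the second "dispatch".

```java
// Illustrative visitor pattern over a tiny two-node instruction hierarchy.
interface Node { void accept(Visitor v); }

class WriteNode implements Node {
    final String variable;
    WriteNode(String variable) { this.variable = variable; }
    public void accept(Visitor v) { v.visit(this); }   // static type: WriteNode
}

class ArithmeticNode implements Node {
    final String op;
    ArithmeticNode(String op) { this.op = op; }
    public void accept(Visitor v) { v.visit(this); }   // static type: ArithmeticNode
}

interface Visitor {
    void visit(WriteNode n);
    void visit(ArithmeticNode n);
}

public class VisitorDemo implements Visitor {
    final StringBuilder log = new StringBuilder();
    public void visit(WriteNode n) { log.append("write ").append(n.variable).append(";"); }
    public void visit(ArithmeticNode n) { log.append("arith ").append(n.op).append(";"); }

    public static void main(String[] args) {
        VisitorDemo v = new VisitorDemo();
        for (Node n : new Node[]{ new WriteNode("x"), new ArithmeticNode("ADD") })
            n.accept(v);                 // single dispatch picks each accept()
        System.out.println(v.log);       // write x;arith ADD;
    }
}
```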
3.2.3 Bytecode to ICR Translation
The internal code representation must be interchangeable with JVM bytecode.
The stack-based nature of the JVM makes this relatively straightforward in the
double[][] arr = {{0.1, 0.2}};

ICONST_1
ANEWARRAY "[D"
DUP
ICONST_0
ICONST_2
NEWARRAY double
DUP
ICONST_0
LDC double 0.1d
DASTORE
DUP
ICONST_1
LDC double 0.2d
DASTORE
AASTORE
ASTORE 2

(1) Bytecode

(2) Graph: both NewArray nodes feed several Write nodes, so their results are used more than once; multiple arrows out of a node imply a DUP instruction (or similar) is needed.

Example 3.3: Basic block that causes difficulties when exporting.
standard cases, although there are some issues that make the general case more
difficult.
Rather than producing import and export code from scratch, a class reading
library, ASM [6], was used. This provides visitor pattern access to the files rather
than producing any data structures. The library does this to remain lightweight
and fast for applications that can perform transformations in a single pass (i.e. do
not need to store the bytecode). It also allows use with whatever data structures
an application might require.
For importing, the timeline and dataflow graph for a basic block can be built
in a single pass through the code using symbolic execution, with the standard
operand stack containing graph nodes rather than real results. In cases where
the operand stack is not empty at the end of a basic block (as occurs with
ternary conditionals—expr ? a : b), the values are stored as being ‘emit-
ted’ by the block and successor blocks are marked as ‘accepting’ values of the
respective types. These values can then be accessed using the RestoreStack pseudo-instruction.

Unfortunately, exporting to bytecode is only easy in cases where stack operations (e.g. DUP, POP, . . . ) are not required, since these are represented implicitly
by the structure of the dataflow graph rather than individual nodes (see Example
3.3). Therefore, the compiler makes use of the correct bytecode sequence that is
seen for each basic block in the input class file by maintaining a cache².

² Using a WeakHashMap so entries are not held unnecessarily if a basic block is discarded.
In the case of code inserted by the compiler, no stack operations are required
since:
• Dataflow graphs form a tree (i.e. results are never used more than once, so no need for DUP etc. instructions).
• Reads and result producing calls occur in the timeline in the same order as
given by a depth-first search of the dataflow graph.
• Results of calls are always used.
It is therefore possible to produce bytecode by performing a depth-first search of
the dataflow graph corresponding to each timeline entry in order.
It is worth noting that all code in a transformed class is exported from the
above structure. Whilst it may have been possible to simply copy unmodified
methods, or even portions of methods, from the original class, this approach
would have been less elegant, and required either a second pass of the original
file, or storage of all original bytecode. A consequence of this decision is that
it is necessary for all instructions (including monitor and exception operations)
to be handled by the code representation, even if they cannot be executed on a
graphics card.
3.2.4 Type Inference
Compilation to bytecode loses the majority of type information, so it is necessary
to infer types, in order to copy state onto and off a graphics processor. Primitive
types are clear from the instruction used to load the value. However, reference
types can only be inferred by usage. This is achieved using a Damas-Milner style
type-checking algorithm [8]. At each instruction, we take a fresh type corre-
sponding to the usage and unify (Figure 3.3) this with the type maintained for
the object operated on. This process ensures that the stored type is valid for all
contexts. If unification ever fails, then this indicates that the input bytecode was
badly typed. Table 3.2 gives details of the unification operation performed for
some instructions.

Unfortunately, the existing Type class provided by ASM had a private constructor, so could not be extended to include the unification functionality. Therefore, graph.Type is based heavily on the ASM code, supplemented with unification and some convenient methods for dealing with array types.
Type inference is slightly complicated by reuse of local variables—for instance,
in Example 3.4, variables i and j are likely to share a location on the local variable
stack. We can overcome this by using live variable analysis (Section 2.7.3) to
if x is a supertype of y then
    x ← y
else if y is a supertype of x then
    y ← x
else
    return failure
end if

Figure 3.3: Unification algorithm.
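For reference types this amounts to keeping the more specific of the two, or failing if they are incompatible. A simplified sketch, using java.lang.Class in place of the project's graph.Type:

```java
import java.util.Optional;

// Simplified unification of two reference types: keep the more specific
// type, or fail if neither is assignable from the other. The project's
// graph.Type works on JVM type descriptors; Class<?> is used here for brevity.
public class Unify {
    static Optional<Class<?>> unify(Class<?> x, Class<?> y) {
        if (x.isAssignableFrom(y)) return Optional.of(y); // y is more specific
        if (y.isAssignableFrom(x)) return Optional.of(x); // x is more specific
        return Optional.empty();                          // badly typed input
    }

    public static void main(String[] args) {
        // Usage as a Number and as an Integer unifies to Integer.
        System.out.println(Unify.unify(Number.class, Integer.class).get().getSimpleName());
        // Incompatible usages indicate badly typed bytecode.
        System.out.println(Unify.unify(Integer.class, String.class).isEmpty());
    }
}
```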
Instruction           Unification performed
PUT/GETSTATIC         The value passed/returned must unify with the type of the static field.
PUT/GETFIELD          The object type must unify with the owner class of the field, and the value passed/returned must unify with the type of the field.
<T>ALOAD/<T>ASTORE    The object given must unify with an array of element type <T>.
CALL                  Each argument's type must unify with the corresponding type in the method descriptor.

Table 3.2: Unification details.
1 for (int i = 0; i < 10; i++) f(i);
2 for (int j = 0; j < 10; j++) g(j);

Example 3.4: Reuse of local variable locations.
determine the live ranges of each variable and ensure that the types across a
range are consistent. Since the unification algorithm is simple and not time-
consuming, these unification steps are integrated into the live variable analysis
code. Thus, at the end of each method import, live variable analysis is performed on the code to infer the types.
3.3 Dataflow Analysis
A general framework for dataflow analysis was outlined in Section 2.7.1. Here
specific dataflow analyses that were developed for use in the compiler are described.

i++;              Ri = 1, Rj = 0
j++;              Ri = 1, Rj = 1
if (...) {
    i += 3;       Ri = 4, Rj = 1
} else {
    i += 2;       Ri = 3, Rj = 1
    j = i + 10;   Ri = 3, Rj = ⊤
    i++;          Ri = 4, Rj = ⊤
}
                  Ri = 4, Rj = ⊤

Example 3.5: Results from increment variable analysis computation.
3.3.1 Support for Arrays and Objects
The live variable analysis previously given explicitly excludes array and object
accesses. However, for analysis of JVM bytecode, this is insufficient. The simple
approach taken here defines the effect_n function such that array and object variables become live when any of their elements or fields are either read or written.
The only way the variable can stop being live is if it is directly assigned a value
(e.g. a new array or object reference). This ensures safety.
3.3.2 Increment Variables

This analysis returns information about integer-typed variables for which it is possible to statically determine the effect of a region of code. The result for each variable is taken from a flat lattice over integers ({⊤} ∪ Z, ⊑) with:

x ⊑ y ⟺ (x = y) ∨ (y = ⊤)        (3.1)

The result for a variable v at the end of a block b, R_v(b), has the behaviour described by Equation 3.2 (also see Example 3.5).

R_v(b) = n ∈ Z   if the overall effect on v is to increment by n
       = ⊤       if v is written to in a more complex manner        (3.2)

Note that this also includes ‘decrement’ variables (i.e. n < 0). The results can be calculated using forward dataflow analysis with the join operator (least upper bound) and a transfer function as defined below. Each R_v(b) is initialised
to 0.

F_{n,v}(X) = X + i   if n increments v by i and X ∈ Z
           = ⊤       if n writes to v in a more complex manner
           = X       otherwise        (3.3)
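The transfer function and join can be sketched as follows (illustrative, not the project's IncrementVariables class; ⊤ is modelled as null, and the scenario in main mirrors the branches of Example 3.5):

```java
// Illustrative increment-variable analysis: values are either a known net
// increment (an Integer) or TOP ("written in a more complex manner"),
// modelled here as null.
public class IncrementVars {
    static final Integer TOP = null;

    // F_{n,v}: apply one instruction's effect on variable v.
    static Integer transfer(Integer before, boolean increments, int by,
                            boolean complexWrite) {
        if (complexWrite || before == TOP) return TOP;
        return increments ? before + by : before;
    }

    // Join of two paths: equal increments survive, otherwise TOP.
    static Integer join(Integer x, Integer y) {
        return (x != TOP && x.equals(y)) ? x : TOP;
    }

    public static void main(String[] args) {
        // then-branch: i += 3; else-branch: i += 2 then i++ — both add 3
        // on top of an initial i++ (net increment 1).
        Integer thenPath = transfer(1, true, 3, false);
        Integer elsePath = transfer(transfer(1, true, 2, false), true, 1, false);
        System.out.println(join(thenPath, elsePath)); // 4 on both paths
    }
}
```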
Theorem 1. Iterative computation of increment variables converges.

Proof. Since our lattice does not contain ⊥, we must adopt a different style of proof. Suppose the analysis does not terminate; then there must be a loop which increments a variable v. However, there must be an entry point, and for the outermost loop this gives a fixed increment for v. Therefore, the join on entering the loop will give ⊤ for v, and since ∀n, v. F_{n,v}(⊤) = ⊤, the analysis must terminate. Hence, we have a contradiction and our assumption of non-convergence must be incorrect.
3.3.3 May-Alias
May-alias analysis is used in the compiler to establish which variables may be
affected by a write. This is then used both to determine which variables must be
copied back off the graphics card, and also in automatically detecting dependen-
cies. Computing may-alias sets is the most complex analysis performed in the
compiler. The approach presented here is an approximation, flagging some cases
as inaccurate.
Whilst reference states (i.e. array elements and object fields) are represented
within the compiler as chains of reads (e.g. a[i] would first read a and then
an element), for the description here, states will be considered as in Equation
3.4 (where c ∈ Call represents the return value of a call). I also define loose
states (Equation 3.5) that allow comparison ignoring array indices, and a function
(Equation 3.6) to ‘loosen’ states.
State ::= v | s | c               where v ∈ Var, s ∈ Static, c ∈ Call
        | State.f                 where f ∈ Field
        | State[expr]                                                   (3.4)

LooseState ::= v | s | c          where v ∈ Var, s ∈ Static, c ∈ Call
            | LooseState.f        where f ∈ Field
            | LooseState[•]                                             (3.5)
loosen(s) = loosen(p).f    if s = p.f
          = loosen(p)[•]   if s = p[expr]
          = s              otherwise        (3.6)
Forward dataflow analysis can then compute, for each block b, a result M (b),
where M (b)(s) gives the set of states which may share the same reference as s.
For example, consider the code:
a = b; a[f(x)] = objA; b[g(x)] = objB;
Statically, f (x) and g(x) may be unknown, so we should deduce that any
element in either array a or b could point to either objA or objB (i.e.
{a[
•]
→ {objA, objB
}, b[
•]
→ {objA, objB
}}).
We use the lattice over functions (LooseState → ℘(State), ⊑) with ⊑ as defined in Equation 3.7. Therefore, joins can be considered as pointwise union. M*_m : State → ℘(State) (Equation 3.8) gives the closure under dereferencing of a function m : LooseState → ℘(State).

f ⊑ g ⟺ ∀s. f(s) ⊆ g(s)        (3.7)

M*_m(s) = m(loosen(s)) ∪ {x.f | x ∈ M*_m(p)}    if s = p.f
        = m(loosen(s)) ∪ {x[e] | x ∈ M*_m(a)}   if s = a[e]
        = m(loosen(s))                          otherwise        (3.8)
The transfer function is defined in Equation 3.9, where a ←τ b indicates a
write to a of a value b with type τ. The five different cases will be referred to as A to E.
Fn(m) = λy.
    Recurse(m, c)      if n = c and y = c                                (A)
    M∗m(x)             if n = (v ←ref x) and y = v                       (B)
    M∗m(x) ∪ m(y)      if n = (a[•] ←ref x) and ∃a′ ∈ M∗m(a). y = a′[•]  (C)
    M∗m(x) ∪ m(y)      if n = (o.f ←ref x) and ∃o′ ∈ M∗m(o). y = o′.f    (D)
    m(y)               otherwise                                         (E)
                                                                         (3.9)
The initial value, M init, at the entry of a code graph must be provided and
should indicate which states might alias.
We must also maintain a set of states R from the lattice (℘(State), ⊆) that
contains all states which might be returned from a method (note that R is
associated with the function rather than with any particular block). The transfer function
int x = ...;                   m = {}, R = {}                                       B
List temp;                     m = {temp → {temp}}, R = {}
List[] temp2 = new List[1];    m = {temp2 → {new0}}, R = {}                         B
List[] data = ...;             m = {data → {data}}, R = {}                          B
temp = data[0];                m = {..., temp → {data[0]}}, R = {}                  B
temp = data[x];                m = {..., temp → {data[x]}}, R = {}                  B
temp2[0] = data[100];          m = {..., new0[•] → {data[100]}}, R = {}             C
temp2[0] = data[x];            m = {..., new0[•] → {data[100], data[x]}}, R = {}    C
return f(temp, temp2[0]);      m = {...}, R = {data[100], data[x]}                  A

List f(List a, List b) {       Minit = {a → {data[x]}, b → {data[100], data[x]}}
  if(Math.sqrt(4.0) < 4.0)     m = {...}, R = {}
    return a;                  m = {...}, R = {data[x]}
  else                         m = {...}, R = {data[x]}
    return b;                  m = {...}, R = {data[x], data[100]}
}

The case of Fn that is applied is given on the right-hand side.

Example 3.6: Example inter-procedural may-alias computation.
Gn below computes R using the current m : LooseState → ℘(State) as context.
Gn(m, R) = R ∪ M∗m(s)    if n = RETURN(s)
           R             otherwise                                       (3.10)
This allows Recurse(m, c) to be defined as R from recursive analysis on f
(where c = f (a0, . . . , an)), with M init given by Equation 3.11. However, no alias
information other than R is returned, so if the function contains reference writes
(i.e. x ←ref y) then the analysis must be marked inaccurate.
Minit(s) = M∗m(ai)    if s = vi and i ≤ n
           M∗m(s)     if s ∈ Static
           ∅          otherwise                                          (3.11)
Example 3.6 gives an example of the results achieved when the inter-procedural case is used.

The analysis that has been described so far may not terminate (Example 3.7). Therefore, the number of iterations is bounded, and if convergence does not
occur, the analysis is flagged inaccurate.
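The bounding can be sketched generically as follows (a hypothetical illustration: the compiler iterates over alias maps, whereas this sketch uses integers for simplicity):

```java
import java.util.function.UnaryOperator;

public class BoundedFixpoint {
    public static class Result<T> {
        public final T value;
        public final boolean accurate;  // false if iteration was cut off
        Result(T v, boolean a) { value = v; accurate = a; }
    }

    // Apply f repeatedly; if no fixed point is reached within maxIters,
    // return the last value but flag the analysis as inaccurate.
    public static <T> Result<T> solve(T init, UnaryOperator<T> f, int maxIters) {
        T cur = init;
        for (int i = 0; i < maxIters; i++) {
            T next = f.apply(cur);
            if (next.equals(cur)) return new Result<>(cur, true);
            cur = next;
        }
        return new Result<>(cur, false);
    }
}
```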
3.3.4 Usage Information
At various stages in the compiler, it is useful to know the set of accesses made
by a block of code. Accesses are either direct or indirect:
                         {a[•] → {a0}}
while(a[i] != null) {
  b = a[i].next;         {a[•] → {a0, a0.next, ...}, b → {a0.next, a0.next.next, ...}}
  a[i] = b;              {a[•] → {a0, a0.next, ...}, b → {a0.next, a0.next.next, ...}}
}

Example 3.7: Non-termination of may-alias analysis.
Definition 8. An access is direct if it accesses the value of a variable or static
field.
Definition 9. An access is indirect if it accesses a value in the heap—i.e. it
requires one or more dereferences. Each indirect access can be described by a list
of indices—for example, arr[i].video.data[x][y] corresponds to [i,x,y].
The class graph.dataflow.SimpleUsed collects sets of state for the categories:
variables used, statics used and state directly written. It also collects a
set of classes used. This is done simply by unioning across all instructions (i.e.
nothing is ever removed from these sets).
The case of indirect accesses is much harder to compute due to the effects of
aliasing. Therefore, the may-alias analysis described in the previous section is
used to form sets of all state that could have been written to or read from. This
is all done within the graph.dataflow.AliasUsed class.
3.4 Loop Detection
Loop detection is done in three stages: natural loop detection, loop trivialisation
and loop nesting. Example 3.8 shows the effect of these. The first is implemented
as a version of the algorithms in Section 2.7.2, restricted to cases with both a
single entry and single exit. This corresponds to the style of loops that can
be executed in parallel on GPUs (see Section 2.6.1). Loop nesting is done by
checking whether a loop is contained in the body of another.
3.4.1 Loop Trivialisation
In order to execute a loop on a graphics processor, it is necessary that the di-
mensions and limits of the loop can be determined. The compiler detects these
automatically for trivial loops as defined below. The definition is more inclusive
than that used in JavaB [4], with positive or negative increments to the loop
variable permitted anywhere in the loop body.
S ← root level loops
while S is not empty do
  l ← S.remove
  if extract(l) fails then
    S.add(l.children)
  end if
end while

Figure 3.4: Outline of kernel extraction algorithm.
Definition 10. A loop is trivial if there is only a single conditional branch that
exits the loop after comparing the loop index i with an expression. Furthermore, no writes can occur before the branch, and i must be an ‘increment variable’ as
defined by the analysis of Section 3.3.2.
Therefore a trivial loop is defined by its index, its limit and a mapping between
its increment variables (of which the index must be one) and their increments.
These can be detected by the increment variables analysis in Section 3.3.2, along
with inspection of the exit condition, and are represented by extended loop nodes
in the code graph.
3.5 Kernel Extraction
In order to extract kernels from loop bodies, the tree provided by the nesting stage
must be considered, since it is not possible to extract both an outer loop and one
of its inner loops independently. In this project, outer loops are parallelised
preferentially since this minimises the number of data copies to and from the
GPU. This gives the outline algorithm for kernel extraction shown in Figure 3.4.
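The outline in Figure 3.4 amounts to a worklist traversal of the loop-nesting tree. A minimal sketch (Loop and the extraction predicate below are hypothetical stand-ins for the compiler's own classes):

```java
import java.util.*;
import java.util.function.Predicate;

public class KernelExtraction {
    public static class Loop {
        public final String name;
        public final List<Loop> children;
        public Loop(String name, Loop... children) {
            this.name = name;
            this.children = List.of(children);
        }
    }

    // Prefer outer loops; descend to a loop's children only when
    // extracting the loop itself fails.
    public static List<String> extractAll(List<Loop> roots, Predicate<Loop> extract) {
        List<String> kernels = new ArrayList<>();
        Deque<Loop> s = new ArrayDeque<>(roots);
        while (!s.isEmpty()) {
            Loop l = s.remove();
            if (extract.test(l)) kernels.add(l.name);
            else s.addAll(l.children);
        }
        return kernels;
    }
}
```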
For the one-dimensional case, extract(l) simply uses a dependency checker
to determine whether the loop l can be run in parallel, and if so attempts to
extract it. Note that an extraction may fail due to limitations of the CUDA architecture; this type of failure is handled exactly as though the dependency
check failed.
For the n-dimensional case, the first level is checked as for the 1D case. For
subsequent levels, it is required that there is only one loop child and also that
the form in Figure 3.5 is followed before the level may be added as a further
dimension of the kernel.
7/31/2019 2010 Javagpu Diss
http://slidepdf.com/reader/full/2010-javagpu-diss 47/103
3.6. DEPENDENCY ANALYSIS 35
for(...) {                         ← Outer Loop
  v = constant (∀v ∈ Inc_inner)    ← Checked using Constant Propagation (Section 2.7.4)
  for(...) {
    ...
  }                                ← Parallel Inner Loop (checked by dependency checker)
  v += constant (∀v ∈ Inc_outer)
}

Figure 3.5: Form of multiple dimension kernels.
There may be other viable approaches that don’t always select the outer loop
if the compiler were capable of leaving state in GPU memory between kernel
invocations, as was done in [16] with multi-pass loops, but these are not considered
here.
3.5.1 Copy In
The copy in state for a kernel is the set of state that must be supplied to the
GPU for kernel execution. This is the set of variables made live by the loop body
plus any dimension indices not already in this set.
3.5.2 Copy Out
Since the kernel is executed in parallel, all direct writes should be local to the kernel (i.e. not live immediately following the loop). If this were not the case,
then an output dependency would exist. The copy out set is therefore given
by the indirect writes set computed by analysis.dataflow.AliasUsed (Section
3.3.4). When the may-alias analysis is flagged as inaccurate, all copy in state is
included in the copy out set.
3.6 Dependency Analysis
The dependency analysis portion of the compiler is used by the kernel extraction
stage (Section 3.5) to determine whether it is safe to parallelise a given loop.
Both the user annotation and automated checks implement the same interface
(DependencyCheck) so can be used interchangeably.
3.6.1 Annotation Based
Developers can use method annotations to both express explicit parallelism and
override automatic analysis. The annotation (@Parallel) has a single property, loops, that takes an array of index variable names for trivial loops which should be
executed in parallel. This still requires that the corresponding loop is detected
and found to be of a trivial form. The class must have been compiled with
debugging information so that variable names are available.
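As an illustration, an annotated loop might look like the following. The annotation definition here is a hypothetical reconstruction: only the loops property is described above, and the package, retention policy and target are assumptions.

```java
import java.lang.annotation.*;

public class ParallelExample {
    // Hypothetical reconstruction of the @Parallel annotation described above.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface Parallel { String[] loops(); }

    @Parallel(loops = {"i"})
    public static void scale(float[] a, float s) {
        for (int i = 0; i < a.length; i++) {
            a[i] *= s;  // trivial loop on index variable 'i'
        }
    }
}
```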
3.6.2 Automatic
This test consists of two checks to ensure there are no loop-carried dependencies:
Direct Writes. All direct writes must be to variables that are local to the loop body—i.e. the variable must not be live either at the start of the loop body,
or immediately following the loop.
Indirect Writes. Momentarily ignoring the effect of aliasing, we compare each
write with all accesses (including itself) to the same loose state (i.e. states
that are the same ignoring array indices, see Equation 3.5). To be sure
they don’t access the same location on different iterations, there must be an
increment variable at the same position in each list of indices (see Definition
9 of indirect accesses). The variable must also have been incremented by
the same amount in each access. Several examples are given in Example 3.9.
The effects of aliasing are managed by the AliasUsed class, which expands
each write to all states it may have affected. The may-alias analysis is
initialised using information provided by @Restrict annotations. When
marked as such, the programmer is asserting that the variable, and all
references reachable from it, do not alias with any other state. If the may-alias analysis is flagged as inaccurate, then the loop is not accepted.
3.7 Code Generation
The top level algorithm for code generation deals with the difficulty inherent in
code generation for CUDA, which can fail due to both unsupported instructions
(e.g. exceptions, monitors and memory allocation) and calls to methods in classes
not supplied to the compiler.
short[] f(short[] data, short[] dummy) {
  if(Math.sqrt(4.0) < 4.0) {
    return data;
  } else {
    return dummy;
  }
}

void compute() {
  short[][] dummy = new short[height][];
  for(int y = 0; y < height; y++) {
    for(int x = 0; x < width; x++) {
      ...
      dummy[y] = data[y];
      f(dummy[y], data[y])[x] = ...;
    }
  }
}

(1) Correct Acceptance

while(i < LIMIT) {
  arr[i] = ...
  i++;
  arr[i] = ...
  i++;
}

(2) False Rejection

while(i < LIMIT) {
  arr[i] = ...
  i += 2;
  arr[i] = ...
  i--;
}

(3) Correct Rejection

Example 3.9: Examples of the automatic dependency check.
Before outputting code for any method or kernel, all of the static fields, classes
and methods on which it depends (Section 3.3.4) must be exported. This is
implemented by buffering all C++ code and recursing onto a new buffer whenever
a call is reached. Only when a method is completely exported, along with its
recursions, is its buffer flushed. As a result, some methods may be exported and
then never used, since they were exported for a kernel that later failed to export.
I will now describe how the C++ code generation itself works, before moving
onto describing the ‘launcher’ method that is called in place of parallelisable
loops to execute the kernel. Details regarding naming conventions are given in
Appendix B.
Finally, an extension of Java’s PrintStream (cuda.Beautifier) indents code
based on the location of curly braces. This was done to facilitate debugging.
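The indentation logic can be sketched as follows (a simplified stand-in for cuda.Beautifier, operating on a whole string rather than extending PrintStream):

```java
public class Beautifier {
    // Indent each line according to the nesting depth implied by '{' and '}'.
    public static String indent(String code) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        for (String line : code.split("\n")) {
            String t = line.trim();
            long opens = t.chars().filter(c -> c == '{').count();
            long closes = t.chars().filter(c -> c == '}').count();
            // A line starting with '}' closes the block it belongs to,
            // so it is printed one level shallower.
            int thisDepth = depth - (t.startsWith("}") ? 1 : 0);
            out.append("  ".repeat(Math.max(0, thisDepth))).append(t).append('\n');
            depth = Math.max(0, depth + (int) (opens - closes));
        }
        return out.toString();
    }
}
```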
ILOAD 2 (x)
I2F
ALOAD 0 (this)
GETFIELD spacing:F
FMUL
LDC 1.5f
FSUB
FSTORE 5

(1) Bytecode

Read this
Read x
Read spacing
× 1.5f
−
Write Cr

(2) Code Graph

const jint t0 = v2_INT;
const Object<Data_samples_Mandelbrot> t1 = v0_2101451235;
const jfloat t2 = DEVPTR(t1.device)->spacing;
const jfloat t3 = (jfloat) t0;
const jfloat t4 = t3*t2;
const jfloat t5 = 1.5f;
const jfloat t6 = t4-t5;
v5_FLOAT = t6;

(3) C++

Example 3.10: C++ code generation for float Cr = (x * spacing - 1.5f);.
3.7.1 C++
Exporting the basic blocks to C++ is performed with a depth-first search of each
timeline entry in turn. This ensures that stateful instructions are executed in the
correct order, and that all arguments are generated before their use. Results from
instructions (i.e. Producers, see Table 3.1, page 22) are assigned to temporary
const variables. The names of these temporary variables are stored in a map so
that each instruction is only visited once. An example of a basic block and its
exported form is given in Example 3.10.

Control flow is exported using a combination of while, for recognised loops, and goto, for all conditionals and loops not detected.
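The depth-first export of producers to const temporaries can be sketched as follows. Node is a hypothetical stand-in for the compiler's instruction graph; the naming follows the t0, t1, ... pattern of Example 3.10:

```java
import java.util.*;

public class ExprExport {
    public static class Node {
        public final String op;
        public final Node[] args;
        public Node(String op, Node... args) { this.op = op; this.args = args; }
    }
    private final List<String> lines = new ArrayList<>();
    private final Map<Node, String> names = new IdentityHashMap<>();

    // Depth-first: arguments are emitted before use, and the memo map
    // ensures each instruction is visited (and named) only once.
    public String emit(Node n) {
        if (names.containsKey(n)) return names.get(n);
        String rhs = (n.args.length == 0)
                   ? n.op
                   : emit(n.args[0]) + n.op + emit(n.args[1]);
        String name = "t" + names.size();
        names.put(n, name);
        lines.add("const jfloat " + name + " = " + rhs + ";");
        return name;
    }
    public List<String> lines() { return lines; }
}
```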
3.7.2 Kernel Invocation
The kernel is invoked on the graphics processor using the CUDA runtime library.
This requires dimensions for both the grid of blocks and the blocks themselves
(see Section 2.6.1). Dimensions are chosen using the following rules and heuristics
to maximise performance and ensure the execution succeeds. For each dimension
i, the grid size is denoted by gi, the block size by bi and the number of required iterations by ri.

1. ∏i bi is less than or equal to the maximum number of threads per block. This is governed by register and shared memory usage of the kernel.

2. b1 must be a multiple of the warp size (therefore the number of threads per block will also be a multiple of the warp size), or less than a single warp.

3. bi+1 > 1 =⇒ bi ≥ ri

4. gi = min{⌈ri/bi⌉, Gi}, where Gi is the maximum size of the grid in dimension i.

Object<T>           Array<T>
jobject object      jarray object
T* host             T* host
T* device           T* device
                    jsize length

Figure 3.6: Array and object type templates for on-GPU execution.
This means that the developer does not need to consider the specification of
their specific graphics card, or have knowledge of the threading model.
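For a single dimension, these rules amount to something like the following sketch (register and shared-memory limits are abstracted into the maxThreadsPerBlock argument; the names are illustrative):

```java
public class LaunchDims {
    // Returns {blockSize, gridSize} for r iterations in one dimension.
    public static int[] choose(int r, int warp, int maxThreadsPerBlock, int maxGrid) {
        int b;
        if (r < warp) {
            b = r;                                   // less than a single warp is allowed
        } else {
            b = (maxThreadsPerBlock / warp) * warp;  // largest warp multiple that fits
        }
        int g = Math.min((r + b - 1) / b, maxGrid);  // ceil(r / b), capped by grid limit
        return new int[] { b, g };
    }
}
```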
3.7.3 Data Copying
Primitive types are transferred directly into the corresponding C++ types. In the
case of doubles, data must first be switched to single precision if it is to be used
on cards without double precision support. For arrays of doubles, a single check
is made to determine whether this is necessary in order to avoid unnecessary
overheads.

For reference types, the C++ types, Object<T> and Array<T> (Figure 3.6),
are defined using template meta-programming, enabling recursive types to be
built up (e.g. Array<Array<Object<struct foo> > >). The object identifier
allows objects to be ‘switched’ during GPU computation (for example, reversing
the rows of a 2D array), while the host pointer is used to record the location in
host memory where the object is held. It would have been possible to free this
memory while the GPU code executed, reallocating space to perform the export.
However, I felt that the further allocation overheads outweighed any benefit.
On import, each reference is placed in a map to ensure it is not imported
twice. If this did occur and both copies were modified, then only one set of
changes would be preserved by the export. The map is also used as a list of
objects that must be exported. Without this, an object that became unreachable
as a result of the kernel might not be exported, even though it may still be
reachable from elsewhere in the program.
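The import map can be sketched with an IdentityHashMap, so that reference identity (rather than equals) decides whether an object has already been imported; the slot numbering here is illustrative:

```java
import java.util.*;

public class ImportMap {
    private final Map<Object, Integer> imported = new IdentityHashMap<>();

    // Returns the device 'slot' for obj, importing it only on first sight,
    // so that an object referenced twice is never copied twice.
    public int importObject(Object obj) {
        Integer slot = imported.get(obj);
        if (slot == null) {
            slot = imported.size();
            imported.put(obj, slot);
        }
        return slot;
    }

    // Every imported object must be considered for export, even if the
    // kernel made it unreachable from the original roots.
    public Set<Object> exportWorklist() {
        return imported.keySet();
    }
}
```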
Arrays
Arrays with primitive elements are imported using JNI functions that force
the JVM to provide direct access to the array without copying the data (GetPrimitiveArrayCritical and ReleasePrimitiveArrayCritical). This
avoids the need for two copies (first into a C buffer and then onto the device) at
the expense of halting the virtual machine’s garbage collector.
However, for arrays with reference-typed elements, each element must be read
separately and then imported appropriately, causing two copy stages.
Objects
Since CUDA devices support C structures, these can be used to represent Java
objects on the graphics processor. Unfortunately, populating these via the JNI API requires a function call to access each field of each object, which creates
noticeable overheads for large objects or large numbers of objects.
Memory Allocation
In order to minimise the number of memory allocations required, all device memory is obtained with a single allocation and then divided up as needed. This
also results in improvements in copy performance (see Section 4.2.1).
Similarly, the host memory for an array of objects is allocated in a single
batch rather than one-by-one.
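Dividing one allocation among many regions can be sketched as a bump allocator over a pre-computed total size (alignment handling is omitted here for brevity):

```java
public class DevicePool {
    private final long base;  // base address of the single device allocation
    private long next;

    public DevicePool(long base) { this.base = base; this.next = base; }

    // Hand out consecutive regions from the single allocation.
    public long alloc(long bytes) {
        long addr = next;
        next += bytes;
        return addr;
    }

    public long used() { return next - base; }
}
```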
Statics
Rather than passing statics to the kernel as arguments, which must in turn be
passed on to any other methods called, they are stored in CUDA’s __constant__
memory. The read-only nature of this memory is not a problem, since a static will
never be directly written to (Section 3.5.2). There are also possible performance
gains as it allows caching by the GPU. The restricted size of __constant__
memory (64KB for the card used in development) is unlikely to be an issue, since
even on 64-bit machines, Array<T> only requires 28 bytes⁴.

⁴JVM array lengths are defined as 32-bit integers even on 64-bit machines: jsize → jint → int.
3.8 Compiler Tool
The compiler is brought together in tools.Parallelise. This makes calls to the
stages of the compiler: import, the 3 stages of loop detection, kernel extraction
(which in turn performs dependency analysis and code generation) and finally
export. A description of the available arguments and their effects is given in
Appendix C. These are parsed by an open source library, “jopt simple”⁵.
The compiler also invokes the CUDA compiler (nvcc) automatically, so that a
developer does not need to understand the process of producing JNI compatible
libraries from CUDA code.
3.8.1 Feedback to the User
Compiler feedback is provided at a variety of levels⁶, ranging from just fatal
errors through to full debugging information. Logging messages are managed,
like command line arguments, by an external library, “log4j”⁷. As logging is a standard problem in many applications, it was unnecessary to implement a custom set of
logging classes. Log messages are categorised by the module of the compiler and
a level.
As well as controlling the verbosity of messages, when the logging level is set
to debug, debugging output is added to the generated CUDA code. This then
provides information regarding the invocation sizes used (Section 3.7.2) and a
breakdown of the GPU execution time into the following stages:
1. Importing data from Java (using JNI) and allocating any extra host memory
required, as well as calculating how much device memory to allocate.
2. Allocating device memory and copying data to the GPU.
3. Executing the kernel on the GPU.
4. Copying data back from the GPU.
5. Exporting any data back to Java as required and freeing memory resources.
3.9 Summary
In this chapter, I have given a complete overview of the internals of the parallelising compiler. This includes the theoretical basis for the analysis—most notably

⁵http://jopt-simple.sourceforge.net/
⁶The possible levels are FATAL, ERROR, WARN, INFO, DEBUG and TRACE.
⁷http://logging.apache.org/log4j/
CHAPTER 4

Evaluation
This chapter evaluates the compiler and presents a model of the overheads caused
by data copying to the GPU. An objective comparison with related work in the
literature is also provided. Descriptions of all sample code, including their origins,
are given in Appendix D.
4.1 Correctness
As described in Section 2.2, the compiler was developed by the gradual introduction of stages. It was checked that all sample code (see Appendix D) supported
at the time continued to produce correct results after compilation.
Unit tests were also performed for the analysis stages (Table 4.1). These
consisted of gold standards (Appendix E) for each of the sample codes that could
be compared with the given results.
Compiler Stage        Tests
Loop Detection        Correct number.
Loop Trivialisation   Correct increments and bounds.
Kernel Extraction     Correct dimensions and copy in state. Safe copy out state.
Code Generation       Specific code for different aspects (e.g. objects).
Dependency Analysis   Safe results.

Table 4.1: Tests made for each compiler stage.

For the scope of target programs defined by C4 (see Section 2.1), all tests
were passed. When moving outside this scope, specifically making use of object
inheritance, the code generation and dependency analysis stages both wrongly
assume that methods are final so that they can be exported, since it is not
possible to know what classes may later extend and override these methods. The
alternative of rejecting code generation in these cases would prevent many valid
compilations, since the final keyword is often omitted, even if applicable.
4.2 Performance
The performance benefits achievable using the compiler depend on the combina-
tion of the speedup due to parallel execution on the graphics processor, and the
overheads due to data copying.
The execution speedup is difficult to predict due to the differences between
GPU and CPU architectures. CPU execution time depends heavily on the
amount of instruction level parallelism that can be achieved through out-of-order
execution. Whilst the GPU is simpler in this respect, its performance can be
affected by the locality of memory accesses (due to coalescing, see Section 2.6.2)
and also the runtime effect of thread divergence (see Section 2.6.1). This sec-
tion therefore comments on the measured speedups rather than trying to predict
them.
The overheads are more predictable, allowing a model to be developed and
then tested against measurements made on the collection of sample code.
In order to achieve fair results, benchmarks were run on the dedicated machine (bing) with the GPU in dedicated mode. As far as possible, other programs were
terminated before benchmarking to avoid contention for CPU time. Benchmarks
were repeated 10 times and the median of these used. All execution timings were
made against wall clock time. Using CPU time would have given biased results,
since time spent on the GPU appears as I/O and would not have been included.
4.2.1 Model of Overheads
The overheads related to off-loading computation onto the graphics processor can
be split into four categories as in Section 3.8.1: importing from Java; copying to the GPU; copying back from the GPU; and exporting to Java. The operations
within these stages (Section 3.7.3) suggest the following costs. In general, I
expect these to behave linearly (i.e. an initial latency l∗, plus a further cost nt∗ depending on the size n of the operation).
Stopping Garbage Collection (lg). Since the ‘critical’ array access JNI functions are used, there will be a constant cost for stopping garbage collection.
Stage      Overhead Time
Import     lg + (lr ∑p A(p)) + S·ls          for all parameters p, where S is the number of statics used
Copy On    ld ∑p R(p) + td ∑p M(p)           for all parameters p
Copy Off   lh ∑p R(p) + th ∑p M(p)           for copy off parameters p
Export     lf + (lw ∑p A(p)) + (tw ∑p E(p))  for copy off parameters p

Table 4.2: Expected timings for overhead stages according to model.
JNI Reads (lr). For each read from Java, there will be a constant cost. This
also applies to the ‘critical’ array accesses, since no copy is performed.
Constant Setting (ls). When statics are used, CUDA constant memory must
be set.
Copies (ld, td, lh and th). Copies in each direction are likely to have different
bandwidths. I ignore the allocation cost at the beginning of the ‘copy on’
stage, since this will be negligible compared to the copy.
JNI Writes (lw and tw). The ‘critical’ array access functions allow changes to
be aborted, suggesting that a copy-on-write may occur internally. This is
therefore modelled as a linear cost.
Freeing (lf ). Finally, there is the cost of freeing the used device memory.
This gives the expressions in Table 4.2 for the overheads associated with each
of the four stages. These rely on knowing certain values for each parameter p of
the kernel.
• The number of accesses A( p) required to read or write the parameter from
Java.
• The total amount of memory E ( p) that is exported by these accesses.
• The number of memory regions R( p) that this data is spread out over.
• The total amount of memory M ( p) that the data occupies once in the C++
code (this is higher due to the representations shown in Figure 3.6).
These can be calculated recursively based on the type of the parameter (and
array lengths), as shown in Equations 4.1 to 4.4.
Accesses

A(primitive) = 0
A(array of primitive) = 1
A(array of τ) = length · (1 + A(τ))
A(object τ) = Σ_{τ′ ∈ fields(τ)} (1 + A(τ′))                             (4.1)

Exported Memory

E(primitive) = sizeof(primitive)
E(array of τ) = sizeof(pointer) + (length · E(τ))
E(object τ) = sizeof(pointer) + Σ_{τ′ ∈ fields(τ)} E(τ′)                 (4.2)

Memory Regions

R(primitive) = 0
R(array of object τ) = 2 + (length · Σ_{τ′ ∈ fields(τ)} R(τ′))
R(array of τ) = 1 + (length · R(τ))
R(object τ) = 1 + Σ_{τ′ ∈ fields(τ)} R(τ′)                               (4.3)

Total Memory

M(primitive) = sizeof(primitive)
M(array of τ) = 3 · sizeof(pointer) + 4 + (length · M(τ))
M(object τ) = 3 · sizeof(pointer) + Σ_{τ′ ∈ fields(τ)} M(τ′)             (4.4)
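Equations 4.1 to 4.4 can be computed recursively over a simple type description. The sketch below assumes 8-byte pointers, so an array of N primitive doubles gives A = 1, R = 1, E = 8 + 8N and M = 28 + 8N:

```java
public class CopyModel {
    static final int PTR = 8;  // assumed pointer size in bytes
    static abstract class Type {}
    static class Prim extends Type { final int size; Prim(int s) { size = s; } }
    static class Arr extends Type { final int length; final Type elem;
        Arr(int l, Type e) { length = l; elem = e; } }
    static class Obj extends Type { final Type[] fields; Obj(Type... f) { fields = f; } }

    static long A(Type t) {                    // Equation 4.1: JNI accesses
        if (t instanceof Prim) return 0;
        if (t instanceof Arr) {
            Arr a = (Arr) t;
            return (a.elem instanceof Prim) ? 1 : a.length * (1 + A(a.elem));
        }
        long s = 0; for (Type f : ((Obj) t).fields) s += 1 + A(f); return s;
    }
    static long E(Type t) {                    // Equation 4.2: exported bytes
        if (t instanceof Prim) return ((Prim) t).size;
        if (t instanceof Arr) { Arr a = (Arr) t; return PTR + (long) a.length * E(a.elem); }
        long s = PTR; for (Type f : ((Obj) t).fields) s += E(f); return s;
    }
    static long R(Type t) {                    // Equation 4.3: memory regions
        if (t instanceof Prim) return 0;
        if (t instanceof Arr) {
            Arr a = (Arr) t;
            if (a.elem instanceof Obj) {
                long s = 0; for (Type f : ((Obj) a.elem).fields) s += R(f);
                return 2 + a.length * s;
            }
            return 1 + a.length * R(a.elem);
        }
        long s = 1; for (Type f : ((Obj) t).fields) s += R(f); return s;
    }
    static long M(Type t) {                    // Equation 4.4: device bytes
        if (t instanceof Prim) return ((Prim) t).size;
        if (t instanceof Arr) { Arr a = (Arr) t; return 3 * PTR + 4 + (long) a.length * M(a.elem); }
        long s = 3 * PTR; for (Type f : ((Obj) t).fields) s += M(f); return s;
    }
}
```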
Measurement of Copy Parameters
As a preliminary test of the copy on and copy off models, a CUDA program that
measured the time taken to copy N arrays of N doubles (i.e. 8N bytes) was
written in C++. It became apparent that the model only holds when the copies
are within a single device memory allocation. The test program was therefore
extended to allow for a variety of memory locations both on the host and the
device. These were as follows:
Separate Each array is allocated separately with a call to the relevant memory
allocator.
Sequential The memory for all arrays is allocated at once, and then the locations
allocated sequentially from this pool.
Non-sequential Again the memory for all arrays is allocated at once, but the
regions of memory are allocated alternately from the start and end of this
pool. This was designed to simulate the case where the order of the copies
could not be predicted and would not be ‘in-order’.
[Figure: “Comparison of Copies” — time (ms) against N for separately allocated arrays versus a single allocation, with a cubic fit and the model prediction overlaid.]

Figure 4.1: Effect on copy performance (host-to-device) of single vs. multiple allocations.
As shown in Figure 4.1, the predicted model (N ld + 8N² td) is only followed
in the two cases of single allocation, with the separate case exhibiting cubic
behaviour. This also shows the improvement in copy performance that can be
achieved by performing just a single allocation. An appropriate modification was
therefore made to the compiler.
The model parameters given by gnuplot’s fitting function were ld = (8.07 ± 0.11) × 10⁻³ ms and td = (1.614 ± 0.002) × 10⁻⁶ ms byte⁻¹. The respective values for device-to-host copies were lh = (8.56 ± 0.15) × 10⁻³ ms and th = (2.548 ± 0.003) × 10⁻⁶ ms byte⁻¹.
NVIDIA provide a similar tool in their SDK that measures memory copy
performance. This gives results which can then be used to estimate td and th.
Timing single copies does not give sufficient accuracy to measure the latencies, so the values for ld and lh are taken from above. However, as shown in Figure 4.2,
for small copies (< 50KB) the model does not hold, with td and th taking varying
values as shown in Figure 4.3. For the remainder of the evaluation, I continue to
assume the simple linear model, but split each parameter into ‘small’ and ‘large’
values, for below and above 50KB respectively (i.e. td,small, td,large, . . . ).
[Figure: “Comparison of Copy Performance with Linear Model” — time (ms) against copy size (bytes) on log scales, for host-to-device and device-to-host measurements alongside ld + N td and lh + N th.]

Figure 4.2: Comparison of measured performance with model (using CUDA SDK).
[Figure: “Values of td and th assuming Constant Latency” — time per byte (ms/byte) against copy size (bytes) for td and th.]

Figure 4.3: Values of td and th for measurements (using CUDA SDK).
4.2.2 Component Benchmarks
Here I present a number of micro-benchmarks that compute sin²x + cos²x over a
sequence of random numbers (length N). Each version of the benchmark stores the sequence in a different manner. This allows the overheads model to be tested
on code produced by the compiler. It also evaluates whether speedups can be
achieved when very little computation is performed. The versions produced were:
Baseline The baseline version stores the numbers in a local 1D array.
Statics In this case, the numbers are stored as a static variable.
Objects Each number is placed inside a class, and the computation is performed
as a method of this class.
Two Dimensions The numbers are stored in a rectangular array with roughly
the same number of elements. The dimensions for the array were chosen as
√N × N/√N.
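The computation in the baseline version amounts to the following (a sketch of the benchmark kernel only; the actual benchmark times this over random input, and mathematically every result should be 1):

```java
public class Baseline {
    // sin²x + cos²x for each element of a local 1D array.
    public static double[] compute(double[] in) {
        double[] out = new double[in.length];
        for (int i = 0; i < in.length; i++) {
            double s = Math.sin(in[i]);
            double c = Math.cos(in[i]);
            out[i] = s * s + c * c;
        }
        return out;
    }
}
```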
Using the model in the previous section, the overhead times for each of these
versions can be predicted—as shown in Table 4.3. Measurements for a range
of N can then be used to assess whether the model fits accurately and to give
estimates for its parameters. Due to the nature of the parameters, it is necessary
to consider the complete data set (e.g. in the static case lg and ls could be varied
arbitrarily provided lg + ls gave a suitable value). The measured values are shown
in Table 4.4 and are reasonably consistent with those measured in the previous
section. The slight shift in copy latencies may be due to an overlooked difference
between the C++ test program in the previous section, and the copies performed
for offloading Java. Recalculating the rates in Figure 4.3 using the new values of
ld and lh gives values that coincide with the rates calculated here. An indication
of the quality of the fits is given by the graphs in Figure 4.4.
The benchmark timings also give an indication of the execution speedup. The
results are summarised in Table 4.5. The performance when executed on the CPU
was the same for all versions.

The baseline benchmark is encouraging as it shows that even when little computation is performed on the GPU, the overhead associated with transferring data
to the graphics card is not prohibitive. The statics version performs similarly, as
would be expected, with a slightly improved speedup possibly due to the array
pointer being held in constant memory which can be cached (see Section 2.6.2).
When an object array is considered, the overheads (although vastly improved
by using single memory allocation) make offloading to the GPU impractical. The
50 CHAPTER 4. EVALUATION
            Import                   Export
Baseline    lg + lr                  lf + lw + (8 + 8N)tw
Statics     lg + lr + ls             lf + lw + (8 + 8N)tw
Objects     lg + 2N lr               lf + 2N lw + (8 + 16N)tw
2D          lg + 2√N lr              lf + 2√N lw + (8 + 8√N + 8N)tw

            Copy On (for Copy Off replace ld and td with lh and th)
Baseline    ld + (28 + 8N)td
Statics     ld + (28 + 8N)td
Objects     2 ld + (28 + 32N)td
2D          (1 + √N) ld + (28 + 28√N + 8N)td

Table 4.3: Model of overheads for component benchmark versions.
            lg           lr           ls           ld           lh
ms          7.37 × 10⁻³  3.76 × 10⁻⁴  2.08 × 10⁻²  1.05 × 10⁻²  1.04 × 10⁻²

            lf           lw
ms          1.43 × 10⁻¹  2.40 × 10⁻⁴

            td,small     td,large     th,small     th,large     tw
ms/byte     9.56 × 10⁻⁷  6.82 × 10⁻⁷  1.75 × 10⁻⁶  1.23 × 10⁻⁶  1.97 × 10⁻⁹

Table 4.4: Model parameters, as measured using component benchmarks.
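As a check on the arithmetic, the formulas of Table 4.3 can be evaluated directly with these parameters. The sketch below predicts the total overhead for the Baseline version; it uses the large-copy rates for td and th, since at this N the arrays are megabytes in size.

```java
// Predicted total overhead (ms) for the Baseline version, following Table 4.3
// with the measured parameters of Table 4.4 (large-copy rates for td and th).
public class OverheadModel {
    static final double LG = 7.37e-3, LR = 3.76e-4;   // import latencies (ms)
    static final double LF = 1.43e-1, LW = 2.40e-4;   // export latencies (ms)
    static final double LD = 1.05e-2, LH = 1.04e-2;   // copy latencies (ms)
    static final double TD = 6.82e-7, TH = 1.23e-6;   // large-copy rates (ms/byte)
    static final double TW = 1.97e-9;                 // export write rate (ms/byte)

    static double baselineOverhead(double n) {
        double importTime = LG + LR;
        double exportTime = LF + LW + (8 + 8 * n) * TW;
        double copyOn     = LD + (28 + 8 * n) * TD;
        double copyOff    = LH + (28 + 8 * n) * TH;
        return importTime + exportTime + copyOn + copyOff;
    }

    public static void main(String[] args) {
        // For N = 10^6 doubles this predicts roughly 15.5 ms of overhead,
        // dominated by the two device copies.
        System.out.printf("%.2f ms%n", baselineOverhead(1e6));
    }
}
```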
[Figure: a 4 × 4 grid of Time-against-N plots; rows Baseline, Statics, Objects and 2D; columns Import, Copy On, Copy Off and Export.]

Figure 4.4: Fit of model (green) to component benchmarks.
Version     Execute Only    Inc. Overheads
Baseline    192             40
Statics     239             41
Objects     220             0.18
2D          229             22

Table 4.5: Speedup factors for the component benchmarks.
[Figure: four Time (ms)-against-N plots, one each for Import, Copy On, Copy Off and Export.]

Figure 4.5: Fit of model to Fourier Series benchmark, using previously calculated parameters.
inaccuracy of the model during the import stage may be due to unexpected
overheads associated with the map used for listing references. Further work is
needed to isolate this and make suitable improvements.
The overheads in the two-dimensional case are also much reduced by the
single memory allocation and this improves the overall speedup from 5.6 to 22.
4.2.3 Java Grande Benchmark Suite [7]
The Java Grande benchmark suite was used as a source of external unbiased code
that could be passed to the compiler. The sequential code was annotated and
fed to the compiler. Timings were then compared between the GPU and original
versions.
A full description of the suite is given in Appendix D—including an explana-
tion of which benchmarks were used. Here I give the results of the Series and
Crypt benchmarks, relating these to the hypothesised overheads model, and also
the hardware characteristics of CUDA.
Series: Fourier Series of (x + 1)^x
This benchmark exhibited the biggest speedup factor (187 overall). The break-
down of the execution time shows that only 0.5% of the GPU time was due to
overheads. These overheads were generally in agreement with those predicted
using the parameters measured in the previous section (Figure 4.5).
[Figure: four Time (ms)-against-N plots, one each for Import, Copy On, Copy Off and Export.]

Figure 4.6: Fit of model to Mandelbrot benchmark, using previously calculated parameters.
Crypt: IDEA Encryption/Decryption
Whilst only using integer operations, the graphics processor execution still
achieves a significant speedup factor (8.7). This is again helped by the relatively
small amount of data required for computation. The lower factor is probably due
to the CPU performing better on integer benchmarks.
4.2.4 Mandelbrot Set Computation
The Mandelbrot set is defined as the set of complex values c such that the absolute
value of zn remains bounded for any value of n, where zn is defined as:
    z_n = 0                  if n = 0
    z_n = z_{n-1}^2 + c      if n > 0        (4.5)
For computation, we must define a limit on the size of n (the iteration limit)
and also a bound on values of zn. Here the bound is set as 4.0 (as used in the
original code). The iteration limit means that it is possible to vary the amount
of computation performed on the data, altering the significance of the overheads.
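The per-pixel escape-time loop at the heart of the benchmark has the following shape. This is a sketch: the names are illustrative, and the bound of 4.0 is applied to |z|², as in the common formulation, which avoids a square root.

```java
public class Mandel {
    // Escape-time iteration for one pixel: counts steps until |z|^2 exceeds
    // the bound of 4.0, up to the iteration limit.
    static int escapeTime(double cr, double ci, int iterationLimit) {
        double zr = 0.0, zi = 0.0;
        int n = 0;
        while (n < iterationLimit && zr * zr + zi * zi <= 4.0) {
            double t = zr * zr - zi * zi + cr;  // real part of z^2 + c
            zi = 2.0 * zr * zi + ci;            // imaginary part of z^2 + c
            zr = t;
            n++;
        }
        return n;
    }
}
```

A pixel inside the set runs all the way to the iteration limit, while one far outside exits after a handful of steps, so neighbouring pixels can take very different numbers of iterations.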
Again, the measured parameters from Section 4.2.2 were used to predict the
overheads, giving very accurate results (Figure 4.6). This demonstrates that the
model does not suffer from overfitting.
Turning to the speedup during the actual execution portion, I first consider
the case where the iteration limit is fixed at 250 (as in [16]) and the grid size is altered. Figure 4.7 plots the speedup achieved on the execute portion, and also
the overall speedup when overheads are included. The reason the execute-only
speedup is lower in this benchmark could be due to the effect of thread divergence
(described in Section 2.6.1). This means that the calculation for each pixel takes
as long as the ‘slowest pixel’ in its warp.
Similarly, the variation in speedup can be investigated as the iteration limit is
altered. This is done for a fixed size computation (8000 × 8000 grid) and plotted
[Figure: speedup factor (Execute Only and Overall) and % overhead plotted against N, for N up to 1.2 × 10⁴.]

Figure 4.7: Speedups and overhead for Mandelbrot benchmark with fixed iteration limit (250).
in Figure 4.8. Since the overheads are fixed, they become less significant as the
number of iterations rises, with the overall speedup tending towards the execute
speedup.
4.2.5 Conway’s Game of Life
Conway’s Game of Life is a cellular automaton. The evolution of each cell in a
2D grid requires independent computation (see Appendix D for details).
The simulation of such a ‘game’ provides an interesting benchmark for parallel
computing, since there is a trade-off between the naïve computation that is easy to
parallelise, and more sophisticated algorithms that are less suited. In particular,
I will consider the Hashlife implementation [12] that accelerates simulation by
recording the evolution of subgrids to avoid later recomputation.
As shown in Figure 4.9, the naïve algorithm running on the GPU in fact runs
slower than on the CPU. Both are much slower than Hashlife. One reason for
this is that all data is copied back and forth from the graphics card on each
iteration, even though the data is not used by the host in between each kernel
[Figure: speedup factor (Execute Only and Overall) and % overhead plotted against the iteration limit, up to 1.2 × 10³.]

Figure 4.8: Speedups and overhead for Mandelbrot benchmark with fixed grid size (8000 × 8000).
invocation. Other work [16] introduces multi-pass loops, where the loop body
only consists of GPU code, allowing data to be left on the GPU. In the case of
this specific benchmark, a more advanced approach would be needed, since a new
array is used for each iteration rather than double buffering.
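For reference, a double-buffered structure, which multi-pass loop support could exploit, looks like the sketch below. The kernel here is a trivial stand-in; the real benchmark's Life update would take the place of computeGeneration.

```java
// Double buffering: two fixed arrays are swapped each generation, so with
// multi-pass loop support both buffers could stay resident on the GPU and
// no per-iteration host copies would be needed.
public class DoubleBuffer {
    // Stand-in for the Life kernel: here it simply inverts every cell.
    static void computeGeneration(boolean[][] in, boolean[][] out) {
        for (int y = 0; y < in.length; y++)
            for (int x = 0; x < in[y].length; x++)
                out[y][x] = !in[y][x];
    }

    static boolean[][] run(boolean[][] initial, int generations) {
        boolean[][] current = initial;
        boolean[][] next = new boolean[initial.length][initial[0].length];
        for (int g = 0; g < generations; g++) {
            computeGeneration(current, next);   // one kernel launch
            boolean[][] tmp = current;          // swap the buffers: no copying
            current = next;
            next = tmp;
        }
        return current;
    }
}
```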
A second issue is the manner in which a cell’s neighbours are counted. Since
the world is stored as an array of booleans, there is an if...else control flow
structure for each neighbour. This suggests that the execution may be suffering
from thread divergence.
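A way to see the problem, and a possible mitigation, is sketched below; the names are illustrative, and the boolean form is the representation the benchmark actually uses. Counting from a 0/1 integer grid replaces the data-dependent branch with an addition, which cannot diverge.

```java
public class NeighbourCount {
    // Boolean world: one data-dependent branch per neighbour, so threads in
    // a warp whose cells differ will diverge.
    static int countBranchy(boolean[][] world, int x, int y) {
        int count = 0;
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++) {
                if (dx == 0 && dy == 0) continue;   // uniform branch: no divergence
                if (world[y + dy][x + dx]) count++; // data-dependent branch
            }
        return count;
    }

    // 0/1 world: the data-dependent branch becomes an addition.
    static int countBranchless(int[][] world, int x, int y) {
        int count = 0;
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++)
                count += world[y + dy][x + dx];
        return count - world[y][x];                 // remove the cell itself
    }
}
```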
4.2.6 Summary
These results show that significant performance improvements are possible over
a range of benchmarks. Whilst accurate predictions of execute speedups have
not been possible, the factors measured are consistent with those expected given
the number of cores on the GPU and also the number of double precision units
available. Both the execute and overall speedups for each benchmark are sum-
marised in Table 4.6. These are combined using the geometric mean (see [9] for
[Figure: overall simulation times (ms) for GPU and CPU, plotted against grid size (up to 20000) and generations (10 to 10000).]

Figure 4.9: Overall times for simulation of Conway's Game of Life.
Benchmark                        Double Precision    Execute    Overall
Baseline                         ✓                   192        40
Statics                          ✓                   239        41
Objects                          ✓                   220        0.18
2D                               ✓                   229        22
Mandelbrot (250 iterations)      ×                   83         39
Mandelbrot (8000 × 8000 grid)    ×                   106        79
Life                             ×                   –          –¹
Series                           ✓                   189        187
Crypt                            ×                   –          8.7
Geometric mean                                       182.4      20.6

Table 4.6: Summary of speedup factors.
reasons why this is appropriate) to give an average speedup factor of 20.6.
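The figure can be reproduced from the overall column of Table 4.6 (Life excluded, since no single factor was representative):

```java
public class SpeedupSummary {
    // Geometric mean via logarithms, which avoids overflow for large products.
    static double geometricMean(double[] xs) {
        double logSum = 0.0;
        for (double x : xs) logSum += Math.log(x);
        return Math.exp(logSum / xs.length);
    }

    public static void main(String[] args) {
        // Overall factors: Baseline, Statics, Objects, 2D, Mandelbrot (250
        // iterations), Mandelbrot (8000 x 8000), Series, Crypt.
        double[] overall = {40, 41, 0.18, 22, 39, 79, 187, 8.7};
        System.out.printf("%.1f%n", geometricMean(overall)); // prints 20.6
    }
}
```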
The overheads model has also been evaluated, with the parameters measured
from the component benchmarks giving accurate predictions of the overheads in
other cases. However, some aspects are not fully understood (i.e. GPU behaviour
for small copies, and object import time).
¹The Life speedup factors were all very low (< 1) but varied considerably. Therefore, there was not a suitable single value.
Benchmark                       Series (Floating Point)         Crypt (Integer)
Data Size                       10⁴      10⁵      10⁶           3·10⁶    2·10⁷    5·10⁷
CPU on bing (ms)                17971    182894   2878469       414      2190     5344
This Project on bing (ms)       99       968      9358          41       245      545
JCUDA on Tesla C1060 (ms)       110      1040     10140         20       160      450

Table 4.7: Comparison of Java Grande benchmark timings with JCUDA.
4.3 Accuracy of Dependency Analysis
Using the same gold standards as were used for testing (i.e. @Parallel annota-
tions), it was possible to measure the accuracy of the automatic analysis. This
showed that an accuracy of 85% (29/34) was achieved for the range of benchmarks.
In cases where the check was too conservative, the behaviour could be explained
by the may-alias and checking algorithms.
4.4 Comparison with Existing Work
This project’s approach was compared with other related work in Section 1.3. In
terms of performance, published results allow some quantitative comparisons to
be made regarding the speedups achieved. Unfortunately, the JikesRVM work
[16] uses a much older card (GeForce 7800), so its results are not directly comparable. JCUDA [25]
uses a similar card (NVIDIA Tesla C1060, 1.3GHz, 240 cores) to that of bing
(NVIDIA GTX 260, 1.24GHz, 216 cores). Their work ports the Java Grande
benchmarks [7] to C++ so that the GPU performance can be compared to that
of raw Java. My results for the Series and Crypt benchmarks (Section 4.2.3)
are broadly similar, as shown in Table 4.7.
Turning to the automatic dependency analysis, neither [16] nor javab [4]
give accuracy figures for their analyses (javab instead compares the number of
parallelisable loops with the total). However, it would be expected that the
approach of [16] could use runtime information to produce more accurate results
than either this work or javab.
4.5 Summary
In this chapter, three key aspects of the project have been evaluated. First,
tests were used to demonstrate compiler correctness within the required scope.
A model for overheads was then developed and tested. It was found to be accurate with large copies, but the bandwidth to the card behaved in a manner not
fully understood with small copies. The modelling also indicated a significant im-
provement that could be made to the compiler. Execution speedups were found
to be in line with what would be expected based on the hardware architecture.
Finally, investigations were made into the accuracy of the automatic analysis.
Some quantitative comparisons with existing work have also been made, adding
to those in Section 1.3.
CHAPTER 5
Conclusions
This dissertation has highlighted the key aspects of the project and the compiler
that it produced. It has explained the existing work and knowledge that was used
(Chapter 2), and how this allowed a novel compiler to be developed (Chapter 3).
Evaluation of the compiler (Chapter 4) has shown it to both maintain correctness
and provide significant speedups in the majority of sample cases. This chapter
assesses the project formally with respect to its goals, and suggests future work
to improve the compiler.
5.1 Comparison with Requirements
Ultimately, the project should be judged by whether it meets the requirements
that were elaborated from the project proposal in Section 2.1. The evaluation
allows each of these to be considered in this section.
The tests that were carried out during the project (Section 4.1) showed that
the compiler maintained correctness whenever it succeeded in compiling. A
marginal case is exhibited when the graphics processor's memory is exceeded: this
causes the JVM to exit gracefully with a suitable error message. Use of recursion
within parallel loops is a notable case where compilation fails, due to restrictions of CUDA. This evidence demonstrates that the project meets Requirement C1.
The various performance benchmarks that have been evaluated (Section 4.2)
show a clear benefit from using the compiler, satisfying Requirement C2.
The annotations that the compiler uses to assess code (@Parallel and
@Restrict) are both unobtrusive and transparent to the standard Java com-
piler. Transparency allows source code containing these annotations to be built
normally for environments without compatible GPUs. The annotations also allow explicit marking of parallel for loops of multiple dimensions. Therefore,
Requirement C3 is met. The nature of @Parallel also means that the loop
bound detection extension (E1) was fully implemented.
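One plausible form of such a marked loop nest is sketched below. The @Parallel annotation here is declared locally as a stand-in, and placing it on the loop index declarations is an assumption: the compiler's actual annotation class and placement rules may differ.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Target;

public class Scale {
    // Stand-in for the compiler's annotation; Java annotations attach to
    // declarations, so here the loop index variables carry the marking.
    @Target(ElementType.LOCAL_VARIABLE)
    @interface Parallel {}

    // A two-dimensional loop nest marked for offloading: each (y, x)
    // iteration is independent, so the nest maps onto a 2D GPU kernel.
    static void scale(double[][] grid, double factor) {
        for (@Parallel int y = 0; y < grid.length; y++)
            for (@Parallel int x = 0; x < grid[y].length; x++)
                grid[y][x] *= factor;
    }
}
```

Because the annotations are invisible to a standard Java compiler at runtime, the same source still builds and runs unmodified on machines without a compatible GPU.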
The scope of code that can be compiled for GPU execution meets the re-
quirements of C4, although recursive code cannot be used due to restrictions in
current GPU architectures. Extension E4 for support of objects has also been
completed up to the limits of the architecture (i.e. no inheritance or allocation).
The compiler provides user feedback, giving reasons whenever parallel com-
pilation fails. This avoids unexplained performance changes when utilising the
automatic dependency analysis, satisfying C5.
The sample code (Appendix D) used in the evaluation has been fully de-
scribed, meaning that all claims can be checked objectively. This fulfils the final
core requirement, C6.

The implementation of simple automatic dependency checking (Section 3.6.2)
means that Extension E2 has also been completed.
5.2 Future Work
There are many additions and improvements that could be made to further de-
velop the compiler, of which I describe a few here.
5.2.1 Further Hardware Support

With the release of NVIDIA's new Fermi cards [21] and CUDA 3.0, it is now
possible to provide a more complete set of features for GPU execution, including
recursion and more complete object support. While recursion would be supported
automatically via nvcc, some features would require more work. Support for
allocations might be possible in some cases by pessimistically allocating space for
all possible allocations, and then freeing unused blocks after the kernel invocation.
Support for multiple graphics cards would also be useful. However, exporting
arrays back to Java, after different portions have been modified on different cards,
may cause difficulties, and extra overheads.
5.2.2 Further Optimisations
There is certainly scope for further transformations within the compiler to im-
prove performance. For example, when copying objects onto the device, it makes
sense only to copy fields that will be used. As mentioned in the original extensions, there may also be optimisations that neither nvcc nor the JVM can
perform, such as loop invariant code motion (see Section 2.1, E5).

    for (int i = 0; i < length; i++) {
        if (arr[i] < minimum) minimum = arr[i];
    }

(a) Sequential min

[Diagram: a balanced tree combining values (0.3, 0.2, 1.7, −3.1, ...) pairwise with < comparisons.]

(b) Parallel reduction

Figure 5.1: Minimum finding algorithms.
As exhibited in the Game of Life benchmark (Section 4.2.5), support for multi-
pass loops (as implemented in [16]) could also improve performance dramatically
in some iterative algorithms.
5.2.3 Further Automatic Detection
Given the undecidability of the automatic parallelisation, there will always be
scope for introduction of more accurate and sophisticated tests. However, an
alternative might be to leave a CPU version of the code in the class, selecting
which to use at runtime. This could be based, not just on correctness, but also
on whether the number of iterations justify the expected overheads.
There is also the potential for ‘pattern matching’ transformations to yield
significant benefits (albeit in a limited number of cases). For example, common
implementations of minimum, maximum and sum (Figure 5.1a) are not suitable
for parallel execution, however, the solution can be sped up using parallel reduc-
tion (Figure 5.1b).
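The reduction of Figure 5.1b combines candidates pairwise, so n values need only about log₂ n parallel steps. The sketch below models the access pattern sequentially in Java; on a GPU each inner loop would run as one parallel pass.

```java
public class ReduceMin {
    static double min(double[] arr) {
        double[] a = arr.clone();               // work on a copy
        for (int stride = 1; stride < a.length; stride *= 2) {
            // One parallel pass: each iteration is independent of the others,
            // so on a GPU these would be separate threads.
            for (int i = 0; i + stride < a.length; i += 2 * stride)
                a[i] = Math.min(a[i], a[i + stride]);
        }
        return a[0];                            // the surviving minimum
    }
}
```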
5.3 Final Conclusions

Overall, I believe that the compiler is able to offer a higher level of abstraction
than other attempts (Section 1.3) without sacrificing performance (Section 4.4).
Bibliography
[1] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles,
Techniques, & Tools. Addison-Wesley, second edition, 2007.
[2] B. Alpern, A. Cocchi, D. Lieber, M. Mergen, and V. Sarkar. Jalapeño:
a compiler-supported Java virtual machine for servers. In Workshop on
Compiler Support for Software System (WCSSS 99), volume 14, pages 87–
94. Citeseer, 1999.
[3] B. Amedro, V. Bodnartchouk, D. Caromel, C. Delbe, F. Huet, and
G. Taboada. Current State of Java for HPC. Technical Report RT-0353,
INRIA, 2008.
[4] A. Bik and D. Gannon. javab - A prototype bytecode parallelization tool. In
ACM Workshop on Java for High-Performance Network Computing , 1998.
[5] B. Boehm. A spiral model of software development and enhancement. SIG-
SOFT Softw. Eng. Notes, 11(4):14–24, 1986.
[6] E. Bruneton, R. Lenglet, and T. Coupaye. ASM: a code manipulation tool to
implement adaptable systems. Adaptable and extensible component systems,
2002.
[7] J. M. Bull, L. A. Smith, M. D. Westhead, D. S. Henty, and R. A. Davey.
A methodology for benchmarking Java Grande applications. In JAVA ’99:
Proceedings of the ACM 1999 conference on Java Grande, pages 81–88, New
York, NY, USA, 1999. ACM.
[8] L. Damas and R. Milner. Principal type-schemes for functional programs. In
POPL ’82: Proceedings of the 9th ACM SIGPLAN-SIGACT symposium on
Principles of programming languages, pages 207–212, New York, NY, USA,
1982. ACM.
[9] P. Fleming and J. Wallace. How not to lie with statistics: the correct way
to summarize benchmark results. Communications of the ACM , 29(3):221,
1986.
[10] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design patterns: elements
of reusable object-oriented software. Addison-Wesley Reading, MA, 1995.
[11] M. Gardner. Mathematical games: The fantastic combinations of John Conway's new solitaire game 'Life'. Scientific American, 223(4):120–123, 1970.
[12] R. Gosper. Exploiting regularities in large cellular spaces. Physica D Non-
linear Phenomena , 10:75–80, 1984.
[13] G. A. Kildall. A unified approach to global program optimization. In POPL
’73: Proceedings of the 1st annual ACM SIGACT-SIGPLAN symposium on
Principles of programming languages, pages 194–206, New York, NY, USA,
1973. ACM.
[14] A. Klöckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih. Py-
CUDA: GPU Run-Time Code Generation for High-Performance Computing.
Arxiv preprint arXiv:0911.3456 , 2009.
[15] W. Landi. Undecidability of static analysis. ACM Letters on Programming
Languages and Systems, 1(4):323–337, 1992.
[16] A. Leung, O. Lhoták, and G. Lashari. Automatic parallelization for graph-
ics processing units. In Proceedings of the 7th International Conference on
Principles and Practice of Programming in Java , pages 91–100, 2009.
[17] J. Lewis and U. Neumann. Performance of Java versus C++. Computer
Graphics and Immersive Technology Lab, University of Southern California,
January 2003 (updated 2004).
[18] S. Liang. Java Native Interface 6.0 Specification . Sun, 1999.
[19] T. Lindholm and F. Yellin. The Java(TM) Virtual Machine Specification (2nd Edition). Prentice Hall, 1999.
[20] NVIDIA. Compute Unified Device Architecture Programming Guide, August 2009. Version 2.3.1.
[21] NVIDIA. Fermi: NVIDIA’s Next Generation CUDA Compute Architecture.
White paper, October 2009.
[22] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. Lefohn, and
T. Purcell. A survey of general-purpose computation on graphics hardware.
In Computer Graphics Forum , volume 26, pages 80–113, 2007.
[23] L. Smith and M. Bull. Java for High Performance Computing.
[24] H. Sutter. The free lunch is over: A fundamental turn toward concurrency
in software. Dr Dobb’s Journal , March 2005.
[25] Y. Yan, M. Grossman, and V. Sarkar. JCUDA: A Programmer-Friendly
Interface for Accelerating Java Programs with CUDA. In Proceedings of the
15th International Euro-Par Conference on Parallel Processing , 2009.
APPENDIX A
Dataflow Convergence Proofs
In this appendix, proofs are given for the convergence of the iterative computation
of the various dataflow analyses, as described in Sections 2.7.1 to 2.7.4.
A.1 General Dataflow Analysis
As stated in Section 2.7.1, for an analysis over the complete lattice (X, ⊑), with
transfer function F b : X → X, convergence is guaranteed if (X, ⊑) is of finite
height and F b is monotone.
The proof is based on that given in [1, pp. 627 to 628], but is adjusted so
that each calculation makes use of the latest result, rather than always looking
to the previous iteration.
Definition 11. A function F : X → X is monotone if a ⊑ b =⇒ F (a) ⊑ F (b).
Theorem 2. If F b (for all b) is monotone and the lattice is of finite height, then
the dataflow analysis converges.
Proof. For all b, we consider the value of R(b) on the ith iteration (i.e. Ri(b)). If
we can show that ∀b. Ri(b) ⊑ Ri+1(b), then the iterative calculation must converge,
since we must either reach a fixed point, or all R(b) will eventually equal the upper
bound on the lattice (since the lattice has finite height, there are no infinite
chains).

For the case where children(b) = ∅, Ri(b) is constant, so trivially Ri(b) ⊑ Ri+1(b). We consider the other cases by induction.

Base Case: Since we initialise R0(b) as ⊥, no matter what value R1(b) takes,
we have that R0(b) ⊑ R1(b).
Induction Step: Now we consider Ri+1(b), assuming ∀x. (Ri−1(x) ⊑ Ri(x)).
Without loss of generality, we can also presume that there is an ordering of
calculations within the iteration, although this is not necessarily the same for
all iterations. We denote the set of blocks or instructions calculated before b as
calc(b). Using an inner induction proof, we can now show that ∀b. Ri(b) ⊑ Ri+1(b).

Inner Base Case: For calc(b) = ∅, the value of Ri+1(b) is calculated as:

    Ri+1(b) = F b ( ⊔_{c ∈ children(b)} Ri(c) )

We also know that Ri(b) was calculated as:

    Ri(b) = F b ( ⊔_{c ∈ children(b)} R∗(c) )

where each R∗(c) is either Ri(c) or Ri−1(c), depending on whether c was calculated
before b within iteration i. By assumption and reflexivity, we have:

    ∀c ∈ children(b). Ri−1(c) ⊑ Ri(c) and Ri(c) ⊑ Ri(c), so R∗(c) ⊑ Ri(c)

Therefore, since F b and both join and meet¹ are monotone, we have that Ri(b) ⊑ Ri+1(b) if calc(b) = ∅.
Inner Induction Step: Now we assume that ∀x ∈ calc(b). Ri(x) ⊑ Ri+1(x).
In this case, Ri+1(b) is calculated as (using calc(b) as a partition):

    Ri+1(b) = F b ( ( ⊔_{c ∈ children(b) ∩ calc(b)} Ri+1(c) ) ⊔ ( ⊔_{c ∈ children(b) \ calc(b)} Ri(c) ) )

Again since F b and both meet and join are monotone, and also by our assumptions, we have that Ri(b) ⊑ Ri+1(b) if ∀x ∈ calc(b). Ri(x) ⊑ Ri+1(x).

Therefore, using both the inner and outer inductions in turn, we have that
∀i, b. Ri(b) ⊑ Ri+1(b). This proves that the iterative calculation converges.
A.2 Live Variable Analysis
Recall that live variable analysis is performed over the lattice (℘(Vars), ⊆) with
transfer function:
F n(x) = (x \ Write(n)) ∪ Read(n)
¹It is a standard result of lattices that join and meet are monotone.
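As a concrete illustration (not part of the original proofs), consider a node n containing the statement x := y + z, so Write(n) = {x} and Read(n) = {y, z}. Applying the transfer function to a live set {x, w}:

```latex
F_n(\{x, w\}) = (\{x, w\} \setminus \{x\}) \cup \{y, z\} = \{w, y, z\}
```

The write to x kills its liveness before the node, while the reads of y and z generate liveness.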
Theorem 3. Iterative computation of liveness information converges.
Proof. Convergence can be shown with the help of Theorem 2 by showing that
F n is monotone and that the lattice has finite height. This is trivially the case, since F n is the composition of two monotone operations, set minus and set union.
Also, since ℘(Vars) is finite and has top element Vars, the lattice must have finite
height.
A.3 Constant Propagation
Recall that constant propagation is performed over the lattice ({⊥, ⊤} ∪ Constants, ⊑) and transfer function F n,v where:

    x ⊑ y ⇐⇒ (x = ⊥) ∨ (y = ⊤)

    F n,v(x) = c    if n assigns c to v
               ⊤    if n writes a non-constant to v
               x    otherwise
Theorem 4. Iterative computation of constant propagation converges.
Proof. First we show that F n,v is monotone. The definition of F n,v can be considered in two cases. When n writes to v, F n,v is simply a constant function, so
is trivially monotone. Equally, when n does not write to v, F n,v is the identity
function, so is also monotone.
We can also show that the lattice is finite. By definition of ⊑, the only
increasing chains are ⊥ ⊑ ⊤ and, for each c ∈ Constants, ⊥ ⊑ c ⊑ ⊤.
Therefore, as for other dataflow analyses, convergence is guaranteed by these
two properties according to Theorem 2.
APPENDIX B
Code Generation Details
This appendix describes the naming conventions used within the code generation
stage of the compiler.
Within a kernel or method, the temporary variables used are simply named
consecutively (i.e. t1, t2, . . . ). For local variables, it is necessary to append
a type suffix since the same local variable might be used for different types in
different live ranges (as in Example 3.4). This gives names of the form vi_TypeSort for
variable i, where the type sort is any of the primitive types, or a unique number
for reference types.
Kernel launcher methods (i.e. those called as a replacement for the loops) are
named using the hashcode of the internal object representing the kernel (i.e.
kernel_<hashcode>, or kernel_M<-hashcode> if the value is negative). This
gives a unique name amongst the kernels exported, and is unlikely to conflict
with any methods within the original class.
JNI specifies a mangling scheme for converting Java method names to C++
[18, Table 2-1]. This must be adopted for the launcher, but is also used by the
compiler for method and static variable names with altered prefixes in place of
Java_ (Static_ for statics and none for methods). This is necessary to ensure
that there are no naming conflicts (e.g. a naïve approach might result in
ClassX_Test.f() and ClassX.Test_f() both mapping to ClassX_Test_f).
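The escaping rules are what prevent such collisions. A minimal sketch of the JNI-style mangling (following Table 2-1 of the JNI specification, restricted here to names: '.' and '/' map to '_', while a literal '_' is escaped as "_1"):

```java
public class Mangle {
    static String mangle(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (c == '.' || c == '/') sb.append('_');  // package separators
            else if (c == '_') sb.append("_1");        // escape literal underscores
            else sb.append(c);
        }
        return sb.toString();
    }

    static String jniName(String className, String methodName) {
        return "Java_" + mangle(className) + "_" + mangle(methodName);
    }
}
```

With escaping, ClassX_Test.f() mangles to Java_ClassX_1Test_f while ClassX.Test_f() mangles to Java_ClassX_Test_1f, so the two names no longer collide.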
APPENDIX C
Command Line Interface
The command line interface to the compiler has a number of optional arguments
that affect its behaviour. These are shown in the table below:
Option Description
cuda Directory into which the CUDA toolkit was installed,
should contain bin/nvcc.
jdk Directory into which the JDK was installed, should con-
tain an include directory with the JNI header files.
includes Directory in which the compiler's include files are stored (parallel.h et al.).
library Name of the shared library that should be generated by
the compiler (defaults to libparallel).
classpath Paths (separated by :) in which input classes can be
found.
output Output directory for the shared library and modified
class files.
log Log level for feedback, accepting each of the Log4J pos-
sibilities.
detect Dependency checking method. This can be either manual
(default), auto or combined.
generate When specified, the shared library is not compiled and
the C++ code is saved.
nonportable Allows bytecode from the core Java class library to be
compiled onto the GPU. This may allow more code to be
compiled, but is not portable between library versions.
Below an example of the compiler output is given, for automatic detection,
with the logging level set to INFO:
bing:dist$ java -jar Parallel.jar --log info --detect auto samples.Mandelbrot
INFO [core]: Considering samples/Mandelbrot.<init>(I)V
INFO [core]: Considering samples/Mandelbrot.compute()V
INFO [loops.detect]: Natural loop on line 62.
INFO [loops.detect]: Natural loop on line 63.
INFO [loops.detect]: Natural loop on line 73.
INFO [loops.trivialise]: Loop has multiple exit points (line 73).
INFO [loops.trivialise]: Trivial loop found (line 62): y#1 (I) <
READ ->samples/Mandelbrot.height [I] {y#1 (I)=1}
INFO [loops.trivialise]: Trivial loop found (line 63): x#2 (I) <
READ ->samples/Mandelbrot.width [I] {x#2 (I)=1}
INFO [check.Basic]: Accepted loop (line 62) based on basic test.
INFO [check.Basic]: Accepted loop (line 63) based on basic test.
INFO [extract]: Kernel of 2 dimensions extracted (line 63).
INFO [extract]: Copy In: [Var#0 (Lsamples/Mandelbrot;), Var#2 (I), Var#1 (I)]
INFO [extract]: Copy Out: [Var#0 (Lsamples/Mandelbrot;)]
INFO [core]: Considering samples/Mandelbrot.main([Ljava/lang/String;)V
INFO [core]: Considering samples/Mandelbrot.output(Ljava/io/File;)V
INFO [loops.detect]: Natural loop on line 90.
INFO [loops.detect]: Natural loop on line 89.
INFO [loops.trivialise]: Trivial loop found (line 89): y#4 (I) <
READ ->samples/Mandelbrot.height [I] {y#4 (I)=1}
INFO [loops.trivialise]: Trivial loop found (line 90): x#5 (I) <
READ ->samples/Mandelbrot.width [I] {x#5 (I)=1}
INFO [check.Basic]: Alias analysis not accurate enough to judge loop (line 89).
INFO [check.Basic]: Alias analysis not accurate enough to judge loop (line 90).
INFO [core]: Considering samples/Mandelbrot.<init>(II)V
INFO [core]: Considering samples/Mandelbrot.run(I)J
APPENDIX D
Sample Code Used
This appendix gives further details on the sample code used in the evaluation.
D.1 Java Grande Benchmark Suite [7]
The suite is split into 3 distinct sections. The first concentrates on testing the
performance of “low level operations” such as arithmetic, and is not relevant to
this project. The second provides 7 kernel benchmarks, while the third concen-
trates on larger scale applications. A summary of the Section 2 benchmarks
available¹ is given in Table D.1.
Benchmarks that could not be parallelised through use of parallel for loops
were not considered, since the goal was to use unmodified code.
¹Version 2.0 of the sequential suite was used.
Benchmark   Description                      Used
Series      Fourier coefficient analysis.    ✓
LUFact      LU factorisation.                ×
SOR         Successive over-relaxation.      ×
HeapSort    Integer sorting.                 ×
Crypt       IDEA encryption.                 ✓
FFT         Fast Fourier transform.          ×
Sparse      Sparse matrix multiplication.    ×

Table D.1: Summary of Section 2 of the Java Grande Benchmark Suite.
Figure D.1: 3 generations of the Game of Life.
D.2 Mandelbrot Computation
A brief description of the Mandelbrot set is given in Section 4.2.4. The routine
used is from The Computer Language Benchmarks Game². Whilst these benchmarks
are now considered a poor way of comparing the performance of languages,
they are still valid for comparing the performance of different compilers (or
runtimes) for a single language.

The only modification made to the source code was to re-express the
do { ... } while(...); loop as a standard while(...) { ... } loop. This
allows trivialisation of the loop.
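As a minimal illustration of this rewrite (hypothetical variable names, not the benchmark's actual code), a do-while can be converted to a while loop by peeling the guaranteed first iteration, which exposes a standard test-before-body loop header that the trivialiser can recognise:

```java
public class LoopRewrite {
    // Original form: a do-while always executes the body at least once.
    static int iterateDoWhile(int limit) {
        int i = 0;
        do {
            i++;
        } while (i < limit);
        return i;
    }

    // Rewritten form: the first iteration is peeled out before a standard
    // while loop, preserving the do-while semantics exactly.
    static int iterateWhile(int limit) {
        int i = 0;
        i++;                  // peeled first iteration
        while (i < limit) {
            i++;
        }
        return i;
    }
}
```

In the Mandelbrot routine the loop condition always holds on entry, so the peeled iteration simply folds back into the loop.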
D.3 Conway’s Game of Life
Conway’s Game of Life is a cellular automaton. The evolution of each cell in a
2D grid is described by three simple rules (quoted from [11]), considered with
respect to the cell’s eight neighbours (an example evolution is given in Figure
D.1):
1. Survivals: “Every counter with two or three neighboring counters survives
for the next generation.”
2. Deaths: “Each counter with four or more neighbors dies (is removed)
from overpopulation. Every counter with one neighbor or none dies from
isolation.”
3. Births: “Each empty cell adjacent to exactly three neighbors – no more,
no fewer – is a birth cell. A counter is placed on it at the next move.”
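The three rules reduce to a simple function of a cell's current state and its live-neighbour count. A sketch (illustrative only, not the course code used in the evaluation):

```java
public class LifeRule {
    // Returns the next state of a cell given its current state and the
    // number of live cells among its eight neighbours.
    static boolean nextState(boolean alive, int liveNeighbours) {
        if (alive) {
            // Survival: two or three neighbours; otherwise death from
            // isolation (0-1 neighbours) or overpopulation (4+).
            return liveNeighbours == 2 || liveNeighbours == 3;
        } else {
            // Birth: exactly three neighbours.
            return liveNeighbours == 3;
        }
    }
}
```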
The source code used for both the naïve algorithm and Hashlife is that developed
by Dr Andrew Rice for use in a Java programming course³.

² http://shootout.alioth.debian.org/
³ http://www.cl.cam.ac.uk/teaching/0809/ProgJava/
APPENDIX E
Testing Gold Standards
The gold standard for loop trivialisation is given in the table below. Similar style
checks were made for both loop detection and kernel extraction.
Sample                Details of Trivial Loops

Component Benchmarks
Base (Trigonometry)   27 (i < nums.length, i=+1), 34 (i < nums.length, i=+1),
                      43 (j < nums.length, j=+1)
Static (Statics)      28 (i < nums.length, i=+1), 35 (i < nums.length, i=+1),
                      44 (j < nums.length, j=+1)
Objects (Objects)     33 (i < nums.length, i=+1), 40 (i < nums.length, i=+1),
                      49 (j < nums.length, j=+1)
2D (MultiDimension)   27 (k < nums.length, k=+1), 28 (l < nums[0].length, l=+1),
                      36 (k < nums.length, k=+1), 37 (l < nums[0].length, l=+1),
                      47 (i < nums.length, i=+1), 48 (j < nums[0].length, j=+1)

Java Grande Benchmarks
JGFCryptBench         48 (i < array_rows, i=+1)
IDEATest              115 (i < 8, i=+1), 130 (j < array_rows, j=+1),
                      154 (k < 52, k=+1), 157 (k < 8, i=+1),
                      174 (i < 52, i=+1), 222 (i < 7, i=+1),
                      273 (i < text1.length, i=+8, i1=+8, i2=+8),
                      291 (r != 0, j=-1)
JGFSeriesBench        56 (i < 4, i=+1), 57 (j < 2, j=+1)
SeriesTest            103 (i < array_rows, i=+1), 169 (nsteps > 0, nsteps=-1)
Mandelbrot            62 (y < height, y=+1), 63 (x < width, x=+1),
                      89 (y < height, y=+1), 90 (x < width, x=+1)
ReverseArray          16 (i < 3, i=+1), 20 (j < 3, j=+1), 24 (i < 3, i=+1)
The majority of benchmarks tested a range of the code generation features.
Since many benchmarks were represented by an object at the top level, this
immediately tested object support. However, several benchmarks were used for
ensuring test coverage of other features:
Statics Tested support for static class fields.
MultiDimension Tested support for arrays, and arrays of arrays.
ReverseArray Tested support for manipulation of references on the GPU.
Objects Tested support for full use of objects, involving modification of multiple
classes and invoking instance methods.
Testing of the automatic dependency analysis could be done against the
@Parallel annotations that were already in place to mark parallel loops.
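For reference, the @Parallel annotation names the loop indices that may be run in parallel (the `loops` element matches the one read by AnnotationCheck in Appendix G). The self-contained sketch below re-declares a stand-in annotation and reads it back via reflection; it is illustrative, not the project's tools.Parallel or its test harness:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class ParallelDemo {
    // Stand-in for tools.Parallel: marks which loop indices in the
    // annotated method may be run in parallel.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Parallel {
        String[] loops();
    }

    @Parallel(loops = {"x", "y"})
    static void render(int width, int height) {
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // ... per-pixel work, independent across iterations ...
            }
        }
    }

    // Reads the annotation back, much as a dependency check would.
    static String[] markedLoops(String method) {
        try {
            Parallel p = ParallelDemo.class
                .getDeclaredMethod(method, int.class, int.class)
                .getAnnotation(Parallel.class);
            return p == null ? new String[0] : p.loops();
        } catch (NoSuchMethodException e) {
            return new String[0];
        }
    }
}
```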
APPENDIX F
Class Index
SLOC¹  Class Name                          Relevant Sections
105 analysis.dataflow.Dataflow
57 analysis.dataflow.ReachingConstants 2.7.4
153 analysis.dataflow.LiveVariable 2.7.3, 3.2.4, 3.3.1
330 analysis.dataflow.AliasUsed 3.3.3, 3.3.4
71 analysis.dataflow.IncrementVariables 3.3.2
131 analysis.dataflow.SimpleUsed 3.3.4
7 analysis.dependency.DependencyCheck 3.6
174 analysis.dependency.BasicCheck 3.6.2
32 analysis.dependency.AnnotationCheck 3.6.1
23 analysis.dependency.CombinedCheck
104 analysis.loops.LoopDetector 2.7.2
141 analysis.loops.LoopTrivialiser 3.4.1
40 analysis.loops.LoopNester 2.7.2
16 analysis.AliasMap
35 analysis.BlockCollector
72 analysis.CanonicalState ‘State’ in 3.3.3
17 analysis.CodeTraverser 3.2.2
25 analysis.InstructionCollector
154 analysis.KernelExtractor 3.5
71 analysis.LooseState ‘LooseState’ in 3.3.3
80 bytecode.AnnotationImporter
119 bytecode.BlockExporter 3.2.3
462 bytecode.InstructionExporter 3.2.3
99 bytecode.ClassImporter
76 bytecode.ClassExporter
626 bytecode.MethodImporter 3.2.3
81 bytecode.ClassFinder
320 cuda.Helper Appendix B
258 cuda.CppGenerator 3.7.1
108 cuda.BlockExporter 3.7.1
182 cuda.CUDAExporter 3.7
40 cuda.Beautifier 3.7
60 debug.ControlFlowOutput e.g. Example 3.8
20 debug.LinePropagator
32 exceptions.UnsupportedInstruction
1108 graph.instructions.* Table 3.1
10 graph.state.State
57 graph.state.ArrayElement
69 graph.state.Variable
75 graph.state.Field
49 graph.state.InstanceField
36 graph.Annotation 3.2
85 graph.BasicBlock 3.2.1
64 graph.Block 3.2.1
22 graph.BlockVisitor 3.2.2
152 graph.ClassNode 3.2
39 graph.CodeVisitor 3.2.2
71 graph.Kernel
29 graph.Loop 3.2.1
111 graph.Method 3.2
123 graph.Modifier 3.2
32 graph.TrivialLoop 3.2.1, 3.4.1
254 graph.Type 3.2.4
36 tools.Benchmark
202 tools.Parallelise 3.8
9 tools.Restrict
10 tools.Parallel
51 util.Utils
28 util.EquatableWeakReference
23 util.ConsList
16 util.MapIterable
25 util.Tree
54 util.WeakList 3.2.1
23 util.TransformIterable
10 parallel.h
50 parallel/launch.h 3.7.2
24 parallel/types.h
212 parallel/memory.h 3.7.3
206 parallel/transfer.h 3.7.3
7686 Total

¹ As calculated by SLOCCount (http://www.dwheeler.com/sloccount/).
APPENDIX G
Source Code Extract

/*
 * Parallelising JVM Compiler
 * Part II Project, Computer Science Tripos
 *
 * Copyright (c) 2009, 2010 - Peter Calvert, University of Cambridge
 */

package analysis.dependency;

import graph.Annotation;
import graph.Method;
import graph.TrivialLoop;
import graph.Type;

import java.util.Collections;
import java.util.List;

import org.apache.log4j.Logger;

/**
 * Checks dependencies based on annotations on the containing method.
 */
public class AnnotationCheck implements DependencyCheck {
  /**
   * Names of loop indices that should be run in parallel in the current
   * context.
   */
  private List<String> loopIndices;

  /**
   * Sets the context in which loops should be considered.
   *
   * @param method Method in which loops that follow are contained.
   */
  @Override
  public void setContext(Method method) {
    Annotation annotation = method.getAnnotation(
      Type.getObjectType("tools/Parallel")
    );

    if (annotation == null) {
      loopIndices = Collections.emptyList();
    } else {
      loopIndices = (List<String>) annotation.get("loops");
    }
  }

  /**
   * Checks whether it is safe to execute the given <code>TrivialLoop</code> in
   * parallel based on the name of the loop index.
   *
   * @param loop Trivial loop to check.
   * @return <code>true</code> if safe to run in parallel,
   *         <code>false</code> otherwise.
   */
  @Override
  public boolean check(TrivialLoop loop) {
    if (loopIndices.contains(loop.getIndex().getName())) {
      Logger.getLogger("annotation").info("Accepted " + loop + " for parallelisation.");
      return true;
    } else {
      Logger.getLogger("annotation").info("Rejected " + loop + " for parallelisation.");
      return false;
    }
  }
}
APPENDIX H
Project Proposal
Peter Calvert
Trinity College
prc33
Computer Science Tripos Part II Individual Project Proposal
Parallelisation of Java for Graphics Processors
October 22, 2009
Project Originator: Peter Calvert
Resources Required: See attached Project Resource Form
Project Supervisors: Dr Andrew Rice and Dominic Orchard
Signatures:
Directors of Studies: Dr Arthur Norman and Dr Sean Holden
Signatures:
Overseers: Dr David Greaves and Dr Marcelo Fiore
Signatures:
Introduction and Description of the Work
In the past, improvements in computational performance have taken the form
of higher clock speeds. More recently, however, increased performance has come
from the use of multiple processors to solve independent parts of a problem in
parallel.

Graphics processors (GPUs) are a good example of this, and are commonly
architected as stream processors, meaning that they can apply the same set of
instructions across a grid in parallel. As a result, there has been significant
recent interest in using them for more general computation. In particular,
they are suited to running loops in parallel.

However, it is a well-known problem that developers find it hard to reason about
the interactions of code running in parallel. Furthermore, most existing code is
sequential, and thus gains no performance from executing on parallel
architectures: it must be recompiled, or in some cases rewritten, to benefit.
Automatic parallelisation aims to address this by analysing existing sequential
code and identifying areas that can be run in parallel.
This project aims to make it possible to utilise parallel processors by compiling
appropriate loops for GPU execution. Initially, developer input will be required
to determine whether the conversion maintains correctness. However, as the
project develops, it is hoped that some of these decisions can be automated. The
project will be evaluated both by the performance gain resulting from parallel
computation, and also by the scope of the analyses made.

The compilation will be made from Java Virtual Machine (JVM) bytecode, since
a number of languages¹ can be compiled for it (including Ruby, Python
and Scala). It is also relatively simple, and libraries exist to aid in its analysis².
The Low Level Virtual Machine (LLVM) would have been a viable alternative
for similar reasons, but was dismissed due to lack of familiarity.

¹ http://en.wikipedia.org/wiki/List_of_JVM_languages
² ASM (http://asm.ow2.org/)

The target of the compilation will be NVIDIA's devices, due to the complete
framework (CUDA) that they have made available to allow GPU kernels to be
written alongside CPU code, which will make development easier. A more
standardised approach, OpenCL, is still at the draft stage.

While in general determining whether a loop's iterations are independent is
undecidable, there are solutions given certain constraints which could be introduced.
There are also transformations that could be applied beforehand to remove some
dependencies. A major difficulty often experienced relates to checking whether
variables are aliasing, so this will be left in as a check for the user to make. These
automatic extensions could be evaluated in terms of the accuracy of their analysis,
and also the proportion of loops in sample code that they can consider.
Resources Required
Access will be needed to a suitable graphics processor that supports the NVIDIA
CUDA architecture. However, during development it will be possible to use the
emulation mode included in the NVIDIA development tools.
Starting Point
This project will be undertaken starting from the following knowledge and experience:
• General knowledge of JVM bytecode from the Part IB course Compiler
Construction.
• Successful compilation and execution of a couple of CUDA examples under
the emulation environment.
• Rudimentary code put together during the first week of Michaelmas term
that produces an unrefined graph of JVM code using the ASM library, and
then detects loops in this.
• Preliminary reading over the long vacation into compiler optimisation
techniques and dependency analysis.
Further knowledge will be gained during Michaelmas term of Part II from the
Optimizing Compilers course.
Substance and Structure of the Project
In order to allow any compilation or analysis to occur, the Java bytecode must
first be read in and represented in a structure suitable for both control- and
data-flow analysis. This will be a graph of basic blocks, each of which contains
a data-flow graph. To allow the compiler, analyses and transformers to traverse
the structure, a variant of the visitor pattern should be implemented.
The project can then be divided into the following stages; starred items are
considered possible extensions rather than core parts:
1. Detection of loops within the control flow graph (JVM bytecode represents
control flow in an unstructured manner) and insertion of the appropriate
‘loop’ nodes. This can be done using analysis of each basic block's
dominators.
2. Wrappers that can transfer the various JVM primitive types and arrays
to the GPU. This would be done using Java’s native code interface (JNI).
At this stage it is also necessary to be able to invoke the kernels over the
required dimensions, converting these into a suitable grid of blocks for the
size of GPU available.
3. Compilation of loop bodies for execution on a NVIDIA CUDA compatible
GPU. Since NVIDIA already provide a C compiler for this, the simplest
approach here is to generate C code from JVM bytecode.
4. Detection of which variables need to be passed into the CUDA kernel.
5. Transformation of the Java class to use the relevant wrappers in place of
the original loop code.
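For step 2, mapping a loop's iteration space onto a grid of CUDA blocks amounts to a ceiling division by the chosen threads-per-block. A sketch of the arithmetic (the block size here is illustrative, not a value fixed by the project):

```java
public class GridSizing {
    // Number of thread blocks needed to cover n iterations with the given
    // threads-per-block, rounding up so that no iteration is missed.
    static int blocksFor(int n, int threadsPerBlock) {
        return (n + threadsPerBlock - 1) / threadsPerBlock;
    }
}
```

Because the final block may overhang the iteration space, each kernel thread must still guard its work with a bounds check of the form `if (i < n)`.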
* Automatic detection of the loop variable and its bounds rather than
prompts to the user. This will be characterised by the variable that is
used in the exit condition, and which is also only written to by a single
INCREMENT instruction on each iteration (this instruction also accepts negative
increments for the case of a decrementing loop).
* Basic dependency analysis of variable and field usage, for array accesses the
relatively simple GCD test should be used (allowing analysis where array
usages are of the form ax + b).
* Support for compiling object oriented JVM code to CUDA C.
* Loop-invariant code motion: this is a common optimisation used by all
compilers; however, since the code here is being split and passed to two
separate compilers, there is no scope for code to be moved from inside the
loop to the outside.
* Runtime checks for aliases and regular shaped arrays.
* A constrained version of loop fission (or loop distribution ) in which we
require that the loop body does not contain conditional blocks (i.e. just
sequential instructions and nested loops). This splits existing loops into
multiple loops, so that at least some of these can be run in parallel, even if
the combined loop could not.
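The GCD test mentioned above rests on a simple fact: for accesses of the form a1*i + b1 and a2*j + b2, a cross-iteration dependence can exist only if gcd(a1, a2) divides (b2 - b1). A minimal version of the check (illustrative, not the project's implementation):

```java
public class GcdTest {
    static int gcd(int a, int b) {
        a = Math.abs(a);
        b = Math.abs(b);
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a;
    }

    // Returns true if a dependence between accesses a1*i + b1 and a2*j + b2
    // cannot be ruled out. A false result proves independence; a true
    // result is conservative (a dependence may or may not actually occur).
    static boolean mayDepend(int a1, int b1, int a2, int b2) {
        int g = gcd(a1, a2);
        if (g == 0) {
            return b1 == b2;  // both accesses use constant indices
        }
        return (b2 - b1) % g == 0;
    }
}
```

For example, writes to a[2*i] and reads of a[2*i + 1] touch disjoint (even vs odd) elements, so the test proves independence; with unit strides gcd is 1 and the test must conservatively report a possible dependence.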
Using existing code from a benchmark suite³ as well as other code that can be
sourced, an evaluation will then be drawn up on the performance gains that can
be achieved. Additionally, these gains will be compared with those made by
hand-written parallel versions of some of the benchmarks. The success of the
automatic checks at detecting safe loops will also be evaluated. Where safe loops
were not detected as such, it will be noted (when obvious) what further analysis
or transformations may have helped. This could then be used to guide any future
work.
Success Criteria
The core parts of the project will have been a success if:
1. Existing Java code (that has had GPU areas manually marked) can be run
using CUDA hardware, producing the same results.
2. The performance of CUDA-enabled benchmarks can be compared to their
original running time, and also to the running time when the conversion to
CUDA code is done by hand.
3. In some cases, an overall speed-up can be found. However, this will not
always be possible due to the transfer overhead associated with using the
GPU. Given sufficiently large problem sizes, this overhead should become
negligible.
The automatic detection extension to the project will have been a success if
common dependency analysis techniques can be evaluated based on their ability
to detect loops that are safe for parallelisation.
Timetable and Milestones
The timeline below is structured into two-week ‘slots’. In allocating work to slots,
there were several aims in mind:
• To have a general structure in place that allows independent testing of
separate components as early as possible.
³ Java Grande (http://www2.epcc.ed.ac.uk/computing/research_activities/java_grande/sequential.html)
• To attempt the most difficult and risky parts of the project early on, so
that there is plenty of recovery time if problems do arise.
• To implement all required features and evaluate these before extensions are
incorporated.
• To write a draft dissertation as the work is done, rather than leaving it as
a big job for the end.
Slot 0: 1st October to 16th October
• Discuss with Researchers, Overseers and Director of Studies the feasibility
of the project idea, along with background reading to assess the existing
work in the area, and the quantity of work entailed.
• Arrange with Project Supervisors a schedule of meetings to ensure the
project stays on track.
• Organise access to equipment for the project (i.e. a capable computer with
CUDA GPU), as well as setting up a regular backup system.
Milestones: Project proposal and availability of CUDA GPU.
Slot 1: 17th October to 30th October

• Experiment with CUDA and gain familiarity with what it can do.
• Rework preliminary flow graph producing code, taking more care over the
data structure.
• Based on the algorithms being used, implement traversal facilities for the
flow graph that give easy access to the relevant information and structure.
• Rework the preliminary loop detection code using the structure from above.
Milestone: Be able to read in JVM class files and represent both the control
flow and data flow inherent in them, recovering loop structures.
Slot 2: 31st October to 13th November
• Produce code that can transfer primitive Java types and also arrays onto a
GPU.
• Produce code that can invoke a compiled CUDA kernel from Java.
Milestone: Implementation of all required CUDA wrappers in JNI.
Slot 3: 14th November to 27th November
• Produce code that can detect which variables need to be transferred to and
from the GPU for a given block of code.
• Produce code that generates valid CUDA C for a given section of JVM
bytecode.
Milestones: Be able to detect which variables need to be transferred to and
from the GPU, and be able to generate CUDA C from bytecode.
Slot 4: 28th November to 11th December
Use this time to consolidate and tidy up any loose ends in the code, and test it
on a wider range of JVM bytecode.
Due to end of term events and also a ski holiday (4th to 13th December), less
work has been scheduled for this slot.
Slot 5: 12th December to 25th December
• Tie components together to be able to produce rewritten class files that
invoke GPU kernels rather than the original loops.
• Start drafting a dissertation for the core parts of the project, using notes
made whilst this was implemented in slots 1 to 3.
Milestones: Core implementation complete, and dissertation with most structure
drafted along with content for the core preparation/implementation.
Slot 6: 26th December to 8th January

• Catch up time to fix non-critical bugs that have been put off during previous
slots.
• Source as many benchmarks and suitable applications written in JVM
languages as possible (ideally containing a couple of hundred loops in total
across all the code).
• Work out safe loops in the benchmark code collected.
• Evaluate the performance improvements from the CUDA compilation for
the benchmark code.
Milestone: Extensive set of benchmarks for CPU and CUDA versions.
Slot 7: 9th January to 22nd January
• Manually produce CUDA versions of some of the benchmarks, and add the
performance of these to the evaluation.
• Start writing the evaluation section of dissertation based on the results.
• Decide on whether to implement extensions, and if so how much of the
automated detection to attempt.
Slot 8: 23rd January to 5th February
• Prepare the required progress report and the accompanying presentation.
• Work on extensions / catch up.
Milestones: Progress report and presentation.
Slot 9: 6th February to 19th February
Further extensions and catch up time.
Milestone: Complete code base.
Slot 10: 20th February to 5th March
Update the dissertation with details of any extension work, and prepare it to
draft standard (based on the work already achieved).
Milestone: Complete draft dissertation.
Slot 11: 6th March to 19th March
End of Lent term / Easter holiday, emphasis on revision.
Slots 12 and 13: 20th March to 16th April
Easter holiday, emphasis on revision.
Slot 14: 17th April to 30th April
This coincides with the beginning of Easter term. This time will be spent
finalising the dissertation, and proof-reading.

Milestone: Printed dissertation ready to hand in.
Slot 15: 1st May to 14th May
This slot ends with the final deadline for the dissertation. It is intended that
this slot won’t be used, and therefore it provides some buffer time for any serious
issues.