GPU Accelerated Linear Algebra

Table of contents

List of figures
List of tables
Summary
1 Introduction
1.1 Motivation
1.2 Reading guide
1.3 Problem definition
1.4 Method
1.5 Scope
1.5.1 Algorithms
1.5.2 Numerical stability
1.5.3 IEEE 754 and double-precision
1.5.4 BLAS
2 Background
2.1 Linear algebra
2.2 GPU computing
3 Parallel platforms
3.1 Cuda
3.1.1 History
3.1.2 Version
3.1.3 Cuda program
3.1.4 Architecture
3.1.5 Limitations
3.2 GPU.NET
3.2.1 Overview
3.2.2 Development
3.2.3 Execution
3.2.4 Limitations and bugs
3.2.5 Evaluation
4 Hardware platform
4.1 Analysis
4.2 Benchmarking
4.2.1 Memory performance
4.2.2 Arithmetic performance
5 Implementation
5.1 Development environment
5.2 Design decisions
5.3 Optimisation
5.3.1 Strategy
6 Matrix-multiplication
6.1 Analysis
6.1.1 The sequential algorithm
6.1.2 Parallelism
6.2 Simple algorithm
6.2.1 The algorithm
6.2.2 Test and results
6.3 Optimisation
6.3.1 Unroll loop with threads
6.3.2 Tiling v1
6.3.3 Tiling v2 with latency hiding
6.3.4 Tiling v3 with prefetching
6.3.5 Tiling v4 and v5 with more output per thread
6.3.6 Cuda compute capability
6.4 Evaluation
7 LU decomposition
7.1 Analysis
7.1.2 Parallelism
7.2.1 The algorithm
7.2.2 Test and results
7.3 Block LU-decomposition
7.3.1 The block algorithm
7.3.2 Implementation
7.3.3 Test and results
7.3.4 Optimising round 1
7.3.5 Test and results
7.3.6 Optimising round 2
7.3.7 Further optimisation
7.3.8 Large matrices
7.4 Evaluation
8 QR decomposition
8.1 Analysis
8.1.2 Parallelism
8.2.1 The algorithm
8.2.2 Test and results
8.3 Optimisation
8.3.1 Test and results
8.4 Block QR-decomposition
8.4.1 The block algorithm
8.4.2 Implementation
8.5 Evaluation
9 Evaluation
9.1 Cuda
9.2 GPU.NET
10 Discussion and future work
10.1 Project
10.2 Cuda
10.3 Hardware
10.4 Future of GPGPU
11 Conclusion
Bibliography and references
Appendix A - Project evaluation
Appendix B - Implementation considerations
Cuda thread organisation
SIMT and warp size
Elapsed time
Pinned or page-locked memory
Matrix structure
Appendix C - Hardware specification description and analysis
Platform #1
Platform #2
Platform #3
Platform evaluation
Specifications
Evaluation
Appendix D - Development environment problems and solution model
Development model
Cuda C and C++
Appendix E - CGMA and Cuda profiler
Appendix F - Matrix-multiplication CC levels
Appendix G - Report page count

List of figures

Figure 1 - Cuda program sequence diagram
Figure 2 - Cuda architecture with four multiprocessors
Figure 3 - How GPU.NET works as described on TidePowerd.com
Figure 4 - Simplified diagram of a chipset
Figure 5 - Matrix-multiplication process depicted
Figure 6 - The output of the console testing program
Figure 7 - Performance of kernels executed for different CC levels on platform #4
Figure 8 - Performance of simple LU-decomposition on different platforms
Figure 9 - Matrix A being decomposed by block LU-decomposition in steps
Figure 10 - Performance of block LU-decomposition v1 on different platforms
Figure 11 - Computing time of each kernel in block LU-decomposition v1 on platform #4
Figure 12 - Performance of block LU-decomposition v2 on different platforms
Figure 13 - Computing time of each kernel in block LU-decomposition v2 on platform #4
Figure 14 - Performance of block LU-decomposition v3 on different platforms
Figure 15 - Showing the sub-matrix part of the triangular solve method
Figure 16 - A 10.000 x 10.000 matrix LU-decomposed on platform #3 and #4
Figure 17 - Peak performance of LU-decomposition v3 on platform #3
Figure 18 - Storage strategy for the compressed Householder QR-factorisation
Figure 19 - Matrix A being decomposed by block QR-decomposition in steps
Figure 24 - Cuda thread organisation [4]
Figure 20 - Block diagram of a chipset. Source: Intel
Figure 21 - CPU and bus details of platform #1
Figure 22 - System memory details of platform #1
Figure 23 - GPU details of platform #1

List of tables

Table 1 - Hardware specifications for the four platforms
Table 2 - Measured bandwidth of Cuda memory transfer operations
Table 3 - Measured gigaflops performance of GPU
Table 4 - Test result of outer loops matrix-multiplication on platform #1
Table 5 - Test result of outer loops matrix-multiplication no structure on platform #1
Table 6 - Test result of matrix-multiplication for resulting matrix on platform #1
Table 7 - Test result of matrix-multiplication for tiling strategy on platform #1
Table 8 - Tiling with 2 and 4 outputs per thread comparison for different platforms
Table 9 - Kernel invocation overhead ratio of total running time
Table 13 - Cuda built-in variables
Table 10 - GPU specifications for Nvidia GeForce 9400m, platform #1
Table 11 - GPU specifications for Nvidia GeForce 8800 GS, platform #2
Table 12 - GPU specification for Nvidia Tesla C1060, platform #3
Table 14 - Selected profile counter from Compute Visual Profiler User Guide

Summary

The purpose of this project was to uncover characteristics, features and limitations of the Cuda architecture. An optimisation strategy was formed, containing methods and techniques that were expected to increase performance.

Three frequently used linear algebra algorithms, for matrix-multiplication and for LU- and QR-decomposition, were chosen. These algorithms were first implemented in a simple version, and performance and correctness tests were performed. GPU.NET was used as a frame of reference where applicable.

The optimisation strategy was used to improve the performance of the implemented algorithms. It was found that a linear block algorithm could achieve better performance than a regular algorithm.

The main output from this project was a list of recommendations and experiences from the

tests performed on the linear algebra algorithms. The findings from this project suggested

that tiling was the best strategy, followed by latency hiding and coalescing memory access,

when optimisation was the goal.

In addition to the points above, this list describes recommendations based on the testing:

• Avoid using structures as parameters in kernel definitions; use simple types or pointers to them instead.

• Target the highest possible Compute Capability level. Among other things, the precision of instructions is better and the results will be more accurate.

• Unroll loops by making the threads fine-grained. Thread generation and scheduling are cheap.

• Thread block size should be a multiple of the warp size (currently 32).

• Be aware of the overhead of invoking a kernel.

• Note that default instructions deviate from IEEE 754; use the specific IEEE 754 functions for increased precision, at the cost of speed.
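The recommendation on thread block size can be illustrated with a small host-side helper. This is a minimal sketch in plain C, not code from this project: the names WARP_SIZE, round_up_to_warp and blocks_needed are hypothetical, and the warp size of 32 is taken from the recommendation above.

```c
#include <assert.h>

/* Warp size assumed from the recommendation above (currently 32). */
#define WARP_SIZE 32

/* Round a requested thread block size up to the nearest multiple of the warp size. */
static int round_up_to_warp(int threads)
{
    assert(threads > 0);
    return ((threads + WARP_SIZE - 1) / WARP_SIZE) * WARP_SIZE;
}

/* Number of thread blocks needed to cover n elements with one thread per element
   (integer division rounded up, so the last block may be only partly used). */
static int blocks_needed(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}
```

A requested block size of 100 threads would be rounded up to 128, and 1000 elements would then need 8 blocks of 128 threads. Threads in the last block that fall outside the data must be masked out by a bounds check in the kernel.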

Besides the list and suggestions above, there were also methods with doubtful results:

• The Volkov suggestion yielded performance gains on some systems but lower performance on others. It can be useful for low-occupancy kernels, but should be tested and evaluated.

• Data prefetching can both increase and lower performance.

It was found that the underlying hardware and its capabilities played an important role in whether an optimisation technique affected performance. Some methods had a positive effect on some GPUs and a negative effect on others. Analysis and testing should therefore always be performed.

The purpose described in the problem definition was achieved, and the learning goals were

reached with satisfaction.

1 Introduction

This 30 ECTS thesis project was produced by Mikkel Bundgaard-Ovesen from 1st February 2011 to 1st August 2011 at ITU Copenhagen. The project builds on the results from the report “Documentation of the GPUs usability in advanced parallel calculations” [1], and was supervised by Peter Sestoft.

1.1 Motivation

The speed of computers has increased over the years as a result of increased demand for processing power. From the beginning, the CPU has been the preferred architecture for computing. During the last decade, however, an additional computing architecture has evolved, namely the graphics processing unit (GPU). A GPU, also called a massively parallel processor, offers tremendous performance in gigaflops at a relatively low cost.

Different parallel computing architectures, such as Nvidia’s Compute Unified Device

Architecture (Cuda), Open Computing Language (OpenCL) and Microsoft’s DirectCompute have

been developed to serve as a platform for general purpose programming on the GPU (GPGPU),

to enable massively parallel programs.

Utilising the immense GPU power is not a trivial task. The execution model of the GPU is becoming more and more flexible, but being a SIMD model puts restrictions on utilisation. Data-parallel algorithms that have a simple execution path and high arithmetic intensity are usually well suited for processing by the GPU architecture. There are, however, indications that other algorithms, which do not share these characteristics, can in fact be optimised in such a way that they are accelerated by the GPU.

The huge performance offered at a relatively low cost makes it interesting to find out how this power can be harnessed. In this project, I will look into how the linear algebra operation matrix-multiplication and the decompositions LU and QR can be implemented and optimised on the Nvidia Cuda architecture.

1.2 Reading guide

This report is addressed to readers with an interest in GPGPU. The report assumes that the reader has knowledge of C and development experience. Knowledge of linear algebra and the algorithms would be beneficial.

References in the report are shown in the text as [number], and the reference can be found

in the bibliography and references list.

The report is divided into 11 chapters.

The first chapter describes the purpose and the goals of the project.

The second chapter gives a short introduction to linear algebra and the history of GPU computing; readers can skip this chapter.

The third chapter describes the parallel platforms Cuda and GPU.NET. Paragraphs 3.1.4 and 3.1.5 are the most important.

Chapter 4 focuses on the hardware platform and its influence on performance. The different

development and test systems are described, and the importance of the chipset’s North

Bridge is described. The paragraph 4.2.2 holds a description of the CGMA term, and an

analysis is performed.

Chapter 5 describes the development environment together with some design decisions. The most important section is 5.3, which holds the optimisation strategy used later.

Chapter 6 analyses the matrix-multiplication algorithm, and describes its implementation and optimisations. The results of the improvements are found throughout the chapter, and an evaluation can be found in paragraph 6.4.

Chapter 7 deals with LU-decomposition. The chapter describes the algorithm, its implementation and optimisation, along with the test results. An evaluation can be found in paragraph 7.4.

Chapter 8 looks into QR-decomposition. The algorithm is described and analysed, after which it is implemented, improved and tested. Results are found throughout the chapter, and an evaluation can be found in paragraph 8.5.

Chapter 9 summarises the results from all three algorithms and compares them with the initial optimisation strategy. This chapter is important and serves as an evaluation and conclusion.

Chapter 10 looks at the work done so far, and discusses possible extensions to the project. A

broader perspective is also discussed, looking into Cuda, hardware and GPGPU in general.

Chapter 11 is solely the conclusion to the problem definition; for an evaluation of the project's results, please refer to chapter 9.

1.3 Problem definition

The purpose of this project changed during the project period. Initially the focus was firstly to identify linear algebra algorithms suitable for implementation on the parallel Cuda architecture, and secondly, in the process of implementing these algorithms, to try to understand the Cuda architecture. In practice, my supervisor Peter Sestoft and I agreed to focus on three frequently used linear algebra algorithms: matrix-multiplication, LU-decomposition and QR-decomposition. With these algorithms selected in advance, the project can focus on the core objective: to uncover characteristics and features of Cuda.

The following statement and the elaborating points reflect this clarification:

Firstly, implement linear algebra algorithms for matrix-multiplication, LU-decomposition and QR-decomposition and evaluate their performance on the parallel GPU Cuda architecture. Secondly, analyse, test and describe different optimisation techniques relevant to the Cuda implementations, and furthermore describe how they may be used in general.

• Describe the linear algebra algorithms and their characteristics.

• Describe the Cuda architecture and development platform, uncovering

characteristics, features, problems and a future outlook.

• Analyse and implement the linear algebra algorithms on the Cuda platform. Describe how an implementation can be performed, including any benefits as well as limitations.

• Analyse and document optimisation techniques for the algorithm implementations on

• Perform correctness and performance tests.

Learning goals

• Knowledge of linear algebra and linear algebra algorithms

• Understanding of the Cuda architecture and platform

• Obtaining skills in C/C++ and Cuda C

• Ability to implement linear algebra algorithms using C/C++ and C for Cuda

1.4 Method

1. Study literature on linear algebra, C and C++ and Cuda architecture development.

2. Implement basic versions of linear algebra algorithms in C/C++ and C for Cuda using

Visual Studio and Nvidia Nsight. Develop tests and benchmarks and compare results

with comparable CPU implementations.

3. Implement optimisations for the algebra algorithms and compare results with CPU

implementations.

As mentioned before, this thesis builds on the experiences and results of the project “Documentation of the GPUs usability in advanced parallel calculations” [1]. One of the goals of that project was to uncover how the GPU could be utilised from .NET. This is not a specific goal for this thesis; however, I regard it as an important perspective.

During the thesis research period, I discovered GPU.NET by TidePowerd, a framework and tool

whose main feature is to bridge Cuda and .NET. In this project, GPU.NET will be used, where

it makes sense, to compare algorithm implementations and their performance with the pure

Cuda implementations.

It will be interesting to see how GPU.NET performs compared to pure Cuda C/C++, and furthermore, whether GPU.NET is easier to use.

The correctness of the algorithms in both GPU.NET and Cuda will be tested by comparing their results with results computed by the CPU.

1.5 Scope

Not all areas of this project can be analysed and documented; prioritising is important so that the parts that are covered are treated in adequate depth.

1.5.1 Algorithms

This project is an empirical study that documents the implementation, optimisation and performance of existing linear algebra algorithms on the Cuda platform. It is not part of this project to develop new algorithms, but merely to base the testing on existing ones. The algorithms selected are designed for dense linear algebra, and are well known and well documented.

1.5.2 Numerical stability

A numerical stability analysis of the different algorithms is outside the scope of this project.

The algorithms are well known, and are well documented in terms of numerical stability.

That said, the applicability of an algorithm implementation obviously depends on it delivering a correct result. All algorithms are implemented for both the GPU and the CPU, and tests are performed on both platforms to compare the results. The maximum difference between the results indicates how well and how precisely the GPU implementation performs compared to the CPU implementation.
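The maximum-difference measure described above can be sketched as a small C function. This is a minimal illustration under stated assumptions, not the test code used in the project: the name max_abs_diff and the flat float arrays are hypothetical, and the function assumes both results hold the same number of elements and contain no NaN values.

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute element-wise difference between a CPU result and a GPU result.
   Both arrays are assumed to hold n single-precision values in the same layout. */
static float max_abs_diff(const float *cpu, const float *gpu, size_t n)
{
    assert(cpu != NULL && gpu != NULL);

    float max_diff = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float diff = cpu[i] - gpu[i];
        if (diff < 0.0f) {
            diff = -diff;           /* absolute value without needing libm */
        }
        if (diff > max_diff) {
            max_diff = diff;
        }
    }
    return max_diff;
}
```

A result of 0 means the two implementations agree bit for bit on every element; a small non-zero value is expected when the GPU rounds intermediate results differently from the CPU.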

1.5.3 IEEE 754 and double-precision

Cuda devices support double-precision floating point operations from Compute Capability (CC) version 1.3 and higher. Not all of the development and test computers support double-precision operations, and double-precision operations furthermore impact performance significantly, which is why I chose to implement the algorithms using single-precision floating point operations.

The IEEE 754 standard for floating point arithmetic is supported and followed by Cuda, but there are documented deviations from the standard [2]. For instance, the FMAD (multiply-add) instruction loses precision by combining two operations into one instruction, and there are other deviations. These deviations will influence the correctness tests, but because performance is prioritised higher than exact precision, I will not take any specific precautions, as doing so would impact performance. The maximum differences from the correctness tests will give an indication of how these deviations from the IEEE 754 standard may affect precision.
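Why the handling of the intermediate product matters can be shown with a host-side sketch in plain C. It does not reproduce the exact behaviour of Cuda's FMAD instruction; it only illustrates that a multiply-add gives different results depending on how the product is rounded before the addition. The function names mul_then_add and mul_add_wide are hypothetical, and the sketch assumes IEEE 754 round-to-nearest arithmetic on the host.

```c
#include <assert.h>

/* Multiply-add with the product rounded to single precision before the add.
   The volatile store forces the intermediate value to be held as a float. */
static float mul_then_add(float a, float b, float c)
{
    volatile float product = a * b;
    return product + c;
}

/* Multiply-add with the product kept exact: the product of two floats always
   fits in a double, and the sum is formed in double precision and rounded to
   single precision only at the end. */
static float mul_add_wide(float a, float b, float c)
{
    double exact_product = (double)a * (double)b;
    return (float)(exact_product + (double)c);
}
```

For a = b = 1 + 2^-12 and c = -(1 + 2^-11), the first function returns 0 while the second returns 2^-24. Both inputs are identical; only the rounding of the intermediate product differs. Differences of this kind are what the maximum-difference figures from the correctness tests capture.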

1.5.4 BLAS

Basic Linear Algebra Subprograms (BLAS) is an interface for linear algebra operations, and it offers optimised operations for vectors and matrices. Many linear algebra algorithms are designed on the basis of these operations, but the implementations in this project will not use any BLAS API, even though Cublas would be the obvious choice.

To really uncover the architecture's capabilities it is necessary to experiment with it directly. For that reason I implement all algorithms without the use of such math libraries. This means that the full performance potential of the algorithms will not be achieved, but it will give better insight.

2 Background

This chapter serves as an introduction to the ideas that are used throughout the report. First, linear algebra is briefly described, followed by the concept of parallel computing in relation to the GPU.

2.1 Linear algebra

Hermann Grassmann is known as the inventor of linear algebra [3]. He did not invent and describe the entire field, but recognised linear algebra as a formal theory. In his two editions of “Ausdehnungslehre”, he describes some important ideas that helped define the basis of linear algebra as it is known today.

Linear algebra is a term that covers several different topics and binds them together. Some of these topics are systems of linear equations, linear transformations, vectors, matrices, determinants and vector spaces.

Matrix-multiplication, LU-decomposition and QR-decomposition are frequently used in linear algebra, and each serves a purpose in solving either a system of linear equations or a linear least squares problem. These problems are in turn often encountered in the fields of research, engineering, physics, economics and statistics.

2.2 GPU computing

The performance and capabilities of graphics processors have gone through an incredible development from the beginning up until today. From the command-line based user interfaces of the 1980s, through the more graphically driven interfaces of the 1990s and all the way until today, graphics processing power has increased and evolved to support 2D, 3D, OpenGL, DirectX and more [4]. The release and popularity of 3D computer games furthermore accelerated the demand for, and development of, more and more powerful graphics processors.

On 31st August 1999 Nvidia released the GeForce 256, the world’s first GPU [5], which could perform graphical computations directly on the graphics processor. ATI, the main competitor of Nvidia, soon followed by releasing their Radeon R100 chips with the same capabilities.

But it was not until 2001 that the major breakthrough in relation to GPU computing came.

Nvidia released the GeForce 3, the first chip to support Microsoft’s DirectX 8 standard, which

required the chip to support programmable vertex- and pixel shaders. ATI followed with the

release of their Radeon R300 chip in 2002.

Programmable vertex- and pixel shaders were intended solely for graphics rendering; however, they were actually small programs that performed a programmed computation on some input and then returned the output. The computational power of the GPU, combined with the programmable vertex- and pixel shader feature, made developers look into how the GPU could solve problems other than graphics rendering.

3 Parallel platforms

This chapter takes a deeper look at Cuda and GPU.NET, describing their usage, features and performance-limiting factors. GPU.NET uses Cuda, so most of the effort goes into describing and analysing Cuda.

3.1 Cuda

Cuda stands for Compute Unified Device Architecture and is a generic term covering the GPU

architecture of Nvidia’s graphic cards, development platform and tools. It can be described as

a parallel computing architecture and development platform that enables the GPU to solve

general purpose computational problems [4].

3.1.1 History

The first GPU was released in 2001, and in the early stages the only way to access the GPU

was through a graphics API, such as OpenGL or DirectX. This meant that general use of the

GPU was difficult. Nvidia saw the potential of the GPU as another computing platform, and

they initiated the development of a completely new architecture. This architecture was to

overcome the limitations of earlier GPUs by allowing General-Purpose computation on

Graphics Processing Units (GPGPU), without the need to use a graphics API.

In 2006 Nvidia released the GeForce 8800 GTX, the first GPU to support the Cuda architecture.
Later, in June 2007, the first version of the Cuda development toolkit was released. Over the
years the architecture and toolkit have undergone development and improvements, with the
latest toolkit released in May 2011.

3.1.2 Version

The latest version of Cuda when this project was initiated was version 3.2, released in
November 2010. Much has happened since, and the current toolkit version, as of May
2011, is version 4.0.

Some of the new features include “Share GPUs across multiple threads”, “Use all GPUs in the

system concurrently from a single host thread” and “No-copy pinning of system memory, a

faster alternative to cudaMallocHost()”. Even though the last feature is interesting, none of

these newly added features bring any major benefit to this project, so any upgrade during the

project phase was deemed unnecessary. Hence, version 3.2 is used throughout this project

and report.

3.1.3 Cuda program

A Cuda program is a hybrid between code processed by the CPU and code processed in

parallel by the GPU. The CPU is referred to as the host, and the CPU code is called host code.

The GPU is referred to as the device, and its code is, unsurprisingly, called device code.

A typical example of a Cuda program is shown in the following. The host code is written in

standard C or C++ as shown here:

     1. // Declare device pointers
     2. int *d_base, *d_n, *d_out;
     3. int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
     4.
     5. // Allocate memory on device
     6. cudaMalloc( (void**)&d_base, N * sizeof(int) );
     7. cudaMalloc( (void**)&d_n, N * sizeof(int) );
     8. cudaMalloc( (void**)&d_out, N * sizeof(int) );
     9.
    10. // Copy data from host -> device
    11. cudaMemcpy( d_base, base, N * sizeof(int), cudaMemcpyHostToDevice );
    12. cudaMemcpy( d_n, n, N * sizeof(int), cudaMemcpyHostToDevice );
    13.
    14. // Execute kernel
    15. power<<<blocks, THREADS_PER_BLOCK>>>( d_base, d_n, d_out, N );
    16.
    17. // Copy data from device -> host
    18. cudaMemcpy( out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost );
    19.
    20. // Free memory on device
    21. cudaFree( d_base );
    22. cudaFree( d_n );
    23. cudaFree( d_out );
    24.
    25. // Let the Cuda runtime know that we are finished
    26. cudaThreadExit();

The device code is structured in a function called a kernel, as shown here:

    // Device kernel called power
    __global__ void power( int *base, int *n, int *output, int N ) {

        // The unique thread id
        int tid = threadIdx.x + blockIdx.x * blockDim.x;

        // Guard, only if thread has actual data to process
        if (tid < N) {

            // Initialise registers with values from the input arrays
            // (locals are named b and num so they do not shadow the parameters)
            int p = 1, b = base[tid], num = n[tid];

            // Compute p = b^num
            for (int i = 1; i <= num; ++i)
                p = p * b;

            // Write result to output array
            output[tid] = p;
        }
    }

The host sends commands and messages to the device by invoking functions. A sequence

diagram, based on the program illustrated above, is shown in Figure 1.

Figure 1 - Cuda program sequence diagram

Lines 6 to 8 in the host code allocate memory on the device, lines 11 and 12 copy data to the
device, and the kernel is invoked in line 15. When the kernel is done processing, the host
copies the data back from the device memory in line 18, after which the memory is released
again in lines 21 to 23.
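The indexing used by the kernel can be tried out without a GPU. The following is a minimal sketch in plain C, not the project's code: the two loops stand in for blockIdx.x and threadIdx.x, which the device would execute in parallel, and the function names blocks_needed and power_emulated are hypothetical.

```c
#include <assert.h>

/* Number of blocks needed so that blocks * threads_per_block >= n
   (integer ceiling division, as in line 3 of the host code). */
static int blocks_needed(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Sequential CPU emulation of the power kernel. The outer loop plays the
   role of blockIdx.x and the inner loop the role of threadIdx.x. */
static void power_emulated(const int *base, const int *n, int *out,
                           int count, int threads_per_block)
{
    int blocks = blocks_needed(count, threads_per_block);
    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threads_per_block; ++thread) {
            int tid = thread + block * threads_per_block;
            if (tid < count) {   /* guard: the last block may be partly idle */
                int p = 1;
                for (int i = 1; i <= n[tid]; ++i)
                    p *= base[tid];
                out[tid] = p;
            }
        }
    }
}
```

With five elements and four threads per block, two blocks are launched and the guard keeps the three surplus threads of the second block from touching memory outside the arrays.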

3.1.4 Architecture

Cuda is an architecture consisting of both the physical layout of the GPU and the logical

structure of threads in the Cuda runtime. The exact physical layout and specifications differ

for different chip versions¹, and the capabilities of these chips are defined by the Compute

Capability version (CC). The first chips were released with CC 1.0, and the latest version is

Physical layout

The GPU shown in Figure 2 is a simplified G86 GPU with CC version 1.1. It consists of four
streaming multiprocessors (SM), and each SM contains 8 streaming processors (SP) and two
special function units (SFU). In an SM, the SPs process normal instructions like add and
multiply, and in this case 8 SPs are able to process 8 normal instructions per clock cycle. The
SFUs process instructions related to square roots, sine and cosine, logarithms and
exponentials, so a kernel with heavy usage of these instructions will be limited to only two
instructions per clock cycle.

Besides the SPs and SFUs, an SM also has access to different memory types. Registers and
shared memory are limited in size but very fast, and both are local to each SM.
Registers are the fastest memory type and are local to a thread; an SM with CC 1.1 has
8192 (8K) 32-bit registers. Shared memory is a bit slower, but there is
more of it and it is shared between the threads in a block. The shared memory size of an SM
with CC 1.1 is 16 KB.

All SMs of a GPU have read access to the constant memory, which for all current CC
versions is 64 KB. Access to the constant memory is cached and generally faster than access to the
global memory, which is the device’s main memory. All SMs have shared access to global memory, and its
exact size and speed are device dependent.

¹ For instance, G80 allows 768 threads per multiprocessor, whereas GT200 allows 1024.

Figure 2 - Cuda architecture with four multiprocessors

Memory

As described above, there are different memory types, but the common denominator is that

they are all typically based on dynamic random access memory (DRAM). Accessing a single bit

in a DRAM cell is a slow process, and to improve performance DRAM controllers read several

consecutive bits in parallel [4]. This means that actual random access to DRAM memory will

yield a low performance. So to achieve the highest memory performance possible, the kernel

should access consecutive memory locations, as much as possible. This is also called

coalesced memory access.

Accessing memory in a coalesced manner is important for all memory types. This also holds for
shared memory even though it is on-chip and fast; in addition, access to shared
memory should minimise bank conflicts. Shared memory on CC version 1.1 has 16 banks,
and the bandwidth of each bank is 32 bits per two clock cycles. If two or more threads access
the same bank, the accesses are handled sequentially, which impacts performance. From
CC version 2.0 onwards, simultaneous access to the same bank has been optimised: multiple
reads from a single bank result in only a single read being performed, after which
the value is broadcast to all threads.
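To make the notion of a bank conflict concrete, the following sketch in plain C counts how many threads land on the same bank for a strided access pattern. It rests on two assumptions that the text does not state: consecutive 32-bit words are assigned to consecutive banks, and 16 threads issue their shared memory requests together. The function name bank_conflict_degree is hypothetical.

```c
#include <assert.h>

/* Worst-case number of threads that hit the same shared-memory bank when
   thread t reads the 32-bit word at index t * stride_words.
   A return value of 1 means the pattern is free of bank conflicts. */
static int bank_conflict_degree(int threads, int banks, int stride_words)
{
    assert(threads > 0 && banks > 0 && banks <= 64 && stride_words > 0);
    int hits[64] = {0};
    int worst = 0;
    for (int t = 0; t < threads; ++t) {
        /* assumed mapping: consecutive words lie in consecutive banks */
        int bank = (t * stride_words) % banks;
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

Under these assumptions a stride of one word is conflict free, a stride of two words gives a two-way conflict, and a stride of 16 words sends all 16 threads to the same bank. Odd strides are conflict free because they share no common factor with the 16 banks.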

Cuda threads

Cuda threads are very different from CPU threads; the only similarity is that they
process data in parallel. The GPU can be classified as SIMD, which makes its applicability
differ from that of a CPU. A task, or a set of instructions, can be performed on different data
in parallel, but SIMD also means that two independent tasks cannot be performed in parallel
by the GPU. Furthermore, threads in Cuda are very lightweight compared to threads on the
CPU. A typical Cuda program uses several thousand Cuda threads, and thread generation and
scheduling should therefore not be considered a limitation.

Cuda threads are organised into blocks, and the threads in a block can share memory and be
synchronised. The blocks are in turn organised into a grid. This logical thread structure
allows threads to be organised in several dimensions, so that the thread layout can
mirror a specific data structure; a matrix, for instance, is defined in two
dimensions.

For more details about how threads are organised and how this affects their usage in a kernel, please
refer to appendix B.

Thread scheduling

The presumption is that the threads of a block are grouped and processed in parallel. This is
conceptually true, but it is not what actually happens. The current implementation of
thread scheduling in e.g. G80 and GT200 chips schedules threads in units called warps.
A warp is a bundle of 32 threads executed in parallel, and a block with, for instance, 128
threads is partitioned into 4 warps.

The threads of a warp share a single instruction stream; hence Cuda is a SIMD architecture. This is a design
decision to reduce hardware cost and to enable optimisation techniques, and it is
relevant to the developer.

The size of a warp has direct impact on the recommended size of blocks. Consider the

example where a problem is organised into 20 blocks each with 10 threads, giving a total of

20 x 10 = 200 threads. Cuda executes 32 threads in a warp in parallel. In the example above,

only 10 threads are available per block. Cuda will in this case fill up the warp with 22 empty

threads, resulting in 20 x 22 = 440 empty threads being created. It is advisable to set the

block size to a multiple of the warp size, currently 32 [4].
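The arithmetic of the example can be written down as a small sketch in plain C. The function names warps_per_block and idle_lanes are hypothetical; the warp size is passed as a parameter because the text describes 32 as the current value.

```c
#include <assert.h>

/* Warps needed to hold one block (ceiling division by the warp size). */
static int warps_per_block(int threads_per_block, int warp_size)
{
    assert(threads_per_block > 0 && warp_size > 0);
    return (threads_per_block + warp_size - 1) / warp_size;
}

/* Empty thread slots over the whole grid: slots in scheduled warps that
   have no thread assigned to them. */
static int idle_lanes(int blocks, int threads_per_block, int warp_size)
{
    assert(blocks >= 0);
    int lanes_per_block = warps_per_block(threads_per_block, warp_size) * warp_size;
    return blocks * (lanes_per_block - threads_per_block);
}
```

For 20 blocks of 10 threads this gives 20 x 22 = 440 empty slots, whereas any block size that is a multiple of 32 gives none.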

Occupancy and latency

An SM with CC version 1.1 is able to handle 768 resident threads, and as the current warp
size is 32, the maximum number of resident warps per SM is 24. The actual number
depends on the kernel’s consumption of registers and shared memory. The Cuda occupancy
is, for a given kernel, the ratio of active warps to the maximum number of warps supported
by the SM. In other words, the occupancy indicates how many active warps and threads an SM
can hold.
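A rough estimate of the occupancy can be sketched in plain C as follows. This is a simplified model, not Nvidia's occupancy calculator: it ignores the granularity with which the hardware allocates registers and shared memory, and the limit of 8 resident blocks per SM is an assumption that is not taken from the text. The type sm_limits and the function names are hypothetical.

```c
#include <assert.h>

/* Per-SM limits. The test values 768 / 8192 / 16384 / 32 are the CC 1.1
   figures quoted in the text; max_blocks is an assumed value. */
typedef struct {
    int max_threads;   /* resident threads per SM */
    int max_blocks;    /* resident blocks per SM (assumption) */
    int registers;     /* 32-bit registers per SM */
    int shared_bytes;  /* shared memory per SM in bytes */
    int warp_size;
} sm_limits;

static int min_int(int a, int b) { return a < b ? a : b; }

/* Resident warps per SM for a kernel, ignoring allocation granularity. */
static int active_warps(sm_limits sm, int threads_per_block,
                        int regs_per_thread, int shared_per_block)
{
    assert(threads_per_block > 0 && regs_per_thread > 0 && shared_per_block >= 0);
    int warps_per_block = (threads_per_block + sm.warp_size - 1) / sm.warp_size;
    int blocks = sm.max_blocks;
    blocks = min_int(blocks, sm.max_threads / (warps_per_block * sm.warp_size));
    blocks = min_int(blocks, sm.registers / (regs_per_thread * threads_per_block));
    if (shared_per_block > 0)
        blocks = min_int(blocks, sm.shared_bytes / shared_per_block);
    return blocks * warps_per_block;
}

/* Occupancy: active warps divided by the maximum number of warps per SM. */
static double occupancy(sm_limits sm, int threads_per_block,
                        int regs_per_thread, int shared_per_block)
{
    int max_warps = sm.max_threads / sm.warp_size;
    return (double)active_warps(sm, threads_per_block, regs_per_thread,
                                shared_per_block) / max_warps;
}
```

In this model a kernel with 256 threads per block and 10 registers per thread reaches 24 of 24 warps, while raising the register count to 16 per thread drops it to 16 of 24 warps, because only two blocks fit in the 8192 registers.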

The number of clock cycles it takes for a warp to be ready to execute its next instruction is
called latency. Some instructions incur high latency, for instance global memory access,
where many clock cycles pass before the data is supplied. The execution of a warp does not halt
at the memory access itself; execution continues until the data is actually needed, and if the data
is still unavailable the scheduler switches to another warp. Whenever a warp incurs latency,
the SM should switch to another warp and start processing it to achieve full utilisation of the
SM. The Cuda occupancy ratio can therefore indicate whether the performance of a kernel suffers from high
latency. Vasily Volkov [6][7] has however shown that high performance does not necessarily
require a high occupancy, so improvements based solely on the occupancy ratio should be
carefully evaluated.

Optimisation that aims at full utilisation of the SM is called latency hiding.

3.1.5 Limitations

Using Cuda can be advantageous, but there are also limitations that influence the
implementation and optimisation of algorithms. In the following I describe a couple that are
relevant to this project.

One should be aware that the Cuda architecture was developed for speed at the expense of

precision. There can, for that reason, be a higher numerical instability of an algorithm

implemented on the GPU, when compared to the same algorithm implemented on the CPU.

For example, the operations multiply and add can be contracted to a single FMAD (multiply-add)
instruction, which deviates from the IEEE 754 standard. FMAD instructions
are often used in linear algebra algorithms to calculate dot-products, vector
norms and more. Nvidia has been focusing on this, and later CC versions should comply
better with the IEEE 754 standard.

Latency, warps and memory access, described in the architecture chapter (3.1.4), are all factors

influencing the computational performance. One should therefore not expect to reach close

to the theoretical performance of a device, as these factors will limit the performance of an

algorithm. The theoretical properties can, on the other hand, be of assistance in the analysis

and optimisation of a kernel.

Cuda C is an extended version of ANSI C, and is the language in which the device code or

kernels are written. A kernel function is able to call other device functions, but recursion is

currently not supported.

Cuda is developed by Nvidia, and can only be used on Nvidia GPUs.

3.2 GPU.NET

The framework GPU.NET consists of a runtime and a compiler, which are integrated with

Visual Studio. This framework makes it possible to develop host and kernel code directly in

.NET with all the benefits that .NET and Visual Studio provides.

The version being used for this project is GPU.NET v1.0.3.5.

3.2.1 Overview

GPU.NET currently only supports Cuda, but is expected to support other parallel architectures in
the future. GPU.NET allows a developer to write host and device code directly in .NET using
the API from the provided assembly, thereby performing computations on hardware-accelerated
architectures.

Accelerating .NET code is achieved in two steps; first the .NET code is written, decorated and

compiled, then the GPU.NET runtime accelerates the program during execution, as shown in

Figure 3.

Figure 3 - How GPU.NET works as described on TidePowerd.com

3.2.2 Development

Visual Studio 2010 is supported for development as well as .NET 4. The kernel is annotated

with a KernelAttribute that also holds the name of the CPU method to be used if no

acceleration hardware is present. ThreadIndex and BlockIndex hold the same values as when

used in Cuda directly.

    [Kernel(CustomFallbackMethod = "MatrixMultiplication_CPU")]
    private static void MatrixMultiplicationSimpleNS_GPU(float[] a, float[] b, float[] c, int aheight, int awidth, int bwidth)
    {
        // Thread ID: each thread computes one row of the result
        int tid = ThreadIndex.X + BlockIndex.X * BlockDimension.X;

        if (tid < aheight)
        {
            for (int j = 0; j < bwidth; ++j)
            {
                float sum = 0;

                for (int k = 0; k < awidth; ++k)
                {
                    float av = a[tid * awidth + k];
                    float bv = b[k * bwidth + j];
                    sum += av * bv;
                }
                c[tid * bwidth + j] = sum;
            }
        }
    }

The kernel returns void and is private, and can therefore not be called directly. Another
public and static method is created, which calls the kernel as shown in line 5.

    1. public static float[] MatrixMultiplicationSimpleNS(float[] a, float[] b, int aheight, int awidth, int bwidth)
    2. {
    3.     var c = new float[aheight * bwidth];
    4.
    5.     MatrixMultiplicationSimpleNS_GPU(a, b, c, aheight, awidth, bwidth);
    6.
    7.     return c;
    8. }

The .NET code is compiled to a normal assembly, in which the GPU.NET compiler then injects

calls to the GPU.NET runtime. The result is a modified .NET assembly where calls to any

kernel method are being redirected to the GPU.NET runtime, and hence the GPU.

3.2.3 Execution

When the program is being executed, the GPU.NET runtime detects the availability of any

supported hardware. When a call to a kernel is detected, the kernel code is then passed to

the correct vendor plug-in, which in turn JIT compiles the code to the hardware vendor’s

instruction set architecture. Lastly the runtime executes the compiled device code and

transfers any data back to the .NET runtime. If no hardware acceleration is present, then the

CPU version of the kernel is called.

3.2.4 Limitations and bugs

GPU.NET is a relatively young framework that contains obvious bugs and limitations.
Version 1.0.3.5 contains the following bugs and limitations:

• There is currently only support for Nvidia Cuda v3.0 and newer, but support for AMD
devices is under development.

• Local variables and parameters can only be of primitive types. Parameters can in

addition to that, also be an array of a primitive type.

• Shared memory can only hold fields which are primitive types or a single dimensional

array of primitive types.

• Kernels must be static and return void and cannot be recursive or call any other

methods.

I have experienced problems with casting variables in the kernel. Casting variables from
double to float, or even float to float, resulted in compile errors. My conclusion is that casting
is not supported and should be avoided, so it is necessary to design an algorithm exclusively
for either single or double precision.

The shared memory size of a kernel must be specified at compile time, whereas in Cuda C it can be
specified dynamically at runtime. Implementing kernels optimised for different data sizes is
therefore difficult, which has led me to set the size of the allocated shared memory high.
This makes sure that kernels will run with different data sizes, but it is not optimal: shared
memory is a scarce resource, and allocating more than needed will lower occupancy.

Occasionally, when a GPU.NET application was executed for the first time, an exception was

thrown. Subsequent executions were processed with no problems. Furthermore, due to a

thread bug a GPU.NET program does not exit by itself. I have solved this by terminating the

process by calling:

    System.Environment.Exit(0);

The last two bugs are expected to be fixed in newer releases.

3.2.5 Evaluation

GPU.NET can definitely be used for testing and playing with GPU acceleration of programs.

But one should, with the current version, expect bugs and minor problems; the framework is

far from mature at this point.

Furthermore, by using JIT compilation GPU.NET will incur a performance hit compared to
Cuda, as Cuda kernels are already compiled before the program runs. This is a design decision made by
TidePowerd, a decision that makes the framework flexible at the expense of performance.
GPU.NET does however cache the JIT-compiled kernel in memory, so subsequent calls to the
same kernel will not incur the same performance hit.

4 Hardware platform

The parallel architecture software has been described above, but there are also hardware

specifications that are important to the performance of Cuda. The following chapter will

analyse hardware specifications and perform two simple tests.

Cuda requires Nvidia GPUs, so all development and test computers must be equipped with an

Nvidia GPU. I used two development and two test computers for this project. All machines are

running Windows 7, and have Cuda v3.2 installed, including the matching drivers. The

following table shows selected hardware specifications for the platforms.

System                                    #1                      #2                      #3                      #4
Type                                      Development             Development             Testing                 Testing
Graphics bus                              8.00 GB/s (PCI-E v2.0)  4.00 GB/s (PCI-E v1.1)  8.00 GB/s (PCI-E v2.0)  4.00 GB/s (PCI-E v1.1)
Host memory                               16.60 GB/s (DDR3)       6.23 GB/s (DDR2)        16.60 GB/s (DDR3)       6.25 GB/s (DDR2)
GPU                                       GeForce 9400m           GeForce 8800 GS         Tesla C1060 (GT200)     GeForce GT440 (GF108)
Cores                                     16                      96                      240                     96
Shader clock                              1100 MHz                1250 MHz                1300 MHz                1645 MHz
Device memory                             8.00 GB/s² (DDR3)       37.45 GB/s (GDDR3)      102.40 GB/s (GDDR3)     51.20 GB/s (GDDR5)
Processing power, MUL+ADD+SF (gigaflops)  51.56                   351.56                  914.06                  462.66
Processing power, MUL+ADD (gigaflops)     34.38                   234.38                  609.38                  308.44
Compute Capability                        1.1                     1.1                     1.3                     2.1

Table 1 - Hardware specifications for the four platforms

² The actual memory speed is 16.60 GB/s, but system #1 has no dedicated device memory and uses host memory. So the device is limited by the speed of the graphics bus.

Two processing powers are stated for the GPU of each system because there are two ways of
estimating the theoretical peak performance in gigaflops. The first is based on the
GPU architecture design, which says that a GPU is capable of performing a multiply-add
instruction dual-issued with a special function instruction per operation cycle, giving three
floating-point operations per cycle. The second is based on a more realistic estimate, in which
an operation cycle performs only the multiply-add instruction, giving two floating-point
operations per cycle.

For a detailed description of the different platforms, please refer to appendix C.
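The two peak figures can be reproduced with a short sketch in plain C. The formula is inferred from the numbers in Table 1 rather than stated in the text: cores times shader clock in MHz times operations per cycle, divided by 1024 to convert to gigaflops. Both the divisor and the function name peak_gigaflops should be read as assumptions.

```c
#include <assert.h>

/* Theoretical peak in gigaflops. flops_per_cycle is 3 for MUL+ADD+SF and
   2 for MUL+ADD. The divisor 1024 is inferred from the values in Table 1. */
static double peak_gigaflops(int cores, double shader_clock_mhz, int flops_per_cycle)
{
    assert(cores > 0 && shader_clock_mhz > 0.0 && flops_per_cycle > 0);
    return cores * shader_clock_mhz * flops_per_cycle / 1024.0;
}
```

For system #3, 240 cores at 1300 MHz give about 914.06 gigaflops with three operations per cycle and about 609.38 gigaflops with two.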

4.1 Analysis

This section will dig a little deeper into the important hardware specifications, and
describe the theoretical performance limits. When dealing with GPUs the important factors
are memory transfer rates and GPU processing power. Figure 4 shows a simplified chipset
diagram, which highlights the important elements, namely the processor, DDR RAM, GPU and
the chipset’s north bridge.

Figure 4 – Simplified diagram of a chipset

In Table 1, the graphics bus indicates the maximum transfer rate between the north bridge

and the GPU device. Host memory is the peak bandwidth between the DDR ram and the north

bridge. A chain is not stronger than its weakest link, and the same holds for data transfer

between the host and device, and vice versa. Consider platform #1, the bandwidth of the host

memory is 16.60 GB/s, but the graphics bus is limited to 8.00 GB/s, which then is the

theoretical peak transfer rate between the GPU device and the host system.
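The weakest-link argument amounts to taking the minimum of the two bandwidths. A minimal sketch in plain C, with the hypothetical function name peak_transfer_gbs:

```c
/* Theoretical peak host <-> device transfer rate in GB/s: the slower of
   the host memory bandwidth and the graphics bus bandwidth. */
static double peak_transfer_gbs(double host_memory_gbs, double graphics_bus_gbs)
{
    return host_memory_gbs < graphics_bus_gbs ? host_memory_gbs : graphics_bus_gbs;
}
```

For platform #1 this returns 8.00 GB/s, and for platform #2 it returns 4.00 GB/s, so in both cases the graphics bus is the limiting link.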

4.2 Benchmarking

The specifications are theoretical, and to give a more realistic performance target I have

tested actual data transfer and processing power.

4.2.1 Memory performance

All systems have been tested with three types of memory transfers: host to device,
device to host and device to device. To improve the performance of memory transfers, host
memory can be allocated as page-locked (pinned) and write-combined. Pinned and
write-combined memory is a scarce resource of the operating system, so a Cuda program should
consume it with caution, as over-use could
impact overall system performance. Bandwidth measurements, shown in the following table,
are performed both with regularly paged memory and with pinned and write-combined
memory (P+WC). For more details please refer to appendix B.

System                    #1               #2               #3               #4
Host -> Device            1,584.5 MB/s     1,434.7 MB/s     4,233.7 MB/s     1,578.6 MB/s
Host -> Device (P+WC)     5,224.9 MB/s     2,513.1 MB/s     5,761.9 MB/s     2,509.8 MB/s
Device -> Host            1,365.9 MB/s     1,178.0 MB/s     3,864.3 MB/s     1,235.1 MB/s
Device -> Host (P+WC)     5,096.2 MB/s     1,687.9 MB/s     5,297.6 MB/s     1,857.9 MB/s
Device -> Device          6,935.4 MB/s³    28,525.7 MB/s    73,463.8 MB/s    21,338.1 MB/s
Device -> Device (P+WC)   6,951.3 MB/s³    28,529.0 MB/s    73,527.3 MB/s    21,339.2 MB/s

Table 2 - Measured bandwidth of Cuda memory transfer operations

³ System #1 does not have any dedicated device memory, so actually the rates from host to device are relevant when a kernel needs to access "device" memory.

The measured transfer rates between host and device are, as expected, faster when using
pinned and write-combined memory. For paged memory transfers the measured bandwidth
is between 17% and 40% of the theoretical bandwidth; this
increases to between 47% and 72% when pinned and write-combined memory is used. The
results also show that, in general, device to host transfers are slower than host
to device transfers.

The memory speed for copying data from host to device and vice versa is mostly important for

hybrid algorithms, meaning an algorithm that solves a problem by using both the CPU and

GPU. An implementation that requires transfer between the host and the device should be

designed with caution, as this would put a restriction on performance. Based on this, the

strategy for the algorithm implementations of this project is to keep the data processing

solely on the GPU, and limit the number of data transfers between host and device. Consider

the worst case memory transfer scenario. System #3 has 4GB of device memory, and if this

GPU was installed in the slowest system, it would be possible to copy all 4GB in about 3.57

seconds.

Global memory access is limited by device memory bandwidth, so the device to device

memory transfer rate is interesting and relevant to the performance of a kernel. The results

are between 41% and 75% of the theoretical limit. Paged or pinned/write-combined memory

does not have any impact as this is an operating system resource, and hence only relevant for

the host memory.

The result of the device to device memory transfer for system #1 is a bit misleading. System

#1, does not have any dedicated device memory, and a device to device transfer rate then

only indicates the peak performance of the DDR3 ram on the host system. Before the data

could actually be processed by the GPU, it would have to pass the north bridge and graphics

bus, which is the same as the host to device memory transfers.

4.2.2 Arithmetic performance

To get a realistic performance target in gigaflops for the GPU, I have created two kernels,
each consisting of the three operations MUL+ADD+SF. The first kernel is typical in the sense that it
reads the data to process from global memory, and is therefore limited by global memory
performance. Since a normal kernel would base its computations on data from global
memory, the peak performance of this kernel can be considered a normal-case scenario and
helps define performance expectations.

The second kernel does not access global memory, but consists of just the three MUL+ADD+SF

operations. The peak performance in gigaflops of this kernel should be closer to the

theoretical maximum arithmetic performance.

System #1 #2 #3 #4

MUL+ADD+SF + global read (gigaflops) 2.19 9.70 27.87 8.18

MUL+ADD+SF (gigaflops) 15.41 60.09 62.29 29.08

Table 3 - Measured gigaflops performance of GPU

In reality, these kernels do not say anything about the maximum expected performance. An

implementation of an algorithm can be optimised in several ways, and so could these kernels.

However, they do suggest that global memory access is indeed a limiting factor. This factor is

referred to as Compute to Global Memory Access (CGMA).

The first kernel performs one memory load from the input and one write to the output, and the
number of operations is three (multiply, add and power). Based on these numbers, the calculated
CGMA is 3 / 2 = 1.5. Consider system #3, where the measured device memory bandwidth is 73,463.8 MB/s. The
kernel uses single-precision floating-point values of 4 bytes each, so the system is able to transfer about
18,365.95 mega single-precision values per second. With a CGMA of 1.5, the peak performance of this
kernel is about 27 gigaflops, which agrees with the result in Table 3. The memory transfer
rates, together with an estimated CGMA, can be an important tool when analysing a kernel for
optimisation.
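The estimate above can be expressed as a small sketch in plain C. The function names cgma and memory_bound_gigaflops are hypothetical, and one gigaflops is taken as 1000 megaflops, which is an assumption about the rounding used.

```c
#include <assert.h>

/* CGMA: floating-point operations per global memory access. */
static double cgma(int flops, int global_accesses)
{
    assert(flops >= 0 && global_accesses > 0);
    return (double)flops / global_accesses;
}

/* Upper bound in gigaflops for a kernel limited by global memory.
   bandwidth_mbs is the device memory bandwidth in MB/s and
   bytes_per_value the size of one element (4 for a single float). */
static double memory_bound_gigaflops(double bandwidth_mbs, int bytes_per_value,
                                     double cgma_ratio)
{
    assert(bandwidth_mbs > 0.0 && bytes_per_value > 0 && cgma_ratio >= 0.0);
    double mega_values_per_second = bandwidth_mbs / bytes_per_value;
    return mega_values_per_second * cgma_ratio / 1000.0;
}
```

With the measured 73,463.8 MB/s of system #3, 4-byte values and a CGMA of 1.5, the bound comes out at roughly 27.5 gigaflops. Raising the CGMA, for instance by reusing loaded values for more operations, raises the bound proportionally.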

With these results in mind, how do they compare with the performance of a CPU? Consider
that a Pentium 4 3.06 GHz CPU computes a single-precision dot-product at
between 1.8 gigaflops (single thread) and 3.08 gigaflops (multiple threads) [8]. In that light,
even though the results in Table 3 are far lower than the theoretical processing power, it
is evident that even un-optimised kernels can reach a peak performance similar to or
higher than that of the CPU.

5 Implementation

In the following chapter I describe the development environment and some design decisions,

but more importantly, I form an optimisation strategy used for the algorithms.

5.1 Development environment

The computers used in this project are based on Windows 7, and the Cuda toolkit version 3.2

was the latest release when this project was initiated. Cuda v3.2 natively supports Visual

Studio 2008 (VS2008). It is possible to enable development in Visual Studio 2010, but this has

proven difficult to set up.

Making VS2008 ready for Cuda development is not a trivial task. The compilation process

includes two compilers, the Nvidia Cuda Compiler (NVCC) and Microsoft’s Visual C++ compiler

(VCC). Configuring VS2008 properly and making NVCC, VCC and the linker work together

remained a challenge. Please refer to appendix D for a detailed description of the problems

involved, and for a development model solution.

5.2 Design decisions

The general rule is that the parts of the algorithm that exhibit little or no data parallelism should
be processed by the host, and the parts that exhibit a rich amount of data parallelism should be
processed by the device. Sometimes it is nevertheless beneficial to process code on the device that cannot
exploit the parallelised architecture. The decisive factors in these situations are the size of

the data, and the time needed to transfer data between the host and the device. The

strategy I will follow is to limit transfer of data between host and device to a minimum [9],

by reducing these transfers to an initial and a final one, like this:

1. Copy data to device

2. Process data on device

3. Copy data to host

This means that these data transfers are not part of the actual algorithm, and when

measuring peak processing power in gigaflops, I only measure the core algorithm execution

time. Meaning, the initial configuration and data transfer, combined with the releasing of

resources and retrieval of the output, is not being measured. By exclusively measuring the

core algorithm, it is possible to directly compare the peak processing power of the GPU with

that of the CPU.

As mentioned earlier, support for double-precision operations is not a common denominator

for the development and test machines. Algorithms are therefore implemented using single-

floating point precision, which is supported by all GPU devices, the CPU and GPU.NET.

5.3 Optimisation

The aim is to implement and optimise three linear algebra algorithms for the Cuda

architecture. The method for doing so is composed of the following steps:

1. Use an existing, well-known and well-documented algorithm, for implementation in

C/C++ for CPU processing.

2. Analyse and port the CPU implementation to Cuda C, while making simple
improvements that exploit the parallelised architecture.

3. Test the implementations.

4. Optimise based on the test results, and test again.

5.3.1 Strategy

The Nvidia paper on “Analysis Driven Optimization” [10] identifies four categories of what can

limit a kernel’s performance: memory throughput, instruction throughput, latency or a

combination of the above. There are some methods that can be helpful in finding the limiting

factors of a Cuda program. To determine if memory throughput is a limiting factor, the CGMA

of a kernel can help determine the theoretical maximum performance of a kernel. When it

comes to instructions, the Nvidia profiler can give valuable information about undesirable

Please refer to appendix E for a description of the Cuda profiler and CGMA.

There exist different optimisation techniques and methods, and some have already been

described in the chapters 3.1.4 and 3.1.5. In the following I will describe methods that form

the optimisation strategy.

Algorithms that process data rely on memory to perform well. Coalescing memory access is

important for all memory types, and in addition to this, shared memory should avoid bank

conflicts as much as possible.

A loop structure in a kernel adds extra control flow instructions, which will consume

arithmetic resources. The organisation of threads in several dimensions can enable the

unrolling of a loop by increasing thread granularity. The compiler already unrolls small loop

structures, but doing it manually can help making a kernel run faster.
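Manual unrolling can be illustrated with a dot product in plain C. This is a generic CPU sketch of the transformation, not one of the project's kernels, and the function name dot_unrolled4 is hypothetical. The same rewriting applies to a loop inside a Cuda C kernel.

```c
#include <assert.h>

/* Dot product with the loop body unrolled by four: one loop test and one
   increment per four multiply-adds instead of one per multiply-add. */
static float dot_unrolled4(const float *x, const float *y, int n)
{
    assert(n >= 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)   /* remainder when n is not a multiple of four */
        sum += x[i] * y[i];
    return sum;
}
```

The four partial sums are independent of each other, which also gives the hardware more instructions to overlap. Note that the summation order differs from the simple loop, so the rounded floating-point result can differ slightly.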

Whether the block size should be high or low depends on the kernel, but it should, where
possible, be a multiple of the warp size (currently 32) to avoid empty threads.

Hiding latency of slow instructions can be achieved by reorganising the kernel, exploiting data

prefetching or making sure that the kernel Cuda occupancy is high. Notice that a high

occupancy is not equal to high performance [6][7].

Vasily Volkov has shown that a kernel can increase performance by computing several
results per thread instead of a single one. He has also shown that, using
this method, high performance can be achieved with a low occupancy.

Another technique for increasing performance focuses on using fast memory, such as
registers or shared memory. Updating an implementation so that it divides data into smaller
pieces, called tiles, that fit into caches or shared memory can be very effective. The cost of
copying data is amortised because the kernel reuses the cached data.
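The tiling idea can be sketched on the CPU in plain C, where the cache plays the role that shared memory plays on the GPU. This is an illustration under assumptions, not the project's kernel: the tile edge TILE of 16 and the function name matmul_tiled are hypothetical, and matrices are stored row-major.

```c
#include <assert.h>

#define TILE 16   /* tile edge; hypothetical value, to be tuned to the fast memory */

static int min_int(int a, int b) { return a < b ? a : b; }

/* C (m x n) = A (m x k) * B (k x n), row-major. The three outer loops walk
   over tiles, the three inner loops work inside one tile, so each small
   piece of A and B is reused many times while it is in fast memory. */
static void matmul_tiled(const float *a, const float *b, float *c,
                         int m, int k, int n)
{
    assert(m >= 0 && k >= 0 && n >= 0);
    for (int i = 0; i < m * n; ++i)
        c[i] = 0.0f;

    for (int i0 = 0; i0 < m; i0 += TILE)
        for (int j0 = 0; j0 < n; j0 += TILE)
            for (int p0 = 0; p0 < k; p0 += TILE) {
                int i1 = min_int(i0 + TILE, m);
                int j1 = min_int(j0 + TILE, n);
                int p1 = min_int(p0 + TILE, k);
                for (int i = i0; i < i1; ++i)
                    for (int j = j0; j < j1; ++j) {
                        float sum = c[i * n + j];   /* partial result so far */
                        for (int p = p0; p < p1; ++p)
                            sum += a[i * k + p] * b[p * n + j];
                        c[i * n + j] = sum;
                    }
            }
}
```

In a Cuda kernel the corresponding step would be for the threads of a block to copy one tile of each input into shared memory, synchronise, and then compute from the shared copies.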

Some algorithms are not designed for parallel processing, and the performance they can
deliver on a parallel architecture is not very high. For some of these algorithms a block
version has been designed instead. A block algorithm usually has three advantages over a normal
algorithm. Firstly, it is able to solve much larger problems, by dividing the problem into
smaller pieces and solving them independently. Secondly, dividing a problem into smaller
pieces is the core of the tiling implementation strategy, so using a block algorithm
automatically enables tiling. Lastly, block algorithms sometimes rely on other linear
algebra operations that are highly parallel, for instance matrix-multiplication.

So to clarify, tiling refers to a specific implementation that exploits a faster memory type,
whereas block refers to the algorithm. For some algorithms block and tiling are almost the
same (e.g. matrix-multiplication), for others they are not.

6 Matrix-multiplication

A matrix is essentially a rectangular array of numbers, and is often denoted with a capital
letter. Here the matrix A with two rows and three columns is shown.

$$A = \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \end{pmatrix}$$

The numbers or values in a matrix are called elements, and are by convention denoted
$a_{r,c}$, where r is the row index and c is the column index. The row index indicates in which
row the element lies, while the column index indicates the column in which the element lies.

Matrix-multiplication, also called matrix product, is a linear algebra matrix operation

consisting of the operations multiplication and addition. Elements in the respective matrices

are aligned, multiplied, added, and then the grand sum is placed into the resulting matrix.

Matrix-multiplication of two matrices is only possible if their dimensions conform for
multiplication, meaning that the number of columns of the first matrix must be equal to the
number of rows of the second. The resulting matrix will be an $m \times n$ matrix, where $m$ is
the number of rows of the first matrix and $n$ is the number of columns of the second.

Figure 5 - Matrix-multiplication process depicted

It should be further noted that this operation is not commutative, hence in general
$A \times B \neq B \times A$. There are special cases where matrix-multiplication actually is
commutative, but these are not described in any further detail, as they are outside the scope
of this report.

The naive process of matrix-multiplication is rather simple. The data size of two square
matrices is $2 \cdot n^2$ and the running time is $O(n^3)$, where n is the width and height of
the matrices. This shows that the running time grows faster than the data size.
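The growth can be made concrete with a short worked ratio. This is only a sketch: it assumes that the naive algorithm is counted as one multiplication and one addition per inner-loop step, i.e. $2n^3$ operations in total.

```latex
\[
\frac{\text{operations}}{\text{data elements}} = \frac{2n^3}{2n^2} = n
\]
```

Under that assumption, doubling n quadruples the data size but multiplies the amount of work by eight.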

The simple or naive implementation will be discussed and shown later, but there exist other
algorithms which are more efficient, for instance Strassen's algorithm or the
Coppersmith–Winograd algorithm [11]. However, these algorithms add complexity to the
implementation, and require extra attention to numerical stability issues.

The matrix-multiplication focus of this project is on optimisations for the GPU platform,
and not on the algorithm itself. For that reason, only the simple implementation will serve
as a base for analysis, implementation and testing, and not the other algorithms mentioned.
Optimisations applied to the simple algorithm will focus on capabilities and properties of the
GPU platform, and the essence of the original matrix-multiplication algorithm will be kept.

Furthermore, Cuda is the focus of this project, meaning the implementation and optimisations
will be focused on Cuda and the C/C++ implementations. GPU.NET will be used in the result
section as a perspective and for comparison.

In the following, the matrix named A will always reference the first matrix of the
matrix-multiplication process. The second matrix will be named B and the resulting matrix C,
like so:

$A * B = C$

6.1 Analysis

Parallel processing on a GPU platform is stream based, and supports the parallelisation of

data very well. The simple nature of the matrix-multiplication algorithm makes the

implementation, for processing on a GPU platform, straightforward. The fact that the

algorithm has a running time of $O(n^3)$ makes optimisations and performance gains easier to

test and time on different platforms, and with different data sizes.

6.1.1 The sequential algorithm

As mentioned earlier, the focus will be on the simple matrix-multiplication algorithm. The

sequential algorithm consists of the following steps:

    for (int i = 0; i < A.rowCount; ++i) {
        for (int j = 0; j < B.columnCount; ++j) {
            double sum = 0;
            for (int k = 0; k < A.columnCount; ++k) {
                double a = A[i][k];
                double b = B[k][j];
                sum += a * b;
            }
            C[i][j] = (float)sum;
        }
    }

This algorithm consists of three loops, and the inner loop has an addition and a multiplication
operation. This also shows that the running time is $O(n \cdot m \cdot p)$, where n is the
number of rows in A, m the number of columns in A and rows in B, and p the number of columns
in B. For square matrices where $n = m = p$ the running time is $O(n^3)$.

The inner loop computes the dot-product of a row vector of A and a column vector of B. The two
outer loops are responsible for iterating through the rows of A and columns of B, and their
mutual order does not influence the running time.
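As a self-contained illustration of the steps above, the following C++ sketch multiplies two matrices stored as flat row-major arrays. The function name `multiply_sequential` and the flat `std::vector` layout are assumptions made for this example; they are not the data structures used in the project's own code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiplies an n x m matrix A by an m x p matrix B (both row-major)
// and returns the n x p product C.
std::vector<float> multiply_sequential(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       std::size_t n, std::size_t m,
                                       std::size_t p) {
    assert(a.size() == n * m && b.size() == m * p);
    std::vector<float> c(n * p);
    for (std::size_t i = 0; i < n; ++i) {          // rows of A
        for (std::size_t j = 0; j < p; ++j) {      // columns of B
            double sum = 0.0;  // accumulate in double, store as float
            for (std::size_t k = 0; k < m; ++k) {  // dot-product
                sum += static_cast<double>(a[i * m + k]) * b[k * p + j];
            }
            c[i * p + j] = static_cast<float>(sum);
        }
    }
    return c;
}
```

The index expression `i * m + k` is the same row-major addressing that the kernels later in this chapter use on their flat arrays.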

6.1.2 Parallelism

The simple matrix-multiplication algorithm consists of three loops. One adjustment to induce

concurrency is to perform the outer loop in parallel; another to calculate each value of the

resulting matrix in parallel.

In any case, there is not just one single solution to making matrix-multiplication work in

parallel, but multiple. These different approaches will be discussed in the following.

The outer loop

A simple approach to make the outer loop work in parallel is to make threads handle each

row in A. This is possible as there are no synchronization issues to handle, but it means that

the total number of required threads will be equal to the number of rows in A.

This adjustment is doable; however, any performance gain or loss depends on the data size.
Consider the case where A is a column matrix and B a row matrix: this would mean many threads
that each do little work.

Resulting matrix values

Calculating the different values of the resulting matrix is another way of making the

algorithm work in parallel. By using this method the required number of threads will be equal

to the size of the resulting matrix C, which is:

$C_{size} = A_{rows} \cdot B_{columns}$

This shows that this approach is also prone to the problem where A is a column matrix and B
a row matrix. Again, many threads do little work. For simplicity, I will not handle this case
specifically, but will later present an algorithm that performs better with different data
sizes.
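To compare the two approaches numerically, the following C++ sketch computes the number of threads each one requires and the number of blocks needed to cover them. The helper names and the example block sizes are assumptions chosen for illustration only.

```cpp
#include <cassert>
#include <cstddef>

// Threads needed when one thread handles one row of A (outer-loop approach).
std::size_t threads_outer_loop(std::size_t a_rows) { return a_rows; }

// Threads needed when one thread computes one element of C.
std::size_t threads_per_element(std::size_t a_rows, std::size_t b_cols) {
    return a_rows * b_cols;
}

// Number of blocks needed to cover `count` threads with `block_size` threads
// per block, rounded up so that no element is left without a thread.
std::size_t blocks_needed(std::size_t count, std::size_t block_size) {
    assert(block_size > 0);
    return (count + block_size - 1) / block_size;
}
```

For a 200-row A and an 800-column B, the outer-loop approach needs 200 threads, while the per-element approach needs 160,000. The rounding up in `blocks_needed` is the reason kernels need a boundary check before writing a result.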

6.2 Simple algorithm

Initially a simple and straightforward matrix-multiplication algorithm is implemented and

tested. Based on the results and the properties and capabilities of the Cuda architecture,

different optimisation techniques are implemented, tested and then evaluated.

6.2.1 The algorithm

The sequential algorithm described in the analysis is used to calculate the resulting matrix on

the CPU. This reference matrix will be used as a comparison to the GPU calculated result

matrix, and as such function as the correctness test.
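A correctness test of this kind can be sketched as an element-wise comparison that reports the largest absolute deviation. The function name `max_abs_difference` is hypothetical, and both matrices are assumed to be stored as flat arrays of equal length.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the largest absolute element-wise difference between two result
// matrices stored as flat arrays of equal length.
float max_abs_difference(const std::vector<float>& reference,
                         const std::vector<float>& candidate) {
    assert(reference.size() == candidate.size());
    float max_diff = 0.0f;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const float diff = std::fabs(reference[i] - candidate[i]);
        if (diff > max_diff) max_diff = diff;
    }
    return max_diff;
}
```

A result of 0.0 means the two matrices agree exactly; any other value can be compared against a tolerance that suits the precision of the platform.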

As described in the analysis, the algorithm can be made parallel by making the outer loop or

the calculation of the resulting matrix values processed in parallel. These methods are

straightforward but have some drawbacks, meaning the performance is dependent on the

data size. The solution is to find the best balance between threads and their workload, and

the means is segmentation of data in blocks.

The kernel to the first solution looks like this:

    __global__ void matrixMultiplicationSimple(matrix *a, matrix *b, matrix *c) {

        // Thread ID
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        double sum, av, bv;

        if (tid < a->height) {

            for (unsigned int j = 0; j < b->width; ++j) {

                sum = 0;

                for (unsigned int k = 0; k < a->width; ++k) {
                    av = a->n[tid * a->width + k];
                    bv = b->n[k * b->width + j];
                    sum += av * bv;
                }
                c->n[tid * b->width + j] = (float)sum;
            }
        }
    }

What is important to note is that the kernel has a loop within a loop, making the running time
of a single thread $O(n \cdot m)$, where n is the number of columns in B and m the number of
columns in A.

The kernel that calculates the values of the resulting matrix does not have this double loop.

    __global__ void matrixMultiplicationRM(matrix *a, matrix *b, matrix *c) {

        // Matrix C coordinates
        int c_column = blockIdx.x * blockDim.x + threadIdx.x;
        int c_row = blockIdx.y * blockDim.y + threadIdx.y;
        double sum, av, bv;

        // Make sure not to exceed C boundaries
        if (c_row < c->height && c_column < c->width) {

            sum = 0;

            for (int i = 0; i < a->width; i++) {

                av = a->n[c_row * a->width + i];
                bv = b->n[i * b->width + c_column];
                sum += av * bv;
            }

            c->n[c_row * b->width + c_column] = (float)sum;
        }
    }

The two kernels differ in that each thread of the first does more work than each thread of the
last. In the last kernel a loop structure has been unrolled into threads. Firstly, this makes
the threads more fine-grained, which gives a higher parallel potential. Secondly, the control
flow instructions of the loop are no longer performed, releasing more resources for the kernel.

6.2.2 Test and results

Testing is performed on different platforms. To dedicate as much of the GPU's performance as
possible to the algorithm, rather than to rendering the results, the tests are executed using a
console program.

Figure 6 – The output of the console testing program

Matrix-multiplication is tested where matrix A is 200 × 400 and B is 400 × 800. The resulting
matrix C will be 200 × 800.

Outer loop

The Cuda occupancy calculator showed that the simple outer loop implementation would
have a multiprocessor occupancy of 83%. The outer loop implementation was tested with the use
of the matrix structure as a parameter for the kernel.

The kernel running time is, for Cuda, the direct GPU calculation time; hence it is exclusive of
the time to perform data transfer. The Cuda kernel running time does not say anything about
whether it is feasible to perform matrix-multiplication on the GPU compared to the CPU, but
only whether the GPU calculates the resulting matrix faster than the CPU. By just
measuring the calculation time, it is easy to directly compare the computation time on the
GPU with that of the CPU.

Platform   Kernel running time   Operations/ms   Gigaflops/sec   CPU running time
Cuda       349.15 ms             366,609         0.37            250.90 ms

Table 4 - Test result of outer loops matrix-multiplication on platform #1

Some interesting results have emerged from this initial test. First of all, the result
calculated on the GPU differs from that calculated on the CPU. The maximum difference in the
resulting values on platform #1 is 0.010742. The GPU architecture was initially designed for
increased speed at the cost of precision, which partly explains the difference in the resulting
values. Newer architectures implement an instruction set with increased precision; this kernel
has been tested on an architecture with compute capability v2.0, where the difference between
GPU and CPU results was 0.0.

Another surprise from the test is that the GPU calculation only reaches a peak performance of
0.3666 gigaflops, and takes longer than on the CPU. What causes such bad performance, one
might ask?
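The throughput figures can be related to the running time with a small calculation. The sketch below assumes that the operation count is two floating-point operations (one multiplication, one addition) per inner-loop step; with that assumption it approximately reproduces the reported figures for the 200 × 400 by 400 × 800 test. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Floating-point operations of the naive algorithm for an n x m matrix times
// an m x p matrix: one multiplication and one addition per inner-loop step.
double operation_count(std::size_t n, std::size_t m, std::size_t p) {
    return 2.0 * static_cast<double>(n) * static_cast<double>(m) *
           static_cast<double>(p);
}

// Throughput in gigaflops for a given running time in milliseconds.
double gigaflops(double operations, double milliseconds) {
    assert(milliseconds > 0.0);
    return operations / (milliseconds * 1.0e6);  // ops/ms divided by 1e6
}
```

For the test dimensions this gives 128,000,000 operations, and a running time of 349.15 ms then corresponds to roughly 0.37 gigaflops.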

Outer loop without structure

Global reads are expensive, and coalesced memory reads should be achieved to optimise
performance. Structures can, if not aligned, produce non-coalesced memory access.

It would be interesting to test whether using structures as parameters for the kernel has any
impact on performance. So, minor adjustments to the code were made to eliminate structures
as parameters, and the updated kernel function definition now looks like this:

    __global__ void
    matrixMultiplicationSimpleNS(float *a, float *b, float *c, int aheight,
                                 int awidth, int bwidth)

The adjustment meant that the occupancy of the multiprocessor rose from 83% to 100%, an
increase which suggested that better performance could be expected. But the kernel
calculation running time was measured at 357.88 ms, a running time that is approximately the
same as when using structure parameters.

Even though a higher occupancy suggested increased performance, no performance gain was
achieved. This result is consistent with Vasily Volkov's findings in “Better performance at
lower Occupancy” [6].

So the first optimisation will look into reorganising the threads, to test whether Cuda
performs better when more threads each perform less work than when few threads each perform
more. But before doing so, testing and comparing performance with GPU.NET will indicate whether
the GPU.NET API performs on the same level as using Cuda directly.

GPU.NET

The GPU.NET platform has different limitations; one is that kernel methods only support
primitive types as parameters. For that reason, testing matrix-multiplication with the Matrix
structure is not possible. So the outer loop matrix-multiplication method was implemented
without structures, and can therefore be directly compared to the similar Cuda
implementation.

It is furthermore not possible, on the GPU.NET platform, to measure solely the direct
calculation time; the kernel running time includes data transfer and JIT compilation.

Platform   Kernel running time   Operations/ms   Gigaflops/sec   CPU running time
Cuda       357.88 ms             357,661         0.36            381.23 ms
GPU.NET    409.00 ms*            312,958         0.31            387.00 ms

Table 5 - Test result of outer loops matrix-multiplication no structure on platform #1

* Inclusive data transfer and JIT compilation

Taking data transfer and JIT compilation into account, the performance of GPU.NET and Cuda
is almost identical. It will be interesting to see whether this is also the case when different
optimisation techniques and features are exploited.

6.3 Optimisation

The Cuda architecture has different characteristics and capabilities, and to optimise

performance different features and techniques can be utilised. First the unrolling of a loop

will be tried, after which the tiling and other methods from the strategy will be applied.

6.3.1 Unroll loop with threads

The simple implementation was designed so that fewer threads each performed more work. The
first optimisation will try to uncover whether more threads each performing less work, by
unrolling a loop, is actually better. This can be achieved by modifying the algorithm so that
each thread calculates one value in the resulting matrix.

The modified kernel is shown in the following:

    __global__
    void matrixMultiplicationRM(matrix *a, matrix *b, matrix *c) {

        // Matrix C coordinates
        int c_column = blockIdx.x * blockDim.x + threadIdx.x;
        int c_row = blockIdx.y * blockDim.y + threadIdx.y;
        double sum, av, bv;

        // Make sure not to exceed C boundaries
        if (c_row < c->height && c_column < c->width) {

            sum = 0;

            for (int i = 0; i < a->width; i++) {

                av = a->n[c_row * a->width + i];
                bv = b->n[i * b->width + c_column];
                sum += av * bv;
            }

            c->n[c_row * b->width + c_column] = (float)sum;
        }
    }

Test and result

A similar modification was made to the kernel in GPU.NET and the running times are shown in

the following table.

Platform            Kernel running time   Operations/ms   Gigaflops/sec   CPU running time
Cuda (outer loop)   349.15 ms             366,609         0.37            250.90 ms
Cuda                53.04 ms              2,413,297       2.41            253.14 ms
GPU.NET             143.00 ms*            895,104         0.90            372.00 ms

Table 6 - Test result of matrix-multiplication for resulting matrix on platform #1

The running time of both the GPU.NET and the Cuda implementation has decreased. GPU.NET
performs about 2.86 times better than the GPU.NET outer loop implementation; however, this
performance increase is disappointing compared to the gain achieved when using Cuda directly.
Looking solely at the Cuda implementation, the performance is almost 6.6 times better than
with the outer loop approach.

So even though the GPU.NET performance, compared to Cuda, is disappointing, the
performance gains are significant, and indicate that more threads doing less work, by
unrolling a loop, is indeed a reasonable approach.

When programming for parallel execution on the CPU platform, it is important to use the
correct number of threads to solve the problem optimally, and as spawning a thread is
expensive, not too many threads should be used. The results from these tests show that this
rule of thumb does not apply to the GPU platform. The overhead of creating a thread in
Cuda is far less than that of creating a thread on the CPU.

Another factor is the amount of global memory reads. Reading from global memory is

expensive and very slow [5]. When looking at the code of the kernel, it shows that the inner

loop makes two global memory reads and one multiplication and addition operation. This

equals a CGMA ratio of approximately 1.0.

On platform #1 the global memory has a peak bandwidth of 16.6 GB/sec. With 4 bytes in each
single-precision floating-point value, at most 4.15 (16.6/4) giga single-precision values can
be delivered per second. With a CGMA ratio of 1.0, this kernel cannot execute at more
than 4.15 gigaflops [4].
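The bound used above can be written as a one-line calculation. The following C++ sketch only restates the arithmetic from the text; the function name `memory_bound_gigaflops` is an assumption for this example.

```cpp
#include <cassert>

// Upper bound on throughput (gigaflops) for a memory-bound kernel:
// values per second delivered by global memory times the CGMA ratio.
double memory_bound_gigaflops(double bandwidth_gb_per_s,
                              double bytes_per_value, double cgma_ratio) {
    assert(bytes_per_value > 0.0);
    return bandwidth_gb_per_s / bytes_per_value * cgma_ratio;
}
```

With a bandwidth of 16.6 GB/sec, 4 bytes per value and a CGMA ratio of 1.0 this yields 4.15 gigaflops. Raising the CGMA ratio is the only term in this bound that a kernel author controls.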

So in short, this kernel is memory-bound and to optimise a memory bound kernel, the focus

should be on global memory access. One method for doing this is to

6.3.2 Tiling v1

One of the fastest memory types on a Cuda device is the shared memory. The shared
memory is on-chip and very fast, but also limited. Shared memory is accessible and shared by
all threads in a block, so it is an obvious choice to use it as a block cache.

One strategy for reducing global memory traffic is to partition data into tiles that fit into
the shared memory: load a tile of data from device memory into shared memory, process the
data, and lastly write the results back to device memory [2]. One important criterion is that
the computation on each tile must be able to be performed independently of the others.
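The load, process and write-back phases can be illustrated on the CPU. The following C++ sketch is not the Cuda kernel; it is a host-side emulation in which each pair of tile indices plays the role of one thread block and two small buffers stand in for shared memory. The function name, the flat row-major layout and the handling of partial edge tiles are assumptions of this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Tiled multiplication of an n x m matrix A by an m x p matrix B (row-major).
// `tile_a` and `tile_b` stand in for the shared memory of one thread block.
std::vector<float> multiply_tiled(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  std::size_t n, std::size_t m, std::size_t p,
                                  std::size_t tile) {
    assert(tile > 0 && a.size() == n * m && b.size() == m * p);
    std::vector<float> c(n * p, 0.0f);
    std::vector<float> tile_a(tile * tile), tile_b(tile * tile);
    for (std::size_t i0 = 0; i0 < n; i0 += tile) {
        for (std::size_t j0 = 0; j0 < p; j0 += tile) {
            const std::size_t rows = std::min(tile, n - i0);
            const std::size_t cols = std::min(tile, p - j0);
            for (std::size_t k0 = 0; k0 < m; k0 += tile) {  // one phase
                const std::size_t depth = std::min(tile, m - k0);
                // Load phase: copy one tile of A and one tile of B.
                for (std::size_t i = 0; i < rows; ++i)
                    for (std::size_t k = 0; k < depth; ++k)
                        tile_a[i * tile + k] = a[(i0 + i) * m + k0 + k];
                for (std::size_t k = 0; k < depth; ++k)
                    for (std::size_t j = 0; j < cols; ++j)
                        tile_b[k * tile + j] = b[(k0 + k) * p + j0 + j];
                // Compute phase: partial dot-products from the cached tiles.
                for (std::size_t i = 0; i < rows; ++i)
                    for (std::size_t j = 0; j < cols; ++j) {
                        float sum = c[(i0 + i) * p + j0 + j];
                        for (std::size_t k = 0; k < depth; ++k)
                            sum += tile_a[i * tile + k] * tile_b[k * tile + j];
                        c[(i0 + i) * p + j0 + j] = sum;
                    }
            }
        }
    }
    return c;
}
```

Each loaded value of A and B is reused up to `tile` times from the small buffers, which is the source of the reduced global memory traffic.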

This requires the threads in a block to be synchronised, as shown in the following kernel code:

    __global__
    void matrixMultiplicationTILINGns(float* a, float* b, float* c, int aWidth,
                                      int bWidth) {

        // blockDim.x = TILING_DIM (the latter is a defined constant and hence faster)
        // blockDim.y = TILING_DIM (the latter is a defined constant and hence faster)

        int bx = blockIdx.x;
        int by = blockIdx.y;
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        // Matrix C coordinates
        int c_column = bx * TILING_DIM + tx;
        int c_row = by * TILING_DIM + ty;

        // Calculate the first index of the row in a, and the last for the
        // current thread
        int aIdxBegin = c_row * aWidth + tx;
        int aIdxEnd = aIdxBegin + aWidth - 1;
        int bIdxBegin = c_column + bWidth * ty;

        float sum = 0.0;

        for (int aIdx = aIdxBegin,
             bIdx = bIdxBegin; aIdx <= aIdxEnd;) {

            __shared__ float ac[TILING_DIM][TILING_DIM]; // A cache
            __shared__ float bc[TILING_DIM][TILING_DIM]; // B cache

            // Load values to cache
            ac[tx][ty] = a[aIdx];
            bc[tx][ty] = b[bIdx];

            // Synchronise to make sure all threads in block have saved
            // values to the shared memory for this phase
            __syncthreads();

            for (int i = 0; i < TILING_DIM; ++i) {
                sum += ac[i][ty] * bc[tx][i];
            }

            // Synchronise to make sure that computations are done
            __syncthreads();

            aIdx += TILING_DIM;          // Add index by phase dimension
            bIdx += TILING_DIM * bWidth; // Add index by phase dimension and
                                         // b width
        }

        // Insert dot-product in resulting matrix
        c[c_row * bWidth + c_column] = sum;
    }

Looking at the Cuda kernel, the CGMA is calculated by:

$BDS \cdot (1_{add} + 1_{multiply}) : 2\,GMR$

where BDS stands for block dimension size and is the axis size of one dimension in the block,
and GMR stands for global memory read and is the number of accesses to global memory.

The block dimension was set to 20, giving a CGMA of 20, and with a giga single-precision data

per second of 4.15 for platform #1, the immediate kernel peak performance is calculated to

83 gigaflops. This is an impressive theoretical peak performance of this kernel when taking
into account that the global memory has a bandwidth of 16.6 GB/sec. However, the GPU of

platform #1 has a peak performance of 34.38 gigaflops and the kernel is limited by that, so

the theoretical maximum performance of this kernel on this platform is 34.38 gigaflops.

Showing that the kernel on platform #1 is limited to 34.38 gigaflops proves that the tile-

strategy kernel algorithm is no longer memory-bound, but actually arithmetic-bound. This is

theoretically true, but the picture might be different when the test has been performed and

the result is ready.
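The reasoning above can be summarised as a short worked calculation. The symbols $P_{\text{memory}}$, $P_{\text{arithmetic}}$ and $P_{\text{max}}$ are introduced here only for illustration and are not used elsewhere in the report.

```latex
\[
P_{\text{memory}} = \text{CGMA} \times \frac{16.6\ \text{GB/s}}{4\ \text{B}}
                  = 20 \times 4.15 = 83\ \text{gigaflops}
\]
\[
P_{\text{max}} = \min(P_{\text{memory}},\ P_{\text{arithmetic}})
               = \min(83,\ 34.38) = 34.38\ \text{gigaflops}
\]
```

The smaller of the two bounds decides which resource limits the kernel, which is why the tiled kernel is theoretically arithmetic-bound on platform #1.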

Test and result

To test whether using the matrix structure as parameter had any impact on performance in

Cuda, I implemented two tiled kernels. The first one used the matrix structure as parameter

and the second used pointer arrays.

GPU.NET supports shared memory as well, but only arrays with one dimension are
supported. So the shared memory indexes in the source code were adjusted to lay out the
arrays sequentially. Besides this minor adjustment, the Cuda kernel was easy to port to
GPU.NET, and the test results are shown in the following table.

Platform           Kernel running time   Operations/ms   Gigaflops/sec   CPU running time
Cuda               39.52 ms              3,238,368       3.24            257.92 ms
Cuda (no struct)   35.96 ms              3,559,595       3.56            254.48 ms
GPU.NET            76.00 ms*             1,684,210       1.68            357.00 ms

Table 7 - Test result of matrix-multiplication for tiling strategy on platform #1

The block has two dimensions with a length of 20, which gives 20 ∗ 20 = 400 threads/block.
The normal recommendation is to make the block size divisible by the warp size, currently 32.
However, these tests were also performed with a block size of 16 ∗ 16 = 256 threads/block, but
the peak performance for the Cuda kernel was then about 1.6 gigaflops. The conclusion is that
it sometimes pays off not to follow the recommendation. In this case, the overhead of filling
the warp with empty padded threads is insignificant compared to the larger amount of coalesced
memory reads that the larger block size results in.
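How many empty padded threads a given block size produces can be computed directly. The following C++ sketch uses hypothetical helper names and takes the warp size as a parameter.

```cpp
#include <cassert>
#include <cstddef>

// Number of warps needed to hold `threads_per_block` threads.
std::size_t warps_per_block(std::size_t threads_per_block,
                            std::size_t warp_size) {
    assert(warp_size > 0);
    return (threads_per_block + warp_size - 1) / warp_size;
}

// Lanes in the last warp that are left without a thread.
std::size_t idle_lanes(std::size_t threads_per_block, std::size_t warp_size) {
    return warps_per_block(threads_per_block, warp_size) * warp_size -
           threads_per_block;
}
```

A block of 400 threads occupies 13 warps and leaves 16 lanes idle, which is about 3.8% of the 416 lanes, while a block of 256 threads fills 8 warps exactly.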

By using the tile strategy and shared memory, it was possible to perform matrix-

multiplication even faster than the resulting matrix algorithm. Cuda was about 1.47 times

faster and GPU.NET was faster by a factor of 1.88.

Looking at the peak performance, the result indicates that even though the algorithms are
the same, GPU.NET has no chance of performing on the same level as when Cuda is
used directly. This is most likely due to the fact that GPU.NET JIT compiles the device code.

The Cuda kernel has a peak performance of 3.56 gigaflops, which is remarkably slow
compared to the theoretical 34.38 gigaflops. This gives an actual performance that is just
10.35% of what is theoretically possible. And even though the kernel algorithm is theoretically
arithmetic-bound, this significantly slower performance suggests that the limiting factors can
in fact be both arithmetic and memory.

6.3.3 Tiling v2 with latency hiding

When a kernel does not reach the expected performance level, a good place to start is to

analyse the kernel with focus on coalesced memory access. Note that even though shared

memory is fast, access should still be optimised with regards to coalescing access. The

following kernel is the result of such an analysis:

    1.  __global__ void matrixMultiplicationTILINGns_v2(float* a, float* b, float* c, int aWidth, int bWidth) {
    2.
    3.      // Declare cache
    4.      __shared__ float ac[TILING_DIM][TILING_DIM];
    5.      __shared__ float bc[TILING_DIM][TILING_DIM];
    6.
    7.      // Calculate Matrix C coordinates
    8.      const int c_column = blockIdx.x * TILING_DIM + threadIdx.x;
    9.      const int c_row = blockIdx.y * TILING_DIM + threadIdx.y;
    10.     const int cidx = c_row * bWidth + c_column;
    11.
    12.     // Calculate the first index of the row in a, and the last for the
    13.     // current thread
    14.     const int aIdxBegin = c_row * aWidth + threadIdx.x;
    15.     const int aIdxEnd = aIdxBegin + aWidth - 1;
    16.
    17.     float sum = 0.0;
    18.
    19.     for (int aIdx = aIdxBegin,
    20.          bIdx = c_column + bWidth * threadIdx.y; aIdx <= aIdxEnd;) {
    21.         ac[threadIdx.y][threadIdx.x] = a[aIdx];
    22.         aIdx += TILING_DIM; // Increase a index
    23.
    24.         bc[threadIdx.y][threadIdx.x] = b[bIdx];
    25.         bIdx += TILING_DIM*bWidth; // Increase b index
    26.
    27.         // Synchronise to make sure all threads in block have saved
    28.         // values to the shared memory for this phase
    29.         __syncthreads();
    30.
    31.         // Compute dot-product
    32.         for (int i=0; i < TILING_DIM; ++i) {
    33.             sum += ac[threadIdx.y][i]*bc[i][threadIdx.x];
    34.         }
    35.
    36.         // Synchronise to make sure that computations are done
    37.         __syncthreads();
    38.     }
    39.
    40.     // Insert dot-product in resulting matrix
    41.     c[cidx] = sum;
    42. }

To optimise this v2 kernel, four register variables have been removed to increase the
number of possible active warps. Furthermore, the shared memory access in lines 21 and
24 has been optimised for coalescing. This now yields a peak performance of 5.028 gigaflops
on platform #1.

Nvidia provides a code example implementation of matrix-multiplication using the
tile-strategy. This kernel has a peak performance of 4.91 gigaflops, so with these minor
updates it is possible to get a kernel to perform better.

6.3.4 Tiling v3 with prefetching

Access to global memory is limited by bandwidth and high latency. The high latency of access
to global memory makes kernel execution halt until the data is served. Organising the code
such that other instructions can be executed while a thread is waiting for data is called
latency hiding.

One way of hiding the latency of memory access is to exploit data prefetching. This basically
works by prefetching the next data while the current data is being processed. The following
pseudo code shows the steps of using prefetching for matrix-multiplication:

    __global__ void mm_prefetch(float* a, float* b, float* c, int aWidth, int bWidth) {

        // Load data from global memory to register variables

        while (data to process) {

            // Insert register values to shared memory

            // Synchronise threads

            // Prefetch next values to register

            // Calculate the dot-product

            // Synchronise to make sure that computations are done
        }

        // Insert dot-product in resulting matrix
    }
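The ordering of these steps can be illustrated with a host-side C++ sketch for a single dot-product. This is an emulation only: on the CPU the steps run one after another, so no latency is actually hidden. The buffers `next_a` and `next_b` stand in for register variables, `cur_a` and `cur_b` for shared memory, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot-product computed in phases of `tile` elements. The values for the next
// phase are fetched before the current phase is processed.
float dot_with_prefetch(const std::vector<float>& a,
                        const std::vector<float>& b, std::size_t tile) {
    assert(tile > 0 && a.size() == b.size() && a.size() % tile == 0);
    const std::size_t phases = a.size() / tile;
    if (phases == 0) return 0.0f;
    const std::ptrdiff_t first = static_cast<std::ptrdiff_t>(tile);
    // Load the first phase from "global memory" into the "registers".
    std::vector<float> next_a(a.begin(), a.begin() + first);
    std::vector<float> next_b(b.begin(), b.begin() + first);
    std::vector<float> cur_a(tile), cur_b(tile);
    float sum = 0.0f;
    for (std::size_t phase = 0; phase < phases; ++phase) {
        // Insert register values into the "shared memory" buffers.
        cur_a = next_a;
        cur_b = next_b;
        // Prefetch the next phase, if there is one, before computing.
        if (phase + 1 < phases) {
            const std::size_t begin = (phase + 1) * tile;
            for (std::size_t i = 0; i < tile; ++i) {
                next_a[i] = a[begin + i];
                next_b[i] = b[begin + i];
            }
        }
        // Compute the partial dot-product from the current buffers.
        for (std::size_t i = 0; i < tile; ++i) sum += cur_a[i] * cur_b[i];
    }
    return sum;
}
```

On a GPU the prefetch loads and the computation of the current phase are independent instructions, so the hardware can overlap them; that overlap is what the reordering is meant to enable.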

By exploiting prefetching the peak performance of matrix-multiplication increased for system

#1 and #4. 5.68 gigaflops was reached for system #1, while system #3 surprisingly incurred a

performance loss of 2.2 gigaflops.

6.3.5 Tiling v4 and v5 with more output per thread

Vasily Volkov from UC Berkeley has looked deeper into “Better performance at Lower

Occupancy” [6]. He has shown that it is possible to increase performance of matrix-

multiplication by making a thread compute several dot-products, instead of one.

Volkov points out two things in particular. By making a thread compute more output, the thread
can reuse values in registers for several computations. Registers are faster than both
shared and global memory, and by grouping the work of several threads together, it is possible
to use the registers for data sharing. The hypothesis is that for memory-heavy kernels, it is
beneficial to let fewer threads carry a higher workload, to exploit the fast registers for data
sharing.

In the matrix-multiplication kernel this has another advantage: as the dot-product is being
calculated between a single column in B and several rows in A, the number of memory reads of
this specific column in B is reduced by $\frac{3}{4}$. Consider the following inner loop, which
computes the dot-product for the different rows. The reads of bc[i][threadIdx.x] (marked in
green in the original listing) are identical and are hence automatically cached.

    // Calculate the dot-product
    for (int i = 0; i < TILING_DIM; ++i) {
        sum[0] += ac[threadIdx.y][i] * bc[i][threadIdx.x];
        sum[1] += ac[threadIdx.y+5][i] * bc[i][threadIdx.x];
        sum[2] += ac[threadIdx.y+10][i] * bc[i][threadIdx.x];
        sum[3] += ac[threadIdx.y+15][i] * bc[i][threadIdx.x];
    }
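The saving in reads from the B tile can be counted with a host-side C++ emulation of one thread. The sketch below is an assumption-laden illustration: the tiles are flat row-major arrays, the local variable `b` stands in for a register, and the type `FourOutputs` and the function `four_outputs` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Result of one emulated thread: four partial dot-products and the number of
// reads made from the B tile.
struct FourOutputs {
    std::array<float, 4> sum;
    std::size_t b_reads;
};

// Emulates the inner loop for one thread at (ty, tx). `ac` and `bc` are
// row-major tiles with `dim` columns; `stride` is the row distance between
// the four outputs (5 in the listing above).
FourOutputs four_outputs(const std::vector<float>& ac,
                         const std::vector<float>& bc, std::size_t dim,
                         std::size_t ty, std::size_t tx, std::size_t stride) {
    assert(ty + 3 * stride < dim && tx < dim);
    assert(ac.size() == dim * dim && bc.size() == dim * dim);
    FourOutputs out{};  // sums and read counter start at zero
    for (std::size_t i = 0; i < dim; ++i) {
        const float b = bc[i * dim + tx];  // read once, reused four times
        ++out.b_reads;
        for (std::size_t r = 0; r < 4; ++r) {
            out.sum[r] += ac[(ty + r * stride) * dim + i] * b;
        }
    }
    return out;
}
```

For a tile of dimension `dim`, the emulated thread reads the B tile `dim` times instead of `4 * dim` times, which corresponds to the reduction by three quarters described above.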

I made two tests, one where a single thread computes two dot-products, and another where a

thread computes four dot-products. The results of platform #1 were a bit surprising and have

therefore also been tested on platform #4.

Kernel/System                  #1 (CC v1.1)                      #3 (CC v1.3)       #4 (CC v2.0)
Tiling v3 (prefetching)        5.68 gigaflops (occupancy: 54%)   113.52 gigaflops   34.57 gigaflops (occupancy: 81%)
Tiling v4 (2 outputs/thread)   4.45 gigaflops (occupancy: 58%)   96.51 gigaflops    36.89 gigaflops (occupancy: 88%)
Tiling v5 (4 outputs/thread)   5.89 gigaflops (occupancy: 67%)   113.83 gigaflops   37.13 gigaflops (occupancy: 67%)

Table 8 - Tiling with 2 and 4 outputs per thread comparison for different platforms

The performance expectations set by the Volkov slides say that several outputs/thread could
perform better than 1 output/thread. The results from platform #4 are the only ones following
this pattern, ending with a peak performance of 37.13 gigaflops. This performance is
achieved with an occupancy level of 67%, confirming that higher performance can be
achieved with a lower occupancy rate.

But it is evident that using Volkov’s suggestion is not a guarantee of success. It is
interesting to see that the performance of the 2 outputs/thread method actually was lower
than expected for systems #1 and #3, and that the 4 outputs/thread method on these systems
yielded results similar to those obtained when not using Volkov’s suggestions at all.

The major difference between systems #1 and #3 on one side and #4 on the other is the
compute capability level of the GPU devices. Systems #1 and #3 belong to the 1.x generation,
whereas #4 belongs to the 2.x generation. This might be the reason that the algorithm performs
relatively differently on different architectures. In the next section I will look into
whether the CC level has any influence on the performance of an algorithm.

6.3.6 Cuda compute capability

The features, capabilities and instruction set of a GPU are specified by its compute
capability. The first Nvidia graphic cards were released with a compute capability level from
v1.0 to v1.1, and the newest cards today are released with a CC level of 2.1. A higher level
defines a superset of the features of those at a lower level [4], and a higher level also
indicates a newer generation GPU.

The newer CC levels both add new features and improve on existing ones. One of the important
factors in performance is the performance impact of non-coalescing memory access. This has
been improved in many ways for CC 2.0 and 2.1, which means that the developer does not
need to use as much energy on designing an algorithm with memory coalescing in mind, making
it easier to port existing algorithms.

The following figure shows the peak performance in gigaflops of different kernels targeting
different CC levels. The results can also be found in appendix F.

Figure 7 - Performance of kernels executed for different CC levels on platform #4
(series: CC 1.1, CC 1.3 and CC 2.0, in gigaflops)

It was expected that the kernels would be best performing on higher CC levels, and the rule

of thumb is to target the highest CC level possible when compiling kernels, to take advantage

of the newest optimisations and features. This is true except for the last kernel, where CC

levels 1.1 and 1.3 have a peak performance of 39.47 gigaflops, which is faster than the 37.06

gigaflops that CC 2.0 delivers.

Kernels executed on CC level 2.0 are generally between 5% and 10% faster than on lower levels, except for the last case described above, which is 6.11% slower. I have not been able to find any explanation why the rule of thumb is not valid for this specific case, but even though this exception breaks the rule, I still recommend compiling for the highest CC level possible.

6.4 Evaluation

The keys to a well-performing kernel are memory coalescing and latency hiding. The first step should be to structure the algorithm so that as much memory coalescing as possible is achieved. Tiling has proven a very good strategy for increasing memory coalescing. A memory access limitation is due to the DRAM memory design, so memory coalescing optimisation techniques should also be applied to shared memory.

Using matrix structures as parameters was initially thought of as a good abstraction, but tests showed that performance losses were the result. It is therefore recommended to use primitive variables or pointers in the kernel function definitions.

Data prefetching combined with operations reordering in the kernel was used to hide latency, and gave diverging results. On three systems a performance gain of about 1 gigaflop was achieved, but on the Tesla C1060 device, a loss of 2.2 gigaflops was the result. Maybe the hardware of the Tesla card is already optimised with regard to hiding this type of latency, so trying to handle this in the kernel counteracts these hardware optimisations. In any case, this shows that data prefetching should be applied to a kernel with care.

Volkov presented ideas that latency hiding and the exploitation of registers can achieve a higher peak performance. This proved to be true, and by making each thread do more work it was possible to make a kernel perform even better. This optimisation technique furthermore showed that the occupancy rate should not necessarily be relied on for an optimisation strategy.

Control flow is another factor to keep in mind when designing a kernel. The Cuda architecture is a SIMT architecture, so if a kernel has a complex control flow, several runs by the warp scheduler can be necessary to complete the warp. This can unnecessarily result in a longer computing time.

The same operations processed on a GPU and a CPU do not always yield the same numerical result. This is especially true for older architectures that have a compute capability level between v1.0 and v1.3. Newer architectures support the IEEE 754 standard better and in many cases yield a result with better precision. In the tests of matrix-multiplication the maximum difference in values was 0.013. To minimise this inaccuracy, special and slower intrinsic functions can be used in the kernel. These functions have fewer deviations from IEEE 754 and force the compiler not to use FMAD instructions, which are fast but imprecise multiply-add instructions.

7 LU decomposition

LU-decomposition, also called LU-factorisation, is a linear algebra matrix decomposition of a matrix A in the form:

$A = LU$

where L and U are lower and upper triangular matrices [17]. If the LU factorisation is known, it can be used to solve matrix-vector linear equations in two steps:

$Ax = b \qquad \text{Step 1: } Ly = b \qquad \text{Step 2: } Ux = y$

Decomposing matrix A to a product of L and U can be achieved by using an enhanced version of Gauss elimination. Only a square matrix can be decomposed, and the L and U matrices are of the same size, as shown here:

$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ l_{21} & 1 & 0 \\ l_{31} & l_{32} & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix}$

Note that L is a unit lower triangular matrix, meaning the diagonal elements $l_{ii}$ of L are all one. Stewart provides a sequential algorithm that builds on Gauss elimination, which also creates a unit lower triangular matrix.
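The two-step solve described above can be illustrated with a minimal C sketch. It assumes that L and U are stored packed in one row-major array, with the unit diagonal of L left implicit; the function name lu_solve and this storage layout are assumptions made for the example and are not taken from the implementation tested in this report.

```c
#include <assert.h>

/* Solve A x = b given the packed factors of A = LU (row-major, n x n).
   L has an implicit unit diagonal; U occupies the diagonal and above.
   On entry x holds b, on exit it holds the solution.
   Hypothetical helper, for illustration only. */
void lu_solve(const float *lu, int n, float *x)
{
    assert(lu != 0 && x != 0 && n > 0);

    /* Step 1: forward substitution, L y = b (y overwrites x). */
    for (int i = 1; i < n; i++)
        for (int c = 0; c < i; c++)
            x[i] -= lu[i * n + c] * x[c];

    /* Step 2: back substitution, U x = y. */
    for (int i = n - 1; i >= 0; i--) {
        for (int c = i + 1; c < n; c++)
            x[i] -= lu[i * n + c] * x[c];
        x[i] /= lu[i * n + i];
    }
}
```

Once the factorisation is known, each additional right-hand side b only costs the two substitutions, which is why the factorisation itself is the part worth accelerating.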

7.1 Analysis

Stewart designs an LU-decomposition algorithm and provides the code that overwrites the

matrix A with its LU factorization [17].

Pivoting or row interchanges may be required for two reasons, firstly to ensure the existence

of a LU factorisation and secondly to increase the numerical stability of the Gaussian

elimination algorithm [18]. For simplicity algorithms without pivoting will initially be

analysed, but when testing, only algorithms that implement partial pivoting will be used. This

makes sure that the performance and correctness of the individual algorithms can be

compared.

The sequential algorithm, without pivoting, consists of three loops that overwrite the existing matrix m. The matrix is vectorised and the index of an element is found by $r \cdot w + c$, where r is the row index, w the width of the matrix and c the column index.

    // Core algorithm for LU decomposition (no pivoting).
    // The matrix m is overwritten with the L and U factors.
    for (int k = 0; k < n; k++)
    {
        for (int i = k + 1; i < n; i++)
        {
            // Compute scale factor Rik and store it in the lower triangle
            float Rik = (m[i * mWidth + k] /= m[k * mWidth + k]);

            // Subtract row k elements from row i elements with the
            // Rik scale factor
            for (int c = k + 1; c < n; c++)
            {
                m[i * mWidth + c] -= Rik * m[k * mWidth + c];
            }
        }
    }

The code shown above is without pivoting for simplicity reasons. The sequential implementation is simple, consisting of three loops using the operations divide, multiplication and addition. The data size of the square matrix is $n^2$, and the algorithm requires $\frac{n^3}{3}$ additions and multiplications and $\frac{n^2}{2}$ divisions to complete. Ignoring the lower order term, the running time is $O(n^3)$, where n is both the width and height of the matrix [19]. This shows that the running time increases more than the data size.
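The operation counts can be checked by instrumenting the three loops. The following C sketch only counts operations and does not touch a matrix; the struct and function names are hypothetical. The exact totals are $\frac{n(n-1)}{2}$ divisions and $\frac{(n-1)n(2n-1)}{6}$ multiply-subtract pairs, which approach $\frac{n^2}{2}$ and $\frac{n^3}{3}$ for large n.

```c
#include <assert.h>

/* Operation counters for the sequential LU loops (hypothetical type). */
typedef struct {
    long divisions;  /* one per scale factor Rik */
    long mult_subs;  /* one multiply and one subtract per updated element */
} lu_op_count;

/* Walk the same loop nest as the sequential algorithm and count operations. */
lu_op_count count_lu_ops(int n)
{
    assert(n > 0);
    lu_op_count ops = {0, 0};
    for (int k = 0; k < n; k++) {
        for (int i = k + 1; i < n; i++) {
            ops.divisions++;
            for (int c = k + 1; c < n; c++)
                ops.mult_subs++;
        }
    }
    return ops;
}
```

For n = 4 this gives 6 divisions and 14 multiply-subtract pairs.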

7.1.2 Parallelism

Matrix-multiplication and the characteristics of its data access meant that inducing concurrency and exploiting data-parallelism was straightforward. The same cannot be said about LU-decomposition, in which data dependencies between the loops make parallelising more complicated.

The sequential algorithm consists of three loops, and the operations performed by the two inner loops result in an asymptotic running time as shown:

$T_{\text{sequential}} = \frac{n^2}{2} + \frac{n^3}{3} = O(n^3)$

where n is the width or height of the matrix. The outer loop cannot be directly performed in parallel, as the $k^{th}$ iteration depends on the results from the $(k-1)^{th}, (k-2)^{th}, \dots, 1^{st}$ iterations. Parallelism is not impossible, but the order of the outer loop is vital. Taking the outer loop into account, the operations part of the algorithm can be written as:

$\sum_{k=1}^{n} \left( \frac{n}{2} + \frac{n^2}{3} \right)$

This equation says that for each step $\frac{n}{2} + \frac{n^2}{3}$ operations are performed, and these operations can be performed in parallel. The required number of operations when taking parallelisation into account is:

$\frac{n \left( \frac{n}{2} + \frac{n^2}{3} \right)}{p}$

where p is the number of processes. The optimal execution performs all possible tasks in parallel; for this algorithm the optimal number of processes is $p = n - 1$. But for simplicity, let us set it to $p = n$:

$T_{\text{parallel}} = \frac{n \left( \frac{n}{2} + \frac{n^2}{3} \right)}{n} = \frac{n}{2} + \frac{n^2}{3} = O(n^2)$

So this algorithm does have a parallel potential. I will now look into whether this potential can be utilised by the Cuda architecture.
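The effect of the number of processes can be made concrete with a small counting model in C. This is a sketch under simplifying assumptions that are mine and not measured: every operation costs one time unit, each process handles whole rows, communication is free, and the outer loop stays sequential. The function name is hypothetical.

```c
#include <assert.h>

/* Idealised time units for LU without pivoting on p processes.
   In step k each of the remaining rows needs 1 division and
   `rows` multiply-subtract pairs; rows are dealt out to the
   processes in rounds. Hypothetical model, not a measurement. */
long parallel_lu_time_units(int n, int p)
{
    assert(n > 0 && p > 0);
    long total = 0;
    for (int k = 0; k < n - 1; k++) {
        long rows = n - 1 - k;             /* rows below the pivot row */
        long per_row = 1 + rows;           /* work for one row */
        long rounds = (rows + p - 1) / p;  /* ceil(rows / p) */
        total += rounds * per_row;
    }
    return total;
}
```

With p = 1 the model reproduces the sequential count (20 units for n = 4). With p = n - 1 every step needs a single round, so the total drops to the order of $n^2$, and adding further processes gives no extra benefit in this row-wise model.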

Multipliers and row operations

The parts of the algorithm that exhibit little or no parallelism should be processed on the CPU. This means the outer loop is processed by the CPU and the inner parts that exhibit parallelism will be processed by the GPU. So the LU-decomposition implementation should be processed by the CPU and GPU in cooperation. The interesting point will be to see whether it is possible, when also considering data transfer, to make the GPU assist the CPU and accelerate the execution of the LU-decomposition algorithm.

To make the initial implementation simple, I have divided the calculation of multipliers and the row operations into separate tasks. For each step of the outer loop, the multipliers are calculated for the current column (line 2 to 4), after which the multipliers are used to calculate the elements in the upper triangular matrix (line 5 to 9). These are individual tasks that can be performed in parallel, as shown in the following pseudo code.

    1.  For k from 1 To n-1
    2.      For i in { k+1, ..., n }
    3.          LU[i][k] = LU[i][k] / LU[k][k]
    4.      End
    5.      For j in { k+1, ..., n }
    6.          For i from k+1 To n
    7.              LU[i][j] = LU[i][j] - LU[i][k] * LU[k][j]
    8.          End
    9.      End
    10. End
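For reference, the pseudo code above can be written as a minimal sequential C sketch with zero-based indices and a row-major packed matrix. The function name lu_two_phase is hypothetical; the two phases correspond to the two tasks that are later mapped to separate kernels.

```c
#include <assert.h>

/* LU without pivoting, split into the two phases of the pseudo code.
   The n x n row-major matrix lu is overwritten with L (below the
   diagonal, unit diagonal implicit) and U (diagonal and above). */
void lu_two_phase(float *lu, int n)
{
    assert(lu != 0 && n > 0);
    for (int k = 0; k < n - 1; k++) {
        /* Phase 1 (pseudo code lines 2 to 4): multipliers of column k.
           The iterations over i are independent of each other. */
        for (int i = k + 1; i < n; i++)
            lu[i * n + k] /= lu[k * n + k];

        /* Phase 2 (pseudo code lines 5 to 9): update the trailing
           sub-matrix. The iterations over i and j are independent. */
        for (int j = k + 1; j < n; j++)
            for (int i = k + 1; i < n; i++)
                lu[i * n + j] -= lu[i * n + k] * lu[k * n + j];
    }
}
```

Phase 2 of step k must not start before phase 1 of step k has finished, which is the ordering constraint that the host loop has to enforce.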

This algorithm is not the only method for creating the LU-factorisation. There are other algorithms that structure the operations in the outer loop differently, and the main difference is their memory access patterns. The performance of different memory access patterns may vary depending on the memory types used in the algorithm; another factor is whether the tasks are fine-grained or coarse-grained. For now I recognise the existence of other algorithms, but use the one described for the simple implementation.

A simple version of the LU-decomposition algorithm was implemented and tested. The optimisation steps uncovered from the matrix-multiplication implementations on Cuda will be used to optimise the LU-decomposition implementation. Performance and correctness tests will be performed and compared with different CPU algorithms, both with and without pivoting.

7.2.1 The algorithm

The sequential algorithm described in the analysis is used to calculate the LU matrices on the CPU, and serves as a comparison for the GPU computed result. This algorithm does not allow the same level of parallelism as matrix-multiplication, but parts of the algorithm can be parallelised. The following sample shows the code that performs the outer loop on the host and makes calls to kernels processed by the device. Line 7-9 shows an optional call to a device pivoting kernel with a running time of $O(n)$. This composition ensures that the order of the outer loop is maintained, and the remaining tasks are performed in parallel.

    1.  for (int k = 0; k < (int)a->width; k++) {
    2.
    3.      // setup execution parameters, for (int i = k + 1; i < n; i++)
    4.      int threads = a->width - k;
    5.      int gridX = (threads + THREADS_PER_BLOCK-1) / THREADS_PER_BLOCK;
    6.
    7.      if (pivot) {
    8.          lud_simple_pivot<<< 1, 1 >>>( d_lu, lu->width, lu->height, k);
    9.      }
    10.
    11.     // Calculate scale factors for column k
    12.     lud_simple_calc_scale_factor<<< gridX, THREADS_PER_BLOCK >>>( d_lu, lu->width, lu->height, k);
    13.
    14.     // Calculate new column values with scale factor
    15.     lud_simple_compute_row<<< gridX, THREADS_PER_BLOCK >>>( d_lu, lu->width, lu->height, k);
    16.
    17. }
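The launch arithmetic of the listing above can be checked in plain C. The helper names below are hypothetical. The first function is the rounded-up integer division used for gridX, and the second counts kernel launches, assuming one launch per kernel call in every iteration of the outer loop.

```c
#include <assert.h>

/* Number of thread blocks needed so that blocks * threads_per_block
   covers `threads`, i.e. ceil(threads / threads_per_block). */
int blocks_for(int threads, int threads_per_block)
{
    assert(threads >= 0 && threads_per_block > 0);
    return (threads + threads_per_block - 1) / threads_per_block;
}

/* Total kernel launches for an n x n matrix: two kernels per step,
   plus the optional pivoting kernel. */
long kernel_launches(int n, int pivot)
{
    assert(n > 0);
    return (long)n * (pivot ? 3 : 2);
}
```

Because the last block is usually only partly filled, the kernels need a bounds check on the row index before touching memory.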

The function call in line 12, calculates the multipliers of a given column on the device. Line

15 performs the row operations with the multipliers. The kernels being called and their logic

are shown here:

    __global__ void lud_simple_calc_scale_factor(float *lu, int luWidth, int luHeight, int k) {

        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        int i = k + 1 + tid;

        if (i < luHeight)
        {
            // Calculate rik scale factor and insert to Lower triangle
            lu[i * luWidth + k] /= lu[k * luWidth + k];
        }
    }

    __global__ void lud_simple_compute_row(float *lu, int luWidth, int luHeight, int k) {

        // Id of the row
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        int i = k + 1 + tid;

        if (i < luHeight) {

            // Load rik scale factor, can be cached in shared memory
            float rik = lu[i * luWidth + k];

            // Subtract row k elements from row i elements with the Rik
            // scale factor
            for (int c = k + 1; c < luWidth; c++)
            {
                lu[i * luWidth + c] -= rik * lu[k * luWidth + c];
            }
        }
    }

Tests were performed with the data sizes 400, 2000, 4000, 6000 and 10,000, where the data size is both the width and height of the matrix being decomposed. The element values are randomly generated, which means that, although it is unlikely, an LU-factorisation may not exist. Pivoting, meaning row interchanges, is applied both to ensure numerical stability and to ensure that an LU-factorisation does exist.
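As an illustration of row interchanges, the following C sketch adds partial pivoting to the sequential algorithm: in each step the row with the largest absolute value in column k is swapped into the pivot position. The function name, the perm array and the assertion on a zero pivot are assumptions made for this example; the device pivoting kernels used in the tests are organised differently.

```c
#include <assert.h>

static float abs_f(float v) { return v < 0.0f ? -v : v; }

/* LU with partial pivoting on a row-major n x n matrix.
   perm[i] records which original row ended up in row i.
   Hypothetical helper, for illustration only. */
void lu_partial_pivot(float *lu, int n, int *perm)
{
    assert(lu != 0 && perm != 0 && n > 0);
    for (int i = 0; i < n; i++)
        perm[i] = i;

    for (int k = 0; k < n; k++) {
        /* Find the row with the largest absolute value in column k. */
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (abs_f(lu[i * n + k]) > abs_f(lu[p * n + k]))
                p = i;
        assert(lu[p * n + k] != 0.0f); /* matrix is singular otherwise */

        /* Swap rows k and p, including the multipliers already stored. */
        if (p != k) {
            for (int c = 0; c < n; c++) {
                float t = lu[k * n + c];
                lu[k * n + c] = lu[p * n + c];
                lu[p * n + c] = t;
            }
            int t = perm[k]; perm[k] = perm[p]; perm[p] = t;
        }

        /* Eliminate below the pivot as in the algorithm without pivoting. */
        for (int i = k + 1; i < n; i++) {
            float rik = (lu[i * n + k] /= lu[k * n + k]);
            for (int c = k + 1; c < n; c++)
                lu[i * n + c] -= rik * lu[k * n + c];
        }
    }
}
```

A matrix such as [[0, 1], [2, 4]] has no LU-factorisation without a row interchange because the first pivot is zero, but it is handled by this variant.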

Multipliers and row operations

The test was performed on the four different platforms, as shown in the following graph.

Figure 8 – Performance of simple LU-decomposition on different platforms.

The graph indicates that the peak performance, measured in gigaflops, depends on the data size. The performance increases for matrices of increasing sizes up to about 2000 × 2000, and after that the performance levels off for all platforms. The peak GPU performance of the fastest platform was about 0.5 gigaflops, which is slow compared to a peak CPU performance of 2.44 gigaflops.

[Figure 8: x-axis: data size, 0 to 10,000; series: Platform #1, Platform #2, Platform #3, Platform #4]

Kernel invocation overhead

If pivoting is used, then three kernels are invoked for each step k. For a 10,000 × 10,000 matrix this equals 30,000 kernel invocations. Naturally, kernel invocations will incur overhead, but how much will 30,000 invocations influence the total result? Vasily Volkov et al. have measured the kernel launch overhead for various systems and GPUs [20]. For synchronised kernel invocations the times measured were between 10-14 µs; for asynchronous kernel invocations the timings were 3-7 µs. The following table shows the kernel invocation overhead as a ratio of the fastest running times on system #3 and #4.

    System                          #3              #4
    Fastest result (n=10,000)       15,736.55 ms    32,784.64 ms
    Asynchronous (low=900 ms)       5.72%           2.75%
    Asynchronous (high=2,100 ms)    13.34%          6.41%
    Synchronous (low=3,000 ms)      19.06%          9.15%
    Synchronous (high=4,200 ms)     26.69%          12.81%

Table 9 - Kernel invocation overhead ratio of total running time

This table shows that kernel invocations should not be disregarded when implementing an

algorithm, because their contribution to the total running time can be relatively high. This

obviously depends on the algorithm, but for this LU-decomposition implementation, the

contribution is as high as 26.69%.

It is evident that asynchronous invocations are faster than synchronous ones, and hence represent a lower percentage of the total running time. So where possible, asynchronous functions should be used.

GPU.NET

The LU-decomposition implementation requires a matrix to be copied to device memory initially; several kernel calls then compute the result and update the data, before the matrix is copied back to the host. GPU.NET does not currently allow data on the device to be modified by multiple kernel calls, so testing LU-decomposition through GPU.NET would not be relevant, as data transfers would severely impact performance.

7.3 Block LU-decomposition

One of the first optimisation strategies, suggested by the matrix-multiplication chapter, was

to divide the problem into smaller pieces that fit into caching memory. Jack Dongarra et al.

have developed a block LU algorithm, called the right looking algorithm, that automatically

supports tiling. Please note that the term block in a block algorithm is not the same as the

blocks that are part of a Cuda grid, and used to define thread granularity. A Cuda block will

from here on be referred to as a thread block.

7.3.1 The block algorithm

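By partitioning an M × M matrix A, the factorisation LU may be partitioned as shown [22]:

$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix}$

The usual rules of matrix-multiplication hold for block matrices, so we can write:

1. $A_{11} = L_{11} U_{11}$
2. $A_{21} = L_{21} U_{11}$
3. $A_{12} = L_{11} U_{12}$
4. $A_{22} = L_{21} U_{12} + L_{22} U_{22}$

where $r \times r$ is the block size, $A_{11}$ is $r \times r$, $A_{12}$ is $r \times (M - r)$, $A_{21}$ is $(M - r) \times r$ and $A_{22}$ is $(M - r) \times (M - r)$. The first step is based on lemma 1 and 2: by performing a normal LU-decomposition on $A_{11}$ and $A_{21}$ combined, the results are $L_{11}$, $U_{11}$ and $L_{21}$, which are then known.

Step 2 uses lemma 3 and a triangular solve method, which results in the matrix $U_{12}$. In step 3, rearranging lemma 4 gives $L_{22} U_{22} = A_{22} - L_{21} U_{12} = A'_{22}$, which shows that $L_{22}$ and $U_{22}$ can be found by LU-decomposing $A'_{22}$. This can be achieved by using the above steps on $A'_{22}$.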

In $M/r$ steps the matrix A has been decomposed by using a block LU-decomposition, as depicted here. The white parts have already been solved.

Figure 9 – Matrix A being decomposed by block LU-decomposition in steps.

Figure 9 shows that the height and width of the matrix is M, k is the current step, and the block width is the dimension of the current block being processed (here the green sub-matrix). Step 1 solves the green and purple sub-matrices by regular LU-decomposition. In the second step, the lower triangular matrix of the block (L of the green block) is used to triangular solve the cyan sub-matrix. In the third step, the blue sub-matrix is found by regular matrix-multiplying the purple and cyan sub-matrices and subtracting the element values from the current elements in the blue sub-matrix. The steps are then continued for the remaining parts until the whole M × M matrix is processed.
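The three steps can be sketched sequentially in C before they are mapped to kernels. The sketch below omits pivoting and works in place on a row-major matrix; the function names and the handling of a last, smaller block are assumptions made for this example and not the tested implementation.

```c
#include <assert.h>

/* Step 1: unblocked LU of the panel made of rows k..n-1 and
   columns k..k+b-1, which yields L11, U11 and L21. */
static void lu_panel(float *a, int n, int k, int b)
{
    for (int j = k; j < k + b; j++)
        for (int i = j + 1; i < n; i++) {
            float m = (a[i * n + j] /= a[j * n + j]);
            for (int c = j + 1; c < k + b; c++)
                a[i * n + c] -= m * a[j * n + c];
        }
}

/* Right-looking block LU without pivoting, block size r. */
void block_lu(float *a, int n, int r)
{
    assert(a != 0 && n > 0 && r > 0);
    for (int k = 0; k < n; k += r) {
        int b = (n - k < r) ? n - k : r; /* last block may be smaller */

        lu_panel(a, n, k, b);

        /* Step 2: solve L11 * U12 = A12 by forward substitution.
           Every column to the right of the block is independent. */
        for (int col = k + b; col < n; col++)
            for (int i = k + 1; i < k + b; i++)
                for (int c = k; c < i; c++)
                    a[i * n + col] -= a[i * n + c] * a[c * n + col];

        /* Step 3: A22' = A22 - L21 * U12, a matrix-multiplication. */
        for (int i = k + b; i < n; i++)
            for (int col = k + b; col < n; col++)
                for (int c = k; c < k + b; c++)
                    a[i * n + col] -= a[i * n + c] * a[c * n + col];
    }
}
```

With r equal to n the routine degenerates to the unblocked algorithm, and for smaller r most of the arithmetic moves into step 3, which is the part that benefits from the tiled matrix-multiplication kernel.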

As this algorithm makes it possible to partition large matrices and solve smaller parts, and

therefore exploit shared memory, this algorithm will be implemented using Cuda and used for

testing.

7.3.2 Implementation

This algorithm, as shown above, consists of three steps, which the implementation must also follow. The first step, to LU-decompose the M × r panel, is covered by an optimised kernel of the simple algorithm (lud_block_scale). This part also includes the optimised pivoting kernels (lud_block_pivot, lud_block_pivot_L2 and lud_block_swap).

The second step requires a triangular solving kernel (lud_block_triangular_solve), and the last step is a regular matrix-multiplication kernel (lud_block_matrixMultiplication), which has already been implemented and optimised in chapter 6 from page 29.

Pivoting

In LU-decomposition, pivoting is performed for each column. Instead of the simple algorithm that had a running time proportional to n, the parallel nature of Cuda can be exploited to implement a reduction pivoting algorithm with a running time of $O(\log(n))$. This required two pivoting kernels; the first reduces the current column of the matrix and saves the result to a temporary pivoting array on the device. The second kernel does the same, but works on the temporary pivot array instead of the matrix.
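The reduction pattern can be emulated sequentially in C to show why only about $\log_2(n)$ rounds are needed. In the sketch below the iterations of the inner loop are independent of each other, so on the device each of them would be handled by its own thread. The function name, the fixed capacity and the padding with zeros are assumptions made for this example.

```c
#include <assert.h>

#define REDUCE_CAPACITY 1024 /* hypothetical upper bound for the sketch */

static float abs_val(float v) { return v < 0.0f ? -v : v; }

/* Return the index of the element with the largest absolute value,
   found by a tree reduction with halving strides. */
int argmax_abs_reduce(const float *col, int n)
{
    float val[REDUCE_CAPACITY];
    int idx[REDUCE_CAPACITY];
    assert(col != 0 && n > 0 && n <= REDUCE_CAPACITY);

    /* Round the working size up to a power of two and pad with
       entries that can never win the comparison. */
    int size = 1;
    while (size < n)
        size <<= 1;
    for (int t = 0; t < size; t++) {
        val[t] = (t < n) ? abs_val(col[t]) : 0.0f;
        idx[t] = (t < n) ? t : -1;
    }

    /* log2(size) rounds; each round halves the number of candidates. */
    for (int stride = size / 2; stride > 0; stride >>= 1)
        for (int t = 0; t < stride; t++) /* one thread per t on a GPU */
            if (val[t] < val[t + stride]) {
                val[t] = val[t + stride];
                idx[t] = idx[t + stride];
            }

    return idx[0];
}
```

For a column of 10,000 elements the sequential search needs 9,999 comparisons in a row, whereas the reduction needs 14 rounds when enough threads are available.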

The first kernel is shown here, and has already been optimised with focus on memory coalescing and a sort of tiling strategy. In line 20 and 21 the individual threads load a value from global memory to shared memory. This data is then processed from line 27 to 37, while the threads synchronise data access for consistency. In line 42, the first thread of each thread block in the grid writes the pivoting index to the temporary array.

    1.  __global__ void lud_block_pivot(int *out, float *a, int M, int k, int max)
    2.  {
    3.      extern __shared__ float shared[];
    4.      float* max_cache = (float*)shared;
    5.      int* idx_cache = (int*)&shared[blockDim.x];
    6.
    7.      unsigned int tx = threadIdx.x;
    8.      unsigned int i = blockIdx.x * blockDim.x + tx + k; // Get row index
    9.
    10.     unsigned int idx = i * M;
    11.
    12.     // Clear cache for threads that exceed max, as they should not
    13.     // influence the result
    14.     max_cache[tx] = 0;
    15.     idx_cache[tx] = -1;
    16.
    17.     if (i < M)
    18.     {
    19.         // Read value + set row index
    20.         max_cache[tx] = abs(a[idx + k]);
    21.         idx_cache[tx] = i;
    22.     }
    23.     // Sync outside the conditional, so that every thread reaches the barrier
    24.     __syncthreads();
    25.
    26.     // Do the actual pivot finding
    27.     for(unsigned int stride = blockDim.x/2; stride>0; stride>>=1)
    28.     {
    29.         if (tx < stride && (stride+tx+k) < M && max_cache[tx] < max_cache[tx + stride])
    30.         {
    31.             max_cache[tx] = max_cache[tx + stride]; // Update value
    32.             idx_cache[tx] = idx_cache[tx + stride]; // Update index
    33.         }
    34.
    35.         // Sync threads
    36.         __syncthreads();
    37.     }
    38.
    39.     // The first thread should write result from block to output
    40.     if (tx == 0 && i < M)
    41.     {
    42.         out[blockIdx.x] = idx_cache[0]; // Load index to output
    43.     }
    44. }

Swapping rows

If a pivoting row has been identified, the indices of the two rows are transferred to the device by calling a kernel for swapping rows. By swapping the rows on the device, a transfer of the matrix to and from the host is avoided. Several threads can be used to swap the rows, and by aligning the memory access correctly, the memory access is coalesced.

LU-factorisation

The algorithm described by Stewart [17] is parallelised by making threads process individual rows. To optimise the performance, the $k^{th}$ row is loaded to shared memory and accessed by all threads, as the grid size is 1 × 1. This means that the $k^{th}$ row is only loaded to shared memory once. With just one block, the disadvantage is that Cuda is not able to hide memory access latency by switching to other active blocks. I will later test whether this approach performs well, or whether another approach with several thread blocks performs better.

Triangular solving

Triangular solving for $U_{12}$ can be performed row-wise or column-wise. I have chosen column-wise, as the memory access of the threads in a block will be coalesced. This part is well suited for GPU processing, because each column can be processed independently.

    1.  __global__ void lud_block_triangular_solve(float *a, int M, int k, int LU_BlockDim)
    2.  {
    3.      extern __shared__ float y[];
    4.
    5.      int tx = threadIdx.x;
    6.      int tid = blockIdx.x * blockDim.x + tx;
    7.      int column = tid + k + LU_BlockDim;
    8.
    9.      if (column < M)
    10.     {
    11.         for (int r = 0; r < LU_BlockDim; r++) // For each row in block
    12.         {
    13.             float res = a[(r+k) * M + column];
    14.             for (int c = 0; c < r; c++) // 0<=c<r, so below diagonal
    15.                 res -= a[(r+k) * M + c + k] * y[tx * LU_BlockDim + c];
    16.             y[tx * LU_BlockDim + r] = res;
    17.         }
    18.
    19.         for (int r = 0; r < LU_BlockDim; r++)
    20.             a[(r+k) * M + column] = y[tx * LU_BlockDim + r];
    21.     }
    22. }

Each thread uses shared memory to calculate the resulting column values (line 3, 15, 16 and 20). The size of the thread blocks, and thereby the needed shared memory, is not known at compile time. But as shared memory can be dynamically allocated, this is not a problem. Shared memory is fast, but registers are even faster. If the thread block size was known at compile time, it could prove beneficial to use registers instead of shared memory.

Matrix-multiplication

The kernel for performing matrix-multiplication is based on tiling v3, which includes tiling,

pre-fetching and memory coalescing optimisations. For any details about this kernel please

turn to paragraph 6.3.4 on page 42.

The initial block algorithm was tested on the four different platforms and with data sizes ranging from 400 × 400 to 10,000 × 10,000 matrices. For comparison the same data sizes were tested on the CPU.

The test with the largest matrix was not performed on platform #1, due to memory limitations; neither was it performed on the CPU, as the running time would be too high. For comparison I have added a qualified projection to the graph, which shows that platform #3 had a peak performance of 14.37 gigaflops for a 10,000 × 10,000 matrix.

Figure 10 - Performance of block LU-decomposition v1 on different platforms.

The graph shows two important things. Firstly, the GPU architecture is in fact able to perform LU-decomposition faster than the CPU for larger matrices. The specific speed of the platform determines for which data sizes the GPU is faster, but looking at the graph, this happens somewhere between 1,000 and 3,000. Secondly, the peak performance of the algorithm is almost proportional to the data size for these tests. Obviously the peak performance cannot keep increasing proportionally to the data size; there must be an upper limit. But it makes sense that when n increases, even more operations can be performed in parallel.

Several of the tests were also performed by the CPU as a comparison. The average difference

was 0.104211 and the maximum and minimum differences were respectively 0.332855 and

0.0. The 0.0 differences were only achieved on platform #4 with a compute capability of 2.1.
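A comparison of this kind can be computed with a short C routine. The sketch assumes that the difference is taken as the absolute element-wise difference between the CPU and GPU results, which is my reading and is not stated explicitly; the type and function names are hypothetical.

```c
#include <assert.h>

/* Minimum, maximum and average absolute difference (hypothetical type). */
typedef struct {
    float min;
    float max;
    float avg;
} diff_stats;

/* Compare two result arrays of the same length element by element. */
diff_stats compare_results(const float *cpu, const float *gpu, int count)
{
    assert(cpu != 0 && gpu != 0 && count > 0);
    diff_stats s = {0.0f, 0.0f, 0.0f};
    double sum = 0.0; /* accumulate in double to limit rounding error */

    for (int i = 0; i < count; i++) {
        float d = cpu[i] - gpu[i];
        if (d < 0.0f)
            d = -d;
        if (i == 0 || d < s.min)
            s.min = d;
        if (i == 0 || d > s.max)
            s.max = d;
        sum += d;
    }
    s.avg = (float)(sum / count);
    return s;
}
```

An absolute difference depends on the magnitude of the matrix elements, so a relative measure may be more informative when the elements vary widely in size.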

Profiling

The Nvidia Compute Visual Profiler is a tool that allows profiling of a Cuda program. The GPU time summary plot indicates which part of an algorithm could be optimised with the most effect. The following figure shows how much computing time each kernel uses.

[Figure 10: x-axis: data size, 0 to 10,000; series: Platform #1, Platform #2, Platform #3, Platform #4, CPU]

Figure 11 - Computing time of each kernel in block LU-decomposition v1 on platform #4.

Almost 90% of the time is spent in the regular LU-decomposing kernel, so optimising this part

should have the best effect on the total running time.

7.3.4 Optimising round 1

Having determined that the kernel lud_block_scale has the best performance optimisation

potential, I will now analyse the source code to identify performance limiting factors.

The kernel is, for each $k^{th}$ iteration, called with one thread block, and the number of threads equals the LU block size, which normally is 20. This means that only 20 threads are running in parallel at any given time, but as the warp size for the G80 and GT200 architecture is 32, the active warp is padded with 12 empty threads that do not process any data.

    1.  __global__ void lud_block_scale(float *a, int M, int k)
    2.  {
    3.      extern __shared__ float ac[];
    4.
    5.      int aWidth = M;
    6.      int tx = threadIdx.x;
    7.      int end = min( blockDim.x, M-k );
    8.
    9.      ac[tx] = a[k * aWidth + k + tx]; // Load k row to shared memory, as it is used across threads
    10.
    11.     // Sync threads to make sure all other also have loaded values
    12.     __syncthreads();
    13.
    14.     for(int i = k+1 + tx; i < M; i+=blockDim.x) { // Foreach row
    15.
    16.         // Compute scale factor Rik, 1 operation=divide
    17.         float rik = (a[i * aWidth + k] /= ac[0]);
    18.
    19.         for (int c = 1; c < end; c++) // Foreach column value in row
    20.             a[i * aWidth + k + c] -= rik * ac[c];
    21.     }
    22. }

Another factor in this kernel is its dependency on global memory. All threads load a value from the $k^{th}$ row into shared memory in line 9. Both the read from global memory and the write to shared memory are coalesced, so this is good. The cached values are then used to calculate the upper and lower triangular matrices from line 14 to 20. These loops rely heavily on global memory access, and have only few operations to hide latency.

The core parts of the kernel are line 17 and 20. Consider line 17: a memory load and a write combined with a single divide operation give a CGMA of 0.5. Line 20 has a memory load and a write together with the operations subtraction and multiply, which gives a CGMA of 1.0. On platform #1 the global memory has a peak bandwidth of 16.6 GB/sec. The expected number of single-precision values per second is therefore 4.15 giga (16.6/4). Line 20 is the dominant part, but line 17 cannot be ignored, so taking the CGMA ratios of 1.0 and 0.5 into account, this kernel will execute at no more than between 2.075 and 4.15 gigaflops on platform #1 [4].
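The bound above is simple arithmetic and can be written as a small C helper. The function name is hypothetical, and the model assumes that performance is limited purely by global memory bandwidth.

```c
#include <assert.h>

/* Upper bound in gigaflops for a kernel limited by global memory:
   (values per second that memory can deliver) * (operations per
   memory access, the CGMA ratio). */
double memory_bound_gflops(double bandwidth_gb_per_s,
                           double bytes_per_value,
                           double cgma)
{
    assert(bandwidth_gb_per_s > 0.0 && bytes_per_value > 0.0 && cgma >= 0.0);
    return bandwidth_gb_per_s / bytes_per_value * cgma;
}
```

For platform #1, memory_bound_gflops(16.6, 4.0, 1.0) gives the upper figure of 4.15 gigaflops and memory_bound_gflops(16.6, 4.0, 0.5) gives the lower figure of 2.075 gigaflops.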

Hiding latency

The low CGMA suggests that this kernel is limited by memory, but in its current form it is possible to improve on latency hiding. One way is to assign more work to the streaming processors and let them continue working on another warp, while the first warp waits for data. The updated thread block size is 64, and the needed number of blocks would be (M − k)/64, where M is the height of the matrix and k the current iteration. The kernel looks like this:

    1.  __global__ void lud_block_scale_v2(float *a, int M, int k, int end)
    2.  {
    3.      extern __shared__ float ac[];
    4.
    5.      int aWidth = M;
    6.      int tid = blockIdx.x * blockDim.x + threadIdx.x;
    7.
    8.      // Load k row to shared memory, as it is used across threads
    9.      ac[threadIdx.x] = a[k * aWidth + k + threadIdx.x];
    10.
    11.     // Sync threads to make sure all other also have loaded values
    12.     __syncthreads();
    13.
    14.     int i = k+1 + tid; // Row index
    15.     if (i < M)
    16.     {
    17.         // Compute scale factor Rik, 1 operation=divide
    18.         float rik = (a[i * aWidth + k] /= ac[0]);
    19.
    20.         for (int c = 1; c < end-k; c++) // Foreach column value in row
    21.             a[i * aWidth + k + c] -= rik * ac[c];
    22.     }
    23. }

But this is not the only benefit from this update. A for loop is a control flow element that is often part of a kernel. When doing operation counting analysis of a kernel, the operations contributed by for loops are often overlooked. Consider line 20 in the kernel above: for every iteration the c++ operation and the c < end-k comparison are performed. Unrolling loops is another way of increasing the performance of a kernel, and this is exactly what has been achieved with this kernel, compared to the former version’s line 14.
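The loop overhead can also be reduced inside a single row update. The following C sketch contrasts a plain row update with a manually unrolled variant that performs four updates per loop test and increment. Both function names are hypothetical, and an optimising compiler may already perform this transformation on its own.

```c
#include <assert.h>

/* Plain row update: one comparison and one increment per element. */
void row_update(float *row, const float *pivot_row, float rik, int len)
{
    assert(row != 0 && pivot_row != 0 && len >= 0);
    for (int c = 0; c < len; c++)
        row[c] -= rik * pivot_row[c];
}

/* Unrolled by four: one comparison and one increment per four elements,
   followed by a remainder loop for the last len % 4 elements. */
void row_update_unrolled4(float *row, const float *pivot_row, float rik, int len)
{
    assert(row != 0 && pivot_row != 0 && len >= 0);
    int c = 0;
    for (; c + 3 < len; c += 4) {
        row[c]     -= rik * pivot_row[c];
        row[c + 1] -= rik * pivot_row[c + 1];
        row[c + 2] -= rik * pivot_row[c + 2];
        row[c + 3] -= rik * pivot_row[c + 3];
    }
    for (; c < len; c++)
        row[c] -= rik * pivot_row[c];
}
```

Both variants perform the same arithmetic in the same order per element, so they produce identical results; only the number of loop control operations differs.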

This updated block algorithm was tested again on the four different platforms and with data sizes ranging from 400 × 400 to 10,000 × 10,000 matrices. For comparison the same data sizes were also tested on the CPU.

Figure 12 - Performance of block LU-decomposition v2 on different platforms.

[Figure 12: x-axis: data size, 0 to 10,000; series: Platform #1, Platform #2, Platform #3, Platform #4, CPU]

This optimised algorithm is faster than the former. The peak performance of platform #3 is now 31.51 gigaflops, which is 2.19 times as fast. This also shows that even smaller matrices can benefit from being processed by the GPU architecture.

Several of the tests were also performed by the CPU as a comparison. The average difference

was 0.103360 and the maximum and minimum differences were respectively 0.314285 and

0.002296. The Cuda architectures with higher compute capability do not seem to have smaller

deviations from the CPU reference result.

Profiling

Focus should be on hiding latency, which can be achieved by increasing the number of active

blocks or by data pre-fetching. The following graph shows the updated computing time, when

the number of threads and blocks has been adjusted.

Figure 13 - Computing time of each kernel in block LU-decomposition v2 on platform #4.

This profiling result indicates that optimisation of the lud_block_scale and the lud_block_matrixMultiplication kernels could have the highest performance effect. So this is what I will look into as the next step.

7.3.6 Optimising round 2

The matrix-multiplication kernel used has already been well optimised, as it is based on the work and results from the matrix-multiplication analysis in chapter 6 (from page 29). But for simplicity reasons, this kernel was not optimised with Volkov's suggestion of several outputs per thread. This will be the next step to test.

The details of how this was implemented have already been well described, so please refer to the chapter about matrix-multiplication for any details about this optimisation.

The result of the optimised algorithm is shown below. The peak performance, for platform #3, is now 42.89 gigaflops for a 10,000 × 10,000 matrix, which is an increase of 1.32 times.

Figure 14 - Performance of block LU-decomposition v3 on different platforms.

[Figure 14: x-axis: data size; series: Platform #1, Platform #2, Platform #3, Platform #4, CPU]

7.3.7 Further optimisation

The algorithm consists of 6 kernels covering pivoting and the three steps in block LU-decomposition. Pivoting, regular LU-decomposition and matrix-multiplication have been optimised. The only kernel left in its original version is lud_block_triangular_solve, which will be one of two parts for an optimisation attempt. The other part is the optimised kernel lud_block_scale, which still accounts for the highest GPU time consumption.

Triangular solve

The first version of the triangular solve implementation was well balanced with regards to grid and thread block size, and even though, according to the profiling result above, an optimisation would only affect about 3% of the GPU running time, I have chosen to try and optimise this kernel further. Being fully aware that any optimisation would have a limited impact on peak performance, this would still be a good exercise and give insight into the analysis and improvement of a kernel.

The focus was on unrolling a loop (line 20 was performed in a separate loop in the former version) and on coalescing memory access (lines 17 and 19 are now coalesced). As expected, the optimisation did not yield any significant change in peak performance.

1.  __global__ void lud_block_triangular_solve_v2(float *a, int M, int k, int LU_BlockDim)
2.  {
3.      extern __shared__ float y[];
4.
5.      int tid = blockIdx.x * blockDim.x + threadIdx.x;
6.      int column = tid + k + LU_BlockDim;
7.
8.      if (column < M)
9.      {
10.         for (int r = 0; r < LU_BlockDim; r++) // For each row in block
11.         {
12.             int rkM = r+k*M;
13.             float res = a[rkM + column];
14.
15.             for (int c = 0; c < r; c++) // 0<=c<r, so below diagonal
16.                 res -= a[rkM + c + k] *
17.                        y[c * LU_BlockDim + threadIdx.x];
18.
19.             y[r * LU_BlockDim + threadIdx.x] = res;
20.             a[rkM + column] = res;
21.         }
22.     }
23. }

The result was limited as expected, but if I were to improve this kernel further, I would focus on testing whether Volkov's suggestion (1 thread = 2 outputs) would have any positive effect. Another improvement would focus on the global memory access in lines 13 and 16. The elements of the lower triangular matrix of the current block being processed (L of the orange sub-matrix) could be copied to shared memory, so that the reads in lines 13 and 16 would be served from the faster shared memory instead of global memory.

Figure 15 – Showing the sub-matrix part of the triangular solve method.

Regular LU-decomposition

Focusing on the kernel that accounts for the highest GPU consumption time makes sense if

this part of the algorithm can be further optimised. The former version of the kernel focused

on hiding latency by increasing the number of thread blocks. Hiding latency can also be

achieved by using data prefetching and by minimising the need for global memory access.

An attempt was made to improve the performance of the kernel lud_block_scale by using registers to hold indices that are computed several times, by using data prefetching and by applying the Volkov suggestion (1 thread = 2 outputs). Unfortunately, no performance gains were achieved compared to the former versions; in fact, the result was a minor performance loss.

Figure 16 – A 10.000 x 10.000 matrix LU-decomposed on platform #3 and #4.

These results show that applying an optimisation method can sometimes result in a performance loss. The reason is that these methods add restrictions to the kernel, which require extra boundary checks and thereby increase the flow control complexity and the number of operations.

Improvements should therefore be implemented with care and always be followed by testing, to determine whether the change actually yields better performance.

Correctness

Several tests with different data sizes were run on the CPU as a comparison for the GPU-computed results. The average deviation was 0.062963, and the maximum and minimum were 0.309113 and 0.0 respectively. The 0.0 was only achieved by platform #4, which has a Cuda compute capability of 2.1. The deviations increased proportionally to the data size, which makes good sense: any inaccuracy affects the result of every subsequent iteration, and the larger the matrix, the more iterations are needed.

7.3.8 Large matrices

The test results have so far indicated that the peak performance of the block LU-

decomposition was proportional to the data size, but there must be a maximum where the

architecture limits performance.

My curiosity drove me to find this limit, so I updated the program to support very large matrices with sizes from 400 × 400 up to 20,000 × 20,000. A matrix of this size requires about 1525 MB of both host and device memory, which only platform #3, with 4 GB, can provide.
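The memory requirement quoted above can be checked with a small calculation. The sketch below assumes a dense, square, single-precision matrix and counts 1 MB as 1024 × 1024 bytes; the function names are hypothetical and not taken from the project code.

```cpp
#include <cstdint>

// Bytes needed to store a dense n x n single-precision matrix.
// Assumes sizeof(float) == 4, which holds on the platforms discussed here.
constexpr std::uint64_t matrix_bytes(std::uint64_t n) {
    return n * n * sizeof(float);
}

// The same figure in MB, with 1 MB = 1024 * 1024 bytes (an assumption
// about the unit used in the text).
constexpr double matrix_megabytes(std::uint64_t n) {
    return static_cast<double>(matrix_bytes(n)) / (1024.0 * 1024.0);
}
```

For n = 20,000 this gives 1,600,000,000 bytes, which falls between 1525 and 1526 MB under the stated unit assumption.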

Figure 17 - Peak performance of LU-decomposition v3 on platform #3

The graph above shows the peak performance in gigaflops of different data sizes. From the

results I have added a qualified projection for matrices up to 35,000 × 35,000 in size. The result and projections show that the v3 algorithm should have a peak performance of about

51-52 gigaflops on system #3.

It is difficult to calculate the theoretical performance of the LU-decomposition block

implementation, because the computations are divided into 6 different kernels. Each kernel

has its own CGMA and its share in solving the full problem, but I will try and approximate.

The kernels with a CGMA between 0.5 and 1.0 take up 48% of the running time, and the matrix-multiplication kernel has a CGMA of 20 and makes up about 17%. The remaining kernels have a CGMA of about 1.0. These numbers have been retrieved from the Nvidia profiler result shown in Figure 13, and the approximated range is found by these two equations.

$$\mathrm{CGMA}_{approx,\,low} = 0.5 \cdot 48\% + 20.0 \cdot 17\% + 1.0 \cdot 35\% = 3.99$$

$$\mathrm{CGMA}_{approx,\,high} = 1.0 \cdot 48\% + 20.0 \cdot 17\% + 1.0 \cdot 35\% = 4.23 \approx 4.0$$

So the approximated CGMA is 4.0. The peak global memory bandwidth for platform #3 is 102.4 GB/sec, which corresponds to 25.6 giga single-precision values per second (102.4/4). The theoretical peak performance of this algorithm on this platform is therefore 102.4 gigaflops (25.6 × 4.0) for fully coalesced memory access. The actual peak is about half of this, namely 52 gigaflops.

Several factors influence this result. One is that not all memory loads and writes are coalesced; another is the extra instructions processed due to control flow complexity. The result nevertheless indicates that memory access is not the limiting factor on performance for these kernels, but that something else is.

7.4 Evaluation

The LU-decomposition algorithms have given valuable insight into some optimisation methods that work and some that do not.

Reducing the number of kernel invocations, or using asynchronous functions, can reduce the total running time. With 30,000 kernel calls, the total invocation time could be reduced from 3-4.2 seconds to 0.9-2.1 seconds. I do not think that reducing kernel calls should be a primary focus, but it is something that a developer should be aware of when implementing an algorithm for Cuda.

Basing the implementation on a block algorithm increased performance for two reasons. First, the problem is divided into pieces that can exploit faster memory types; second, matrix-multiplication is a highly parallel operation that had already been optimised for the Cuda architecture.

These tests also showed that when a kernel was invoked, using several thread blocks is better

than just using one. One reason could be that the warp scheduler can utilise multiple SMs for

solving the problem.

Unrolling a loop together with Volkov's suggestion for matrix-multiplication also helped increase performance. The latter is a bit surprising, because Volkov's suggestion actually led to a performance decrease for matrix-multiplication on system #3.

Other tests revealed that data prefetching and the tiling strategy did not actually increase

performance, but left it without any major change. This fact promotes the notion described

above, that memory is not the limiting factor.

The correctness tests also confirmed that instructions have a higher degree of precision on CC

v2.0 than on earlier versions.

If I were to optimise LU-decomposition even further, I would focus on arithmetic optimisations. This could be achieved by, among other things, unrolling loops, minimising control flow complexity and removing unnecessary synchronisation points.

8 QR decomposition

QR-decomposition, also known as QR-factorisation, is a decomposition of the $m \times n$ matrix $A$, with $m \geq n$, in the form:

$$A = QR$$

where $Q$ is an $m \times m$ orthogonal matrix and $R$ is an $m \times n$ upper triangular matrix. An orthogonal matrix satisfies

$$Q^T Q = I$$

which implies

$$Q^{-1} = Q^T$$

There are different methods for calculating the QR factorisation, which can be used to solve linear systems and least squares problems [23][24].

8.1 Analysis

The different methods for decomposing matrix A into a QR factorisation include Gram-

Schmidt, Householder reflections and Givens rotations.

The classic Gram-Schmidt process is considered to be subject to numerical instability. The modified Gram-Schmidt algorithm overcomes this numerical instability, but at the expense of adding extra operations [23][25]. For these reasons I will consider neither the classic nor the modified version.

An operation count analysis of Householder reflections and Givens rotations shows that Givens rotations require about 50% more operations than Householder transformations [26]. Besides that, Givens rotations rely heavily on sine and cosine instructions, which would be processed by the limited number of SFUs. I have therefore decided to base the QR-decomposition algorithm on Householder transformations.

There are also other advantages. Firstly, the parallelisation is similar to that of LU-decomposition, so I expect to draw on the parallel optimisation experiences from the LU-decomposition chapter [24]. Secondly, Householder QR can use a compressed data storage form, using the original matrix A and an additional array for the diagonal values of R [23]. Consider the matrix A in Figure 18: the nonzero parts of the Householder vectors are stored in A along with the upper triangular matrix R.

Figure 18 - Storage strategy for the compressed Householder QR-factorisation

The diagonal of R is stored in an extra vector. If the actual $Q$ or $Q^T$ is ever needed, it can be computed from this compressed representation [25].

The sequential algorithm consists of 6 loops that overwrite the existing matrix with the

Householder vectors and the upper triangular matrix R. The diagonal elements of R are stored

in the array d. The elements of the matrix are stored vectorised in the array qr. Pivoting can

be used to ensure numerical stability, but has been left out for simplicity reasons.

1.  // Core algorithm for QR Decomposition (Householder transformation)
2.  for (unsigned int k = 0; k < n; k++) // For each column
3.  {
4.      // Compute 2-norm of k-th column
5.      float sum = 0.0;
6.      for (int r = k; r < m; r++)
7.          sum += qr[r * n + k] * qr[r * n + k];
8.
9.      float nrm = sqrtf(sum);
10.
11.     if (nrm != 0.0)
12.     {
13.         // Compute the kth Householder vector.
14.         if (qr[k * n + k] < 0)
15.         {
16.             nrm = -nrm;
17.         }
18.         for (int i = k; i < m; i++)
19.         {
20.             qr[i * n + k] /= nrm;
21.         }
22.         qr[k * n + k] += 1.0;
23.
24.         // Apply transformation to remaining columns.
25.         for (int j = k + 1; j < n; j++)
26.         {
27.             float s = 0.0;
28.             for (int i = k; i < m; i++)
29.             {
30.                 s += qr[i * n + k] * qr[i * n + j];
31.             }
32.             s = (-s) / qr[k * n + k];
33.             for (int i = k; i < m; i++)
34.             {
35.                 qr[i * n + j] += s * qr[i * n + k];
36.             }
37.         }
38.     }
39.     d[k] = -nrm;
40. }

This implementation is based on the Householder QR factorisation algorithm, which in the central part has the following operation count per iteration [26]:

Dot products (Lines 7 and 30): $2(m - k)(n - k)$
Outer product (Lines 14-22): $(m - k)(n - k)$
Subtraction (Line 35): $(m - k)(n - k)$

Including the outer loop, the total running time is:

$$\sum_{k=1}^{n} 4(m - k)(n - k) \sim 2mn^2 - \frac{2n^3}{3}$$

This shows that the running time is $O(n^3)$, and that the running time grows faster than the data size.

Pivoting can be used to increase numerical stability, but for simplicity, this has not been

included in this implementation.
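As an illustration of the compressed storage form described earlier, the sketch below repeats the sequential factorisation in a self-contained function and then expands the stored Householder vectors into an explicit Q. The function `form_q` is an assumption about how the expansion can be done (back-accumulation of the reflectors, giving the m × n "thin" Q); it is not taken from the project code, and all function names are hypothetical.

```cpp
#include <cmath>
#include <vector>

// Householder QR as in the sequential listing. On return, qr (row-major,
// m rows, n columns) holds the Householder vectors on and below the diagonal
// and the strict upper triangle of R above it; d holds the diagonal of R.
void householder_qr(std::vector<float>& qr, std::vector<float>& d, int m, int n) {
    for (int k = 0; k < n; k++) {
        float sum = 0.0f;
        for (int r = k; r < m; r++) sum += qr[r * n + k] * qr[r * n + k];
        float nrm = std::sqrt(sum);
        if (nrm != 0.0f) {
            if (qr[k * n + k] < 0) nrm = -nrm;
            for (int i = k; i < m; i++) qr[i * n + k] /= nrm;
            qr[k * n + k] += 1.0f;
            for (int j = k + 1; j < n; j++) {
                float s = 0.0f;
                for (int i = k; i < m; i++) s += qr[i * n + k] * qr[i * n + j];
                s = -s / qr[k * n + k];
                for (int i = k; i < m; i++) qr[i * n + j] += s * qr[i * n + k];
            }
        }
        d[k] = -nrm;
    }
}

// Assumed expansion of the compressed form into the m x n thin Q: start from
// the first n columns of the identity and apply the reflectors in reverse order.
std::vector<float> form_q(const std::vector<float>& qr, int m, int n) {
    std::vector<float> q(m * n, 0.0f);
    for (int k = n - 1; k >= 0; k--) {
        q[k * n + k] = 1.0f;
        if (qr[k * n + k] == 0.0f) continue;  // zero column: no reflector stored
        for (int j = k; j < n; j++) {
            float s = 0.0f;
            for (int i = k; i < m; i++) s += qr[i * n + k] * q[i * n + j];
            s = -s / qr[k * n + k];
            for (int i = k; i < m; i++) q[i * n + j] += s * qr[i * n + k];
        }
    }
    return q;
}

// Multiplies Q by R (R read from qr and d) so the product can be compared with A.
std::vector<float> multiply_qr(const std::vector<float>& q, const std::vector<float>& qr,
                               const std::vector<float>& d, int m, int n) {
    std::vector<float> a(m * n, 0.0f);
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            for (int c = 0; c <= j; c++) {
                const float r = (c == j) ? d[c] : qr[c * n + j];
                a[i * n + j] += q[i * n + c] * r;
            }
    return a;
}
```

A simple sanity check is to factorise a small matrix, rebuild Q, and confirm that the product QR reproduces the original matrix to within single-precision rounding.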

8.1.2 Parallelism

QR-decomposition shares the structure of the outer loop with LU-decomposition: the order of the outer loop iterations is important and requires sequential processing. The algorithm can be divided into the following tasks:

// Tasks in the algorithm for QR Decomposition
for (int k = 0; k < n; k++) // For each column
{
    // Task 1: Compute 2-norm of k-th column

    // Task 2: Compute the kth Householder vector.

    // Task 3: Apply transformation to remaining columns.
}

Each of the tasks above can be performed with varying parallel degree. So there is a parallel

potential for this algorithm, and the details will be described later.

A simple version of QR-decomposition was implemented and tested. The goal is to port the

algorithm to the Cuda architecture as fast as possible. Later, when the implementation is

functional, I will look at how to increase performance.

8.2.1 The algorithm

The sequential algorithm shown in paragraph 8.1.1 was implemented using regular C++, to target the CPU architecture. This implementation was used in the correctness test.

The simple GPU-accelerated version was implemented based on the analysis in paragraph 8.1.2, using the three identified tasks. To make the implementation of the three tasks as easy as possible, the same procedure is used for all of them: each task is handled by a single thread block of 128 threads. The drawback is that when the problem size becomes smaller than the number of threads, which happens when $k > m - 128$ for tasks 1 and 2, a number of idle threads are generated. This only has an effect when the last rows and columns are being processed, and for large matrices this constitutes a relatively small part of the total running time.

Task 1 - Two-norm

The data size for each $k$-th step is $d = m - k$, which is the number of remaining rows. In the sequential implementation, the running time is proportional to $d$. In this version, one thread block with 128 threads processes any size $d$, meaning the running time is proportional to $d/128$.

Task 2 - Householder vector

Each element in the Householder vector can be calculated independently, and this task processes the remaining rows. The data size for each $k$-th step is $d = n - k$. The sequential implementation has a running time proportional to $d$, while this version, like task 1, has a running time proportional to $d/128$.

Task 3 – Transform columns

When tasks 1 and 2 have been performed, the rest of the matrix must be updated, which is the remaining columns and rows. Each column can be updated independently, and for each $k$-th step each of the 128 threads processes $(n - k)/128$ columns.

Tests were performed with data sizes 400, 2000, 4000 and 6000, where the data size is both the height and the width of the matrix being decomposed. Matrix elements are randomly generated.

All systems were used in the tests, and the CPU was used to calculate a reference result. The CPU results indicate that the CPU gets slower as the matrix gets bigger, which makes good sense, as the CPU is not able to exploit its caches for large problem sizes. The performance of the GPU improves as the matrix size increases up to about 2000, after which it levels out. System #3 is the only architecture that performs better than the CPU, so this specific implementation does not yield an acceptable performance.

The maximum difference from the CPU reference result was 0.000049, so the implementation is considered acceptably accurate.


GPU.NET

QR-decomposition is not tested with GPU.NET for the same reasons as described in the LU-

decomposition chapter.

8.3 Optimisation

Task 1 - Two-norm

This task can be improved by using a parallel reduction algorithm. Doing so makes it possible to decrease the asymptotic running time to $O(\log(d))$.

Task 2 - Householder vector

Each $k$-th step has a data size of $d = n - k$, to which the running time of the sequential implementation is proportional. Each element in the Householder vector can be calculated independently, making the asymptotic parallel running time about $d/p$, where $p$ is the number of processors. The maximum possible number of processes is $d$, making the asymptotic parallel running time $O(1)$ if a minimum of $n - 1$ processes is available; if not, the asymptotic running time is $O(d)$.

Task 3 – Transform columns

The number of operations required to update the remaining columns and rows of the matrix equals:

$$2 \cdot d \cdot (m - k)$$

where $d$ represents the number of columns $n - k$ that can be processed in parallel. So the parallel running time is:

$$2 \cdot \frac{d}{p} \cdot (m - k)$$

where $p$ is the number of processors. If a minimum of $n$ processors are available, the running time for each step is proportional to $2 \cdot (m - k)$, where $m - k$ is the number of rows.
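The parallel reduction mentioned for task 1 can be illustrated on the CPU. The sketch below is not the Cuda kernel from the project; it is a hypothetical sequential emulation in which every pass of the outer loop corresponds to one parallel step, so the number of passes shows the logarithmic depth of the reduction.

```cpp
#include <cstddef>
#include <vector>

// Sum of squares computed with a tree-shaped (pairwise) reduction.
// All additions inside one pass are independent of each other and could be
// executed by separate threads. steps_out (optional) receives the number of
// passes, which grows as log2 of the input size.
float sum_of_squares_tree(const std::vector<float>& x, int* steps_out) {
    std::vector<float> partial(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        partial[i] = x[i] * x[i];  // element-wise square, fully parallel

    int steps = 0;
    for (std::size_t active = partial.size(); active > 1;) {
        const std::size_t half = (active + 1) / 2;
        for (std::size_t i = 0; i + half < active; ++i)
            partial[i] += partial[i + half];  // independent pairwise additions
        active = half;
        ++steps;
    }
    if (steps_out) *steps_out = steps;
    return partial.empty() ? 0.0f : partial[0];
}
```

For 128 elements the reduction finishes in 7 passes, compared with 127 dependent additions in the sequential loop. The square root of the returned value is the two-norm used in task 1.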

With these optimisations implemented, the tests were performed again for 400 × 400, 2000 × 2000, 4000 × 4000 and 6000 × 6000 matrices.

The optimisations have increased performance on all systems. Matrices larger than approximately 1000 × 1000 now benefit from being computed on the Cuda architecture. The peak performance for system #3 reached 1.42 gigaflops, which is not as impressive as the results achieved by matrix-multiplication and LU-decomposition, but still about 5 times as fast as the CPU. One of the reasons for this low performance is that the algorithm is not well suited to a parallel architecture.

8.4 Block QR-decomposition

The algorithm used so far relies heavily on vector operations. Matrix operations are much better at exploiting a parallel architecture, and such an algorithm has been designed by Susan Ostrouchov et al. [27].

8.4.1 The block algorithm

By partitioning an $M \times N$ matrix $A$, the factorisation QR may be partitioned as shown [27]:

$$A_1 = \begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} \quad \text{and} \quad A_2 = \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix}$$

where $b \times b$ is the block size, $A_1$ is an $M \times b$ matrix containing the first $b$ columns, and $A_2$ is an $M \times (N - b)$ matrix containing the remaining columns. $A_{11}$ is $b \times b$, $A_{12}$ is $b \times (N - b)$, $A_{21}$ is $(M - b) \times b$ and $A_{22}$ is $(M - b) \times (N - b)$.

The first step is to perform a regular QR factorisation on $A_1$ using a series of Householder transformations of the form:

$$H_i = I - \tau_i v_i v_i^T$$

where $i = 1, \dots, b$. The vector $v_i$ is of length $M$, where the first $i - 1$ elements are 0 and the $i$-th element is 1.

$$\tau_i = \frac{2}{v_i^T v_i}$$

It can be shown that:

$$Q = H_1 H_2 \dots H_b = I - Y T Y^T$$

So in step 2, the triangular factor $T$ of the block reflector $Q$ is calculated. The triangular factor is used together with the transformation above to update the remaining matrix.

After $M/b$ steps, the matrix A has been decomposed using block QR-decomposition, as depicted here. The white parts have already been solved.

Figure 19 - Matrix A being decomposed by block QR-decomposition in steps.

Figure 19 shows the $M \times N$ matrix, where $k$ is the current step and the block width is the dimension of the current block being processed (here the green and purple sub-matrix). Step 1 QR-decomposes the green and purple sub-matrix by regular QR-decomposition; the triangular factor is then used to transform the remaining columns in the cyan and blue sub-matrix. The steps are then repeated for the remaining parts until the whole matrix is processed.

This algorithm makes it possible to partition large matrices and solve smaller parts; furthermore, it uses matrix operations, which can utilise the parallel Cuda architecture.

8.4.2 Implementation

Implementation of this block algorithm has been challenging. The structure of the algorithm resembles that of block LU-decomposition, which was implemented and performed well. This made me hope that the block QR algorithm could also be implemented and perform well. Unfortunately, this has not been the case.

Implementation of the algorithm was initially attempted for CPU processing. Thorough debugging and testing have revealed that most of the algorithm works and generates the expected result. Regrettably, not all parts work as hoped. The problematic part is related to this transformation:

$$Q = H_1 H_2 \dots H_b = I - Y T Y^T$$

This transformation can be divided into three steps, using the explanation and figures from above:

$$W \leftarrow Y^T A_2$$

This step is a matrix-multiplication between the transposed block of Householder vectors and the sub-matrix $A_2$, consisting of the remaining columns in the matrix A. The result is written to the $b \times (N - b)$ matrix $W$.

$$W \leftarrow T^T W$$

Then the triangular factor $T$ of the block reflector $Q$ should be computed, after which its transpose is matrix-multiplied with the existing matrix $W$.

$$\tilde{A}_2 \leftarrow \begin{pmatrix} \tilde{R}_{12} \\ \tilde{R}_{22} \end{pmatrix} = A_2 - YW$$

When the final matrix $W$ is computed, it is used in a matrix-multiplication with the block Householder vectors. The elements are then subtracted from the sub-matrix $A_2$, which should give $\tilde{A}_2$.

These steps should then be repeated until the complete matrix has been decomposed.

But numerous attempts at calculating the triangular factor $T$ have failed, and the resulting matrix and array contain wrong values.

8.5 Evaluation

The algorithms for LU- and QR-decomposition have a similar structure, so ideas from the LU implementation were also applied to the QR implementations.

Optimising the running time of the different tasks increased performance by a factor of 3.84.

Unfortunately, because the block QR algorithm could not be completed, no further tests were performed. This means that the full potential of QR on the Cuda architecture is still to be uncovered.

Jack Dongarra, Susan Ostrouchov and others have designed this block QR algorithm. They are highly competent people who have made contributions to Eispack, Linpack, BLAS, Lapack and ScaLapack. The challenge with the $T$ factor is more than likely related to my implementation rather than to the algorithm.

9 Evaluation

The optimisation strategy described some methods and techniques that could be applied when improving the implementation of the linear algebra algorithms. This evaluation will summarise the findings and evaluate the strategy.

9.1 Cuda

The Nvidia profiler can show relevant counters for both arithmetic and memory performance.

CGMA source code analysis can give valuable information about memory bandwidth as a

limiting factor.

The results from the tests suggest that a block linear algorithm is best suited for the Cuda

architecture. Such an algorithm is designed to divide data into sizes that fit into caches, such

as shared memory.

When implementations are to be optimised, the findings from this project suggest that tiling

is the best strategy, followed by latency hiding and coalescing memory access.

With regards to coalescing memory access, it should be mentioned that GPU architecture

designers are aware of the importance of this limiting factor, so newer GPUs are designed

with built-in optimised memory access. The impact of non-coalesced memory access should

therefore be of less importance in the future, and hence make porting of existing algorithms

easier.

In addition to the points above, here is a list with recommendations based on the findings of

the tests performed in this project:

• Avoid using structures as parameters in the kernel definitions, use instead simple

types or pointers thereof.

• Target the highest possible Compute Capability level. Among other things, the precision of instructions is better and the results will be more accurate.

• Unroll loops by making the threads fine-grained. Thread generation and scheduling are cheap.

• Thread block size should be a multiple of the warp size (Currently 32).

• Be aware of the overhead for invoking a kernel.

• Note that default instructions deviate from IEEE 754, use specific IEEE 754 functions

for increased precision, but at the cost of speed.
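The recommendation on thread block size can be supported by two small helpers on the host side. This is a minimal sketch with hypothetical names; the warp size of 32 is the value stated in the list above and may differ on other hardware.

```cpp
constexpr int kWarpSize = 32;  // current warp size, as stated in the list above

// Rounds a requested block size up to the next multiple of the warp size.
constexpr int round_up_to_warp(int threads) {
    return ((threads + kWarpSize - 1) / kWarpSize) * kWarpSize;
}

// Number of thread blocks needed so that every element gets one thread.
constexpr int blocks_needed(int elements, int threads_per_block) {
    return (elements + threads_per_block - 1) / threads_per_block;
}
```

A requested block size of 100 is rounded up to 128, and 10,000 elements with 128 threads per block require 79 blocks, of which the last one is only partly used and therefore needs a bounds check in the kernel.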

Besides the list and suggestions above, there were also methods with doubtful results:

• The Volkov suggestion yielded performance gains on some systems, but losses on others. It can be useful for low-occupancy kernels, but should be tested and evaluated.

• Data prefetching can both increase and lower performance.

The underlying hardware and its capabilities play an important role in whether an optimisation technique affects performance. Some methods have a positive effect on some GPUs and a negative effect on others. Analysis and testing should therefore always be performed.

9.2 GPU.NET

GPU.NET v1.0.3.5 was not mature and suffered from several bugs. The number of problems means it cannot be recommended for production environments. However, the latest release is v2.0.14, which solves many of the bugs and problems I encountered.

The JIT compilation of kernels is a design decision that applies to all current versions of

GPU.NET. A JIT compilation is cached in-memory, and subsequent calls from the same process

will be served from this cache. It is therefore recommended, when using GPU.NET for large or numerous problems, to warm up both Cuda and GPU.NET. Do this by calling the kernel with a small data size; subsequent calls will then be served faster.

10 Discussion and future work

This chapter begins with a discussion of the work and results in this project, and of which fields could be researched further. Then the future of Cuda is discussed in comparison with GPU code generation tools, followed by a broader perspective on hardware development and GPGPU in general.

10.1 Project

A more thorough correctness test and analysis could further clarify the numerical stability of the implementations used in this project, for example by comparing this project's results with results from the widely recognised Matlab.

For this project I insisted on implementing all parts of the algorithms. A lot of work and research has gone into the development of standard math libraries supporting the BLAS interface. Implementing all parts gave valuable insight into the inner workings of the algorithms, but possibly at the expense of performance. Using these libraries (e.g. Cublas or Cula) could reveal the full performance potential of the different algorithms on the Cuda architecture.

Testing the performance of other linear algebra algorithms could serve as a frame of reference. For example, how would using Givens rotations instead of the chosen Householder transformations affect the performance of QR-decomposition? A more thorough analysis and testing of the block QR algorithm would also be beneficial.

The optimisation strategy and the optimisation experiences could be applied on several other

linear algebra algorithms. An obvious extension would be the Singular Value Decomposition

(SVD).

10.2 Cuda

Cuda C and GPU.NET currently represent two different directions for utilising the Cuda architecture. Cuda C is C/C++ and complex, whereas GPU.NET is .NET, uses code generation, and is easier to use. You might say that GPU.NET is for developers who want to accelerate their applications using parallel architectures without too much trouble. Cuda C, on the other hand, is for developers who are not intimidated by C/C++ and tweaking.

Cuda C offers more flexibility, which enables better optimisation and higher performance, but it does not have to be a choice between the two sets of advantages. Cudafy.NET is a set of libraries and tools supporting both directions.

Cudafy.NET can be used in the same way as GPU.NET, using full code generation. But it can

also just work as a bridge from .NET to Cuda C kernels. Cuda C optimisations are then

possible, while the invocation is carried out by the .NET runtime.

Uncovering the performance characteristics of Cudafy.NET, e.g. using the linear algebra algorithms from this project, could be another valuable next step.

It is expected that Cuda will be continuously improved, e.g. by making NVCC support C++ language features in kernels, allowing better debugging in Nsight, and increasing the language support features in IDEs, to make development smoother.

With Cuda v4.0 the tools and drivers have been updated, and now enable a grid of machines and GPUs to work together to solve large problems. This makes Cuda able to solve even larger problems than with former versions.

10.3 Hardware

The newer Cuda GPUs are becoming increasingly accurate, meaning the instructions are performed with better numerical precision, at even faster speeds. Double-precision instructions have been supported since Compute Capability 1.3, and it is expected that these will also become more precise and faster.

The future will surely also bring GPUs with even more cores and faster memory. Currently the architecture of the Nvidia GF100 chips supports up to 512 cores, but the dedicated GPU computing system Tesla S2050 has 4 GPUs with a total of 1,792 cores. Nvidia is not the only player when it comes to GPGPU. AMD has the FireStream architecture, and the top model FireStream 9370 has 1,600 cores delivering 2,640 gigaflops.

Looking at the latest “TOP500 supercomputers list”, 3 of the top 5 are using Nvidia GPUs. So Nvidia is a strong player, and I expect Nvidia and Cuda to play an important role in the GPGPU field in the future.

10.4 Future of GPGPU

GPGPU development has for a long time been limited to first movers who saw a potential in the high processing power that GPUs offer. Currently GPGPU is often used where many computations are needed, for example in fluid simulations, weather forecasting, or the prediction of protein folding used by the pharmaceutical industry. But another and more subtle application is slowly emerging.

A graphics card with a high-performing GPU is a relatively cheap commodity, and many regular computer systems are today equipped with a high-performing GPU. Some application developers have spotted this opportunity and now allow their application to be optionally accelerated by the GPU. This is often completely transparent to the end user, but delivers improved application responsiveness, which gives the user a better experience.

Applications that currently exploit this possibility include, but are not limited to, different browsers, such as Internet Explorer, Chrome and Firefox, and different video editing applications.

11 Conclusion

Three frequently used linear algebra algorithms for matrix-multiplication, LU- and QR

decomposition was decided on for this project. They were described, analysed, and then

initially implemented using C/C++ for the CPU architecture.

The Cuda architecture and development platform was subsequently analysed and described.

Important features, characteristics and limitations were uncovered and an optimisation

strategy was formed.

Based on the analysis of the linear algebra algorithms and Cuda, implementation procedures were designed. The algorithms were then implemented for the Cuda architecture using C/C++ and Cuda C, after which they were tested. During this process several findings were made, which were subsequently used in combination with the Cuda optimisation strategy to improve performance.

GPU.NET was used, where applicable, as a perspective on how to use Cuda from .NET.

Correctness tests were performed by comparing the results from the CPU with the results

from the GPU. The maximum differences documented the accuracy of the different

algorithms processed on various systems and GPUs.

The learning goals have all been achieved and the complete process has been documented in

this report.

Bibliography and references

1. Mikkel Bundgaard-Ovesen, Documentation of the GPUs usability in advanced parallel

calculations, 15th December 2010

2. Nvidia Cuda, Nvidia Cuda C Programming Guide, 9th November 2010

3. Desmond Fearnley-Sander, Hermann Grassmann and the creation of linear algebra,

December 1979

4. David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors, 2010

5. Jason Sanders and Edward Kandrot, Cuda by example – an introduction to General-

Purpose GPU Programming, 2011

6. Vasily Volkov, Better Performance at Lower Occupancy (slides), 22nd September 2010

7. Vasily Volkov, Use registers and multiple outputs per thread on GPU (slides), 30th

June 2010

8. Geekbench, Performance of an Intel Pentium 4 3.06 GHz running Linux, Downloaded

3rd June 2011 (http://browse.geekbench.ca/geekbench2/view/209683)

9. Nvidia, Cuda C Best practices Guide, 20th September 2010

10. Paulius Micikevicius (Nvidia), Analysis-Driven Optimization (slides), 14th November

11. Sara Robinson, Toward an Optimal Algorithm for Matrix Multiplication, November

12. Ananth Grama et al., Introduction to Parallel Computing, 2nd edition, 26th January

13. Mary Jane Sterling, Linear Algebra for Dummies, 2009

14. Brian W. Kernighan and Dennis M. Ritchie, C Programming Language, 2nd edition, 1st

April 1988

15. John J. Barton and Lee R. Nackman, Scientific and Engineering C++: An Introduction

With Advanced Techniques and Examples, 19th August 1994

16. Jens Eising, Lineær Algebra, 1999

17. G. W. Stewart, Afternotes on Numerical Analysis, 1996

18. E. E. Santos and M. Muraleetharan, Analysis and Implementation of Parallel LU-

Decomposition with Different Data Layouts, June 2000

19. Prof. Michael T. Heath, Parallel Numeric Algorithms: LU-Decomposition (slides), 2010

20. Vasily Volkov and James W. Demmel, Benchmarking GPUs to Tune Dense Linear

Algebra, November 2008

21. Vasily Volkov and James W. Demmel, LU, QR and Cholesky Factorisations using Vector

Capabilities of GPUs, 2008

22. Jack Dongarra et al., Derivation of a Block Algorithm for LU Factorization, 9th

February 1997

23. Peter J. Olver, Orthogonal Bases and the QR Algorithm, 5th June 2010

24. Prof. Michael T. Heath, Parallel Numerical Algorithms: QR-Factorization (slides),

25. Walter Gander, Algorithms for the QR-Decomposition, April 1980

26. Radu Trîmbitas, Householder Reflectors and Givens Rotations: Why orthogonality is

fine, 11th March 2009

27. Susan Ostrouchov, QR Factorization (a block algorithm), 28th April 1995

Appendix A – Project evaluation

The initial problem definition about linear algebra algorithms was updated during the project

period. In agreement with my supervisor Peter Sestoft, we decided to focus on linear algebra

algorithms for matrix-multiplication, and LU- and QR-decomposition. This clarification enabled me to focus on analysing Cuda features and limitations. My assessment is that this narrowing made it possible to uncover Cuda characteristics in a level of detail that would not otherwise have been possible.

I am satisfied with the result of the project. The problem definition and learning goals were

all fully met, and the process and findings are all described in the report. Not everything went without challenges, however, so let me elaborate.

To be able to implement an algorithm, a full comprehension of the algorithm and its inner workings is necessary. This proved to be very difficult for LU- and QR-decomposition. On top of the linear algebra complications came the difficulty of using a new development architecture and programming language.

I can best describe this by comparing it to building a house of cards. The implementation

phase is represented by the last third of the house. So before being able to build the top of

the house, one needs to build the first 2/3, and before that, one needs to determine where

the house should be based.

My initial lack of knowledge of linear algebra, C and C++ meant that many resources were

invested in learning and gaining abilities. In spite of the initial research phase, I did

encounter situations during the project period, where my knowledge still did not suffice. As

mentioned above, this applied specifically to LU- and QR-decomposition. With QR it was

specifically the block algorithm that was difficult to comprehend.

During the 6 months that the project period lasted, I did learn a great deal. Learning goals

covering linear algebra and algorithms, together with Cuda, C and C++, defined the areas in

which I wanted to extend my knowledge. As mentioned, I had only minor experience and no

qualifications in the fields prior to this project. So the learning requirement was high, and the

learning curve was steep, but I am satisfied with the result and the knowledge I have gained

will be beneficial in the future.

Appendix B – Implementation considerations

When the Cuda platform is utilised for processing, a new computing environment is

introduced into development and runtime. The host refers to the code and memory of the

CPU and the device refers to the code and memory of the GPU. The code and functionality

that exhibit little or no data parallelism are implemented in host code. The code and functionality that exhibit a rich amount of data parallelism are implemented in device code.

The host and device are two runtime environments that work independently. Communication between the host and device is obviously necessary, as the CPU would otherwise not be able to harness the GPU power of the Cuda architecture. In Cuda, the host is responsible for this communication, which includes structuring data, allocating and releasing memory on the device, copying data to and from the device, as well as invoking the device kernel.

In addition to this, the host is responsible for configuring the device execution environment, which basically means specifying the number of threads the architecture should spawn to solve a problem. The Cuda architecture allows, as shown in the following figure, threads to be organised in blocks, and blocks to be organised in a grid.

Figure 20 - Cuda thread organisation [4]

Cuda thread organisation

A kernel is mapped to a grid, which is organised in blocks in two dimensions, and a block can hold threads in three dimensions. In the device kernel, blocks and threads are identified by the following built-in variables:

Variable       Description
gridDim.x      Holds the number of blocks in the first dimension of the grid. Values are valid in the range 1-65535.
gridDim.y      Holds the number of blocks in the second dimension of the grid. Values are valid in the range 1-65535.
blockDim.x     Holds the number of threads in the first dimension of the block. Values are valid in the range 1-512.
blockDim.y     Holds the number of threads in the second dimension of the block. Values are valid in the range 1-512.
blockDim.z     Holds the number of threads in the third dimension of the block. Values are valid in the range 1-64.
blockIdx.x     Holds the current block's first-dimension position in the grid. Values are valid in the range 0 to gridDim.x-1.
blockIdx.y     Holds the current block's second-dimension position in the grid. Values are valid in the range 0 to gridDim.y-1.
threadIdx.x    Holds the current thread's first-dimension position in the current block. Values are valid in the range 0 to blockDim.x-1.
threadIdx.y    Holds the current thread's second-dimension position in the current block. Values are valid in the range 0 to blockDim.y-1.
threadIdx.z    Holds the current thread's third-dimension position in the current block. Values are valid in the range 0 to blockDim.z-1.

Table 10 - Cuda built-in variables

Why has Nvidia designed a thread structure in up to five dimensions? Would it not be easier to

just use a single dimension?

For simple algorithms that only require a thread structure in one dimension, this can be achieved. But there exist problems that naturally belong to a space of two or more dimensions, e.g. a matrix. The structure is optional, meaning that the developer, within some hardware limitations, decides how many dimensions to use.

The total number of threads is given by:

\[ \text{Threads} = \text{gridDim.x} \times \text{gridDim.y} \times \text{blockDim.x} \times \text{blockDim.y} \times \text{blockDim.z} \]

where blockDim.x × blockDim.y × blockDim.z cannot exceed the GPU's limit on the number of threads per block. For most GPUs this limit is 512 [5].
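The thread count and the per-block limit can be checked on the host before a kernel is launched. The following is a minimal sketch in plain C. The `dims` type and the function names are hypothetical helpers introduced for illustration (`dims` only mirrors the shape of Cuda's `dim3`), and the limit of 512 is the figure quoted above, not a value queried from the device.

```c
#include <assert.h>

/* Hypothetical helper type, mirroring the shape of Cuda's dim3. */
typedef struct { unsigned int x, y, z; } dims;

/* Per-block thread limit quoted in the text for most GPUs (an assumption,
   not queried from the hardware). */
#define MAX_THREADS_PER_BLOCK 512u

/* Threads in one block: blockDim.x * blockDim.y * blockDim.z. */
unsigned long threads_per_block(dims block) {
    return (unsigned long)block.x * block.y * block.z;
}

/* Returns 1 if the block shape respects the per-block limit, otherwise 0. */
int block_is_valid(dims block) {
    return threads_per_block(block) <= MAX_THREADS_PER_BLOCK;
}

/* Total threads: gridDim.x * gridDim.y * threads per block. */
unsigned long total_threads(dims grid, dims block) {
    assert(block_is_valid(block));
    return (unsigned long)grid.x * grid.y * threads_per_block(block);
}
```

With a grid of 4 x 2 blocks and blocks of 16 x 16 threads, each block holds 256 threads and the launch creates 2,048 threads in total. A block of 32 x 32 threads would be rejected under the assumed limit.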

The size of the grid and blocks is often defined directly in the source code, but the optimal

size is in many cases directly dependent on the data size. This is not very flexible, as it means

that grid and block size would have to be adjusted, in the source code, for different data

sizes, and afterwards recompiled before execution.

There are different solutions to this. One way is to set the numbers high enough to cover most cases. The kernel then has to check whether the current thread actually has data to process, as in line 6:

1. __global__ void kernel(float *data, int dataSize) {
2.
3.     // Thread ID
4.     int tid = threadIdx.x + blockIdx.x * blockDim.x;
5.
6.     if (tid < dataSize) {
7.
8.         // Process data
9.     }
10. }

This is inefficient, as many threads will be spawned without any actual data to process. Another way is to define the number of threads per block (e.g. 128), and then calculate the number of required blocks from the data size. This makes sure that at most (threadsPerBlock - 1) threads are created without any data to process.
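The second approach amounts to a rounded-up integer division on the host. The sketch below is plain C; the function names are hypothetical, and it assumes that dataSize + threadsPerBlock does not overflow an unsigned int.

```c
#include <assert.h>

/* Smallest number of blocks such that blocks * threadsPerBlock >= dataSize
   (integer division rounded up). */
unsigned int blocks_needed(unsigned int dataSize, unsigned int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (dataSize + threadsPerBlock - 1) / threadsPerBlock;
}

/* Threads that are launched without any data to process. */
unsigned int idle_threads(unsigned int dataSize, unsigned int threadsPerBlock) {
    return blocks_needed(dataSize, threadsPerBlock) * threadsPerBlock - dataSize;
}
```

For 1,000 elements and 128 threads per block this gives 8 blocks and 24 idle threads. The worst case is one element more than a full set of blocks, e.g. 1,025 elements, which leaves 127 idle threads. The bounds check on the thread ID in the kernel is still required, since the last block may be only partly used.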

A third way is to calculate the grid and block size dynamically from the data size; this is

however difficult as the optimal setting is influenced by both the data size and the structure

of the algorithm.

Either the second or the third method can prove feasible. Both have pros and cons, and which method to use should be determined on a case-by-case basis.

SIMT and warp size

As mentioned earlier, threads are organised in blocks. But this is not the only organisation;

each block is partitioned into warps. A warp is a bundle of 32 threads being executed in

parallel.

These threads execute the same instruction at the same time, hence Cuda is a Single Instruction, Multiple Threads (SIMT) architecture. This is a design decision to reduce hardware cost and to enable optimisation techniques, and it is not without relevance to the developer. The SIMT architecture has some implications that will be discussed later.

The size of the warp is another important aspect to take into account when defining the grid

and block size. Consider the example where a problem is organised into 20 blocks each with

10 threads, giving a total of 20 x 10 = 200 threads. Cuda executes 32 threads in a warp in

parallel. In the example above, only 10 threads are available per block. Cuda will in this case fill up the warp with 22 empty threads, resulting in 20 x 22 = 440 empty threads being created. The block size should therefore be a multiple of 32 [4].
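The arithmetic of the example can be reproduced with a small host-side helper. This is a sketch in plain C with hypothetical function names; the warp size of 32 is the figure given above.

```c
#include <assert.h>

/* Warp size as stated in the text. */
#define WARP_SIZE 32u

/* Warps needed to hold one block of the given size (rounded up). */
unsigned int warps_per_block(unsigned int threadsPerBlock) {
    return (threadsPerBlock + WARP_SIZE - 1) / WARP_SIZE;
}

/* Empty thread slots created across all blocks because each block is
   padded up to a whole number of warps. */
unsigned int padded_threads(unsigned int blocks, unsigned int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return blocks * (warps_per_block(threadsPerBlock) * WARP_SIZE - threadsPerBlock);
}
```

For 20 blocks of 10 threads the helper returns 440, matching the example. For any block size that is a multiple of 32, such as 64, it returns 0.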

Elapsed time

Measuring elapsed time is essential to measuring performance. Normal event timing in C and

C++ is CPU based, which is insufficient when dealing with the GPU. The GPU and CPU are

physically two independent processors, which run in parallel. The Cuda toolkit provides an API

for measuring GPU events and elapsed time.

The Cuda API will be used to measure memory allocation, copy of data from host to device,

the kernel execution time, copy of data from device to host and the release of memory.

These different timers will not just give the elapsed times of different operations, but also valuable insight into the GPU performance. It will for instance be possible to calculate memory transfer rates as well as the achieved performance of the kernel in gigaflops.
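How such figures can be derived from a measured time is sketched below in plain C. The function names are hypothetical. The sketch assumes that the elapsed time is given in milliseconds (the unit in which the Cuda event API reports it), that 1 GB is 10^9 bytes and 1 gigaflops is 10^9 operations per second, and that an n x n matrix-multiplication is counted as 2n^3 floating point operations, which is a common convention rather than a figure taken from this report.

```c
#include <assert.h>

/* Transfer rate in GB/s (assuming 1 GB = 1e9 bytes) from the number of
   bytes copied and the elapsed time in milliseconds. */
double transfer_rate_gbs(double bytes, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return bytes / (elapsed_ms * 1.0e-3) / 1.0e9;
}

/* Operation count of an n x n matrix-multiplication, counted as
   n^3 multiplications plus n^3 additions (assumed convention). */
double matmul_flops(unsigned int n) {
    return 2.0 * n * n * n;
}

/* Achieved gigaflops (assuming 1 gigaflops = 1e9 operations per second). */
double gigaflops(double flops, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return flops / (elapsed_ms * 1.0e-3) / 1.0e9;
}
```

As an example, copying 1.5 * 10^9 bytes in 1,000 ms corresponds to 1.5 GB/s, and a 1000 x 1000 matrix-multiplication that takes 100 ms corresponds to 20 gigaflops.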

In addition to this insight, the timings can be used to measure relative performance gains or losses when certain properties or capabilities of the Cuda architecture have been applied to the algorithms. The GPU timing will also serve as a base for comparison with the similar linear algebra processes on GPU.NET and the CPU.

Pinned or page-locked memory

A program that uses Cuda to harness the power of the GPU normally follows these steps:

1. Initialise

2. Copy data from host -> device

3. Process data on device

4. Copy data from device -> host

5. Release

The kernel has been the focus for optimisations and analysis so far, but there are other ways of optimising a program using Cuda. By using pinned or page-locked memory, higher data transfer rates can be achieved between host and device.

On platform #1 the speed of memory transfer could be increased from about 1.5 GB/sec to 5

GB/sec. But caution should be exercised when pinned memory is used; excessive use can

reduce overall system performance as page-locked memory is scarce [9].

Matrix structure

Matrices are mathematical structures in two dimensions. In computer memory a matrix can be represented either by a 2-dimensional array or by an array of arrays. Even though 2-dimensional structures are available in computer memory, it is better to vectorise the matrix by placing the rows after each other. Accessing the value in row 3, column 2 (counting from zero) of matrix A is then performed like so: v[3 * Width + 2], where v is the vector of matrix A, and Width is the column count of A. The Cuda architecture is designed to be stream based, so by vectorising data for processing on the GPU platform, one uses Cuda as it was designed and intended.

For the code I use the following matrix structure to hold the vector and details about the

matrix.

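typedef struct
{
    float *n;             // pointer to the vector of values, stored row by row
    unsigned int width;   // number of columns
    unsigned int height;  // number of rows
    unsigned int size;    // length of the vector (height * width)
} matrix;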

n is the pointer to the vector of float values, width is the number of columns, height is the number of rows, and size is the length of the vector (height*width).
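The row-by-row layout can be wrapped in a pair of helpers. The sketch below is plain C and repeats the structure so that it is self-contained. The functions `matrix_create` and `matrix_index` are hypothetical and introduced only for illustration; they are not part of the project code.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct
{
    float *n;
    unsigned int width;
    unsigned int height;
    unsigned int size;
} matrix;

/* Hypothetical helper: allocates a zero-initialised height x width matrix.
   The member n is NULL if the allocation fails. */
matrix matrix_create(unsigned int height, unsigned int width) {
    matrix m;
    m.width = width;
    m.height = height;
    m.size = height * width;
    m.n = (float *)calloc(m.size, sizeof(float));
    return m;
}

/* Hypothetical helper: offset of element (row, col) in the vector,
   with rows stored one after another and indices counted from zero. */
unsigned int matrix_index(const matrix *m, unsigned int row, unsigned int col) {
    assert(row < m->height && col < m->width);
    return row * m->width + col;
}
```

For a matrix with 4 rows and 5 columns, the element in row 3, column 2 is found at offset 3 * 5 + 2 = 17 in the vector. The caller is responsible for releasing the vector with free(m.n).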

Appendix C – Hardware specification description and analysis

In the following the specifications of the different machines will be described and evaluated

in terms of Cuda capabilities. The speed of the GPU and memory are measured and from that,

the memory bandwidth and gigaflops are calculated.

The GPU has historically been designed for performance and not precision, hence all gigaflops calculations are based on single-precision floating-point operations. Double precision has only been supported since Cuda compute capability (CC) 1.3, and with a significant performance penalty.

The following specifications are based on information and measurements by CPU-Z, GPU-Z

and Cpu Caps Viewer, as well as information about bus, FSB, PCI-E and more. The details are

meant to give a theoretical upper limit on performance, which can be used for comparison

with the results of the tests.

Platform #1

Apple Macbook 13” with an Intel Core 2 Duo P8700 2.53 GHz processor, 4 GB DDR3 RAM at 533 MHz, an Nvidia GeForce 9400m and a front-side-bus (FSB) at 1066 MHz. The GPU in the machine has the following specifications:

Cores                                   16
Memory interface                        128-bit
Memory bandwidth (internal/external)    8 GB/sec, 16.6 GB/sec
Graphics bus interface (PCI-E v2.0)     8 GB/sec
Transistors                             282 million
Core clock                              450 MHz
Shader clock                            1100 MHz
Memory clock                            1066 MHz (533 MHz double pumped)
Gigaflops                               51.56
Cuda Compute Capability                 1.1

Table 11 - GPU specifications for Nvidia GeForce 9400m, platform #1

Platform #2

Apple iMac 24” with an Intel Core 2 Duo E8435 3.06 GHz processor, 4 GB DDR2 RAM at 399 MHz, an Nvidia GeForce 8800 GS and an FSB at 1066 MHz. The GPU in the machine has the following specifications:

Cores                                   96
Memory bandwidth (internal)             49.94 GB/sec
Memory bandwidth (external)             6.23 GB/sec
Graphics bus interface (PCI-E v1.1)     8 GB/sec
Transistors                             754 million
Core clock                              500 MHz
Memory clock                            800 MHz
Gigaflops                               234.38

Table 12 - GPU specifications for Nvidia GeForce 8800 GS, platform #2

Platform #3

A machine with an Nvidia Tesla C1060 GPU. The exact machine specifications have not been available; however, the specifications of the C1060 GPU give some hints about the performance.

Cores                                   240
Memory bandwidth (internal)             102.4 GB/sec
Transistors                             754 million
Core clock                              602 MHz
Memory clock                            1600 MHz
Gigaflops                               933.12 for Total (Mul+Add+Special Function)
                                        622.08 for Total (Mul+Add)

Table 13 - GPU specification for Nvidia Tesla C1060, platform #3

Platform evaluation

You may wonder why platform #3 has two different gigaflops figures. The first is based on the specifications of the G80 and its descendant architectures, which say that a GPU is capable of performing a Multiply-Add instruction dual-issued with a special function instruction per operation cycle. The second is based on the newer Fermi architecture specifications, in which an operation cycle can perform a Multiply-Add instruction dual-issued.

That a newer architecture should be slower than an older one contradicts the logic of development and improvement, and it is indeed not so. The G80 based architectures are equipped with streaming processors (SP) and separate special function units (SFU). The SP combined with the SFU gives theoretically 3 operations per clock cycle; however, basing a gigaflops calculation on these specifications makes the result very theoretical. Calculating the gigaflops performance this way may be formally correct, but it does not yield an achievable result. The reason is most likely Nvidia's competition with other GPU manufacturers to produce the GPU with the highest gigaflops count.

Most development and testing are performed on platform #1, so this platform will serve as a

Specifications

This section digs a little deeper into the specifications of the hardware and describes the theoretical performance limits. When dealing with GPUs, the most important figures are memory transfer rates and GPU gigaflops. The relevant elements are the chipset, front-side-bus (FSB), memory speeds and the GPU.

Figure 21 – Block diagram of a chipset. Source: Intel

Chipset

The chipset consists of a north- and a south bridge. The north bridge is responsible for

handling the exchange of data between the CPU, memory and the graphics adapter. The

south bridge handles exchange of data with external devices like audio, network, hard discs

and USB devices. The north bridge is the most data intensive and relevant for this project, whereas the south bridge is not used by GPU accelerated applications.

The bus speed of platform #1 is 266 MHz with a multiplier of 4, making the rated FSB about 1066 MHz. The width of the bus is 64-bit, making the transfer rate:

\[ \text{Transfer rate (GB/s)} = \frac{\text{FSB (MHz)} \times \text{Width (bits)}}{8 \times 1024} \]

The transfer rate is then 8.33 GB/s.
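The figure can be checked with a few lines of plain C. The function name is hypothetical; the formula is the one above, with the FSB given in MHz, the bus width in bits, and the division by 8 and by 1024 converting bits to bytes and MB/s to GB/s.

```c
#include <assert.h>

/* Peak bus transfer rate in GB/s: FSB (MHz) * width (bits) / (8 * 1024). */
double bus_transfer_rate_gbs(double fsb_mhz, unsigned int width_bits) {
    assert(width_bits % 8 == 0);
    return fsb_mhz * width_bits / (8.0 * 1024.0);
}
```

With the rated FSB of 1066 MHz and a 64-bit bus, the function returns 1066 * 64 / 8192, which is approximately 8.33 GB/s.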

Figure 22 - CPU and bus details of platform #1

Memory

Memory transfer occurs when data is copied from host to device and again when the result is

copied from device to host. This data is transferred via the chipset's north bridge from the CPU/system memory to the device memory. The GPU of platform #1 has no dedicated memory and uses the system memory. The transfer rate is for that reason equal to that of the system memory.

The system memory consists of two DDR3 modules whose peak transfer rate is double that of

the FSB (double data rate), meaning 16.66 GB/s.

Figure 23 - System memory details of platform #1

Graphics adapter

The software GPU-Z reports the GPU of platform #1 to run on a PCI port. But this cannot be true, as the transfer rate would then be about 2 GB/sec. My guess is that it runs, as the Nvidia specification says, on a PCI Express 2.0 bus interface with a peak transfer rate of 8 GB/s one way, which incidentally is the same as the memory transfer rate.

Figure 24 - GPU details of platform #1

The gigaflops count is calculated by:

\[ \text{Gigaflops} = \frac{\text{Cores} \times \text{Shader clock (MHz)} \times \text{Operations per cycle}}{1024} \]

Platform #1 has 16 cores with a shader clock of 1100 MHz. The number of operations per cycle is theoretically 3 (Mul+Add+SF), which in turn results in a gigaflops count of 51.56.
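The calculation for platform #1 can be reproduced as follows. This is a sketch in plain C with a hypothetical function name; it follows the formula above, including its division by 1024.

```c
#include <assert.h>

/* Theoretical peak: cores * shader clock (MHz) * operations per cycle / 1024. */
double theoretical_gigaflops(unsigned int cores, double shader_mhz,
                             unsigned int ops_per_cycle) {
    assert(cores > 0 && ops_per_cycle > 0);
    return cores * shader_mhz * ops_per_cycle / 1024.0;
}
```

For 16 cores at 1100 MHz and 3 operations per cycle the result is 52800 / 1024 = 51.5625, i.e. the 51.56 gigaflops stated above. Counting only the Multiply-Add (2 operations per cycle) gives 34.375 gigaflops for the same GPU.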

Evaluation

Development and testing will mainly be performed on platform #1 and #2, even though they lack the extreme computing power that platform #3 possesses. The purpose of this project is to test the applicability of GPGPU for solving different problems, and the focus is furthermore on testing relative performance gains or losses of different optimisation techniques.

Platform #3 will however give an important insight into the performance of solving these problems on a massively parallel architecture. Platform #3 furthermore supports a higher compute capability, which makes even more optimisation techniques available, as well as double precision operations.

Appendix D – Development environment problems and solution model

Making VS2008 ready for Cuda development was a challenge. The following is a description of the problems experienced and the solution used.

Development model

Cuda toolkit version 3.2 supports Microsoft Visual Studio 2005 (VS2005) and Visual Studio 2008

(VS2008). It is possible to enable development in Visual Studio 2010 (VS2010), but it has proven difficult to set up. This is, among other reasons, due to the fact that the Nvidia Cuda compiler (NVCC) requires either a Visual C++ version 8 or 9 compiler.

I have tried to set Cuda up for VS2010, but the trouble has led me to the conclusion that the problems and minor inconveniences far exceed any gains achieved by using VS2010.

The Nvidia GPU computing SDK, a separate package, provides help, tutorials, utility helpers

as well as code examples. With this package, all the hard work of configuring Visual Studio and setting up paths and environment variables is done for you. However, it comes with libraries packed with utility and helper functions and references to other libraries.

Performance is important in this project, and it is not possible to say what impact reference libraries or utility functions might have, which is why I have decided to create a new and clean project model that can serve as a base for the performance tests in Cuda. By doing so, I also gain valuable insight into the structure of the toolkit and its applicability.

Cuda C and C++

Device code has to be written in Cuda C, a language based on ANSI C. Host code, on the other hand, does not.

The C language has evolved since its initial release, and C++ provides new features and updated libraries. It can therefore be desirable to use a mix of C and C++ when coding for the host, while device code must be written in Cuda C.

In the project model C++ code must be contained in .cpp files and C for Cuda in .cu files.

With the correct configuration it is possible to make NVCC compile .cu files and Visual C++ 9.0

compile the rest. The linker's responsibility is then to link the compiled objects and functions into a single executable file.

A description of what steps I had to take, and the project model can be found on my blog:

http://blog.ovesens.net/2011/05/cuda-v3-2-template-project-using-cpp/

Appendix E – CGMA and Cuda profiler

Being able to optimise requires event measuring or profiling. But event measuring or profiling requires knowledge of what to profile. The Nvidia presentation “Analysis-Driven Optimization” [10] identifies four categories of what can limit a kernel's performance: memory throughput, instruction throughput, latency, or a combination of the above.

To achieve the best performance it is important to strike a good balance in the instructions:bytes ratio, also called the compute to global memory access (CGMA) ratio [4]. Two methods should be applied when looking for optimisation possibilities: source code analysis and tool profiling.

By looking at the source code, the developer can analyse whether a kernel is memory or instruction bound, and whether the ratio between these two is limiting the performance.
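A source code analysis of this kind can be illustrated with matrix-multiplication. The sketch below is plain C with hypothetical function names. It assumes the definition of CGMA as floating point operations per global memory access, and the usual accounting for a tiled kernel: per tile phase each thread loads two values from global memory and then performs one multiplication and one addition for each of the tile_width values held in shared memory.

```c
#include <assert.h>

/* CGMA: floating point operations per global memory access (assumed definition). */
double cgma_ratio(double flops, double global_accesses) {
    assert(global_accesses > 0.0);
    return flops / global_accesses;
}

/* Inner loop of a simple matrix-multiplication kernel: one multiplication
   and one addition per iteration, fed by two global memory loads. */
double naive_matmul_cgma(void) {
    return cgma_ratio(2.0, 2.0);
}

/* Tiled kernel (assumed accounting): two global loads per thread and tile
   phase, followed by tile_width multiplications and tile_width additions. */
double tiled_matmul_cgma(unsigned int tile_width) {
    assert(tile_width > 0);
    return cgma_ratio(2.0 * tile_width, 2.0);
}
```

Under these assumptions the simple kernel has a ratio of 1.0, while a tile width of 16 raises it to 16.0, which indicates why tiling moves a kernel away from being memory bound.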

To measure events, Nvidia provides a tool called “Compute Visual Profiler” that provides different counters. Different profile counters are available for GPUs of different CC levels; a complete list and description can be found in the “Compute Visual Profiler User Guide”.

The higher the compute capability, the more detailed and accurate the counters, but that does not mean that this project's development hardware, with a compute capability of 1.1, does not provide any counters with valuable insight. These are shown and described in the following table.

Counter             Description
divergent branch    Number of divergent branches within a warp. This counter is incremented by one if at least one thread in a warp diverges (that is, follows a different execution path) via a data dependent conditional branch. The counter is incremented by one at each point of divergence in a warp.
instructions        Number of instructions executed.
gld uncoalesced     Number of non-coalesced global memory loads.
gld coalesced       Number of coalesced global memory loads.
gst coalesced       Number of coalesced global memory stores.
local load          Number of local memory load transactions. Each local load request will generate one transaction irrespective of the size of the transaction.
local store         Number of local memory store transactions; incremented by 2 for each 32-byte transaction, by 4 for each 64-byte transaction and by 8 for each 128-byte transaction for compute devices having compute capability 1.x. It is incremented by 1 irrespective of the size of the transaction for compute devices having compute capability 2.0.

Table 14 - Selected profile counter from Compute Visual Profiler User Guide

These profile counters can give valuable insight into what a kernel actually does, but they cannot be used without consideration. Nvidia writes the following:

Compute Visual Profiler values are best used to identify relative performance

differences between un-optimized and optimized code.

But comparing the profiled numbers with the analysed numbers gives a good estimate of how much bandwidth is wasted by suboptimal coalescing of memory access [9].

Appendix F – Matrix-multiplication CC levels

Kernel                          CC 1.1             CC 1.3             CC 2.0
Resulting matrix                7.80 gigaflops     7.80 gigaflops     14.76 gigaflops
Tiled v1                        19.12 gigaflops    19.11 gigaflops    19.86 gigaflops
Tiled v2 (Latency hiding)       31.97 gigaflops    31.95 gigaflops    33.57 gigaflops
Tiled v3 (Prefetching)          32.83 gigaflops    32.84 gigaflops    34.57 gigaflops
Tiling v4 (2 outputs/thread)    33.46 gigaflops    33.43 gigaflops    36.90 gigaflops
Tiling v5 (4 outputs/thread)    39.47 gigaflops    39.43 gigaflops    37.06 gigaflops

Results from the matrix-multiplication compute capability levels test on platform #4.
