Post on 05-Mar-2015
Table of contents
List of figures
List of tables
Summary
1 Introduction
1.1 Motivation
1.2 Reading guide
1.3 Problem definition
1.4 Method
1.5 Scope
1.5.1 Algorithms
1.5.2 Numerical stability
1.5.3 IEEE 754 and double-precision
1.5.4 BLAS
2 Background
2.1 Linear algebra
2.2 GPU computing
3 Parallel platforms
3.1 Cuda
3.1.1 History
3.1.2 Version
3.1.3 Cuda program
3.1.4 Architecture
3.1.5 Limitations
3.2 GPU.NET
3.2.1 Overview
3.2.2 Development
3.2.3 Execution
3.2.4 Limitations and bugs
3.2.5 Evaluation
4 Hardware platform
4.1 Analysis
4.2 Benchmarking
4.2.1 Memory performance
4.2.2 Arithmetic performance
5 Implementation
5.1 Development environment
5.2 Design decisions
5.3 Optimisation
5.3.1 Strategy
6 Matrix-multiplication
6.1 Analysis
6.1.1 The sequential algorithm
6.1.2 Parallelism
6.2 Simple algorithm
6.2.1 The algorithm
6.2.2 Test and results
6.3 Optimisation
6.3.1 Unroll loop with threads
6.3.2 Tiling v1
6.3.3 Tiling v2 with latency hiding
6.3.4 Tiling v3 with prefetching
6.3.5 Tiling v4 and v5 with more output per thread
6.3.6 Cuda compute capability
6.4 Evaluation
7 LU decomposition
7.1 Analysis
7.1.1 The sequential algorithm
7.1.2 Parallelism
7.2 Simple algorithm
7.2.1 The algorithm
7.2.2 Test and results
7.3 Block LU-decomposition
7.3.1 The block algorithm
7.3.2 Implementation
7.3.3 Test and results
7.3.4 Optimising round 1
7.3.5 Test and results
7.3.6 Optimising round 2
7.3.7 Further optimisation
7.3.8 Large matrices
7.4 Evaluation
8 QR decomposition
8.1 Analysis
8.1.1 The sequential algorithm
8.1.2 Parallelism
8.2 Simple algorithm
8.2.1 The algorithm
8.2.2 Test and results
8.3 Optimisation
8.3.1 Test and results
8.4 Block QR-decomposition
8.4.1 The block algorithm
8.4.2 Implementation
8.5 Evaluation
9 Evaluation
9.1 Cuda
9.2 GPU.NET
10 Discussion and future work
10.1 Project
10.2 Cuda
10.3 Hardware
10.4 Future of GPGPU
11 Conclusion
Bibliography and references
Appendix A – Project evaluation
Appendix B – Implementation considerations
Cuda thread organisation
SIMT and warp size
Elapsed time
Pinned or page-locked memory
Matrix structure
Appendix C – Hardware specification description and analysis
Platform #1
Platform #2
Platform #3
Platform evaluation
Specifications
Evaluation
Appendix D – Development environment problems and solution model
Development model
Cuda C and C++
Appendix E – CGMA and Cuda profiler
Appendix F – Matrix-multiplication CC levels
Appendix G – Report page count
List of figures
Figure 1 - Cuda program sequence diagram
Figure 2 - Cuda architecture with four multiprocessors
Figure 3 - How GPU.NET works as described on TidePowerd.com
Figure 4 – Simplified diagram of a chipset
Figure 5 - Matrix-multiplication process depicted
Figure 6 – The output of the console testing program
Figure 7 - Performance of kernels executed for different CC levels on platform #4
Figure 8 – Performance of simple LU-decomposition on different platforms.
Figure 9 – Matrix A being decomposed by block LU-decomposition in steps.
Figure 10 - Performance of block LU-decomposition v1 on different platforms.
Figure 11 - Computing time of each kernel in block LU-decomposition v1 on platform #4.
Figure 12 - Performance of block LU-decomposition v2 on different platforms.
Figure 13 - Computing time of each kernel in block LU-decomposition v2 on platform #4.
Figure 14 - Performance of block LU-decomposition v3 on different platforms.
Figure 15 – Showing the sub-matrix part of the triangular solve method.
Figure 16 – A 10.000 x 10.000 matrix LU-decomposed on platform #3 and #4.
Figure 17 - Peak performance of LU-decomposition v3 on platform #3
Figure 18 - Storage strategy for the compressed Householder QR-factorisation
Figure 19 - Matrix A being decomposed by block QR-decomposition in steps.
Figure 24 - Cuda thread organisation [4]
Figure 20 – Block diagram of a chipset. Source: Intel
Figure 21 - CPU and bus details of platform #1
Figure 22 - System memory details of platform #1
Figure 23 - GPU details of platform #1
List of tables
Table 1 - Hardware specifications for the four platforms
Table 2 - Measured bandwidth of Cuda memory transfer operations
Table 3 - Measured gigaflops performance of GPU
Table 4 - Test result of outer loops matrix-multiplication on platform #1
Table 5 - Test result of outer loops matrix-multiplication no structure on platform #1
Table 6 - Test result of matrix-multiplication for resulting matrix on platform #1
Table 7 - Test result of matrix-multiplication for tiling strategy on platform #1
Table 8 - Tiling with 2 and 4 outputs per thread comparison for different platforms
Table 9 - Kernel invocation overhead ratio of total running time
Table 13 - Cuda built-in variables
Table 10 - GPU specifications for Nvidia GeForce 9400m, platform #1
Table 11 - GPU specifications for Nvidia GeForce 8800 GS, platform #2
Table 12 - GPU specification for Nvidia Tesla C1060, platform #3
Table 14 - Selected profile counter from Compute Visual Profiler User Guide
Summary
The purpose of this project was to uncover characteristics, features and limitations of the
Cuda architecture. An optimisation strategy was formed, containing methods and techniques
that supposedly enabled increased performance.
Three frequently used linear algebra algorithms were chosen: matrix-multiplication,
LU-decomposition and QR-decomposition. These algorithms were first implemented in a simple
version, and performance and correctness tests were performed. GPU.NET was used as a frame of
reference where applicable.
The optimisation strategy was used to improve the performance of the implemented
algorithms. It was found that a linear block algorithm could achieve better performance than
a regular algorithm.
The main output from this project was a list of recommendations and experiences from the
tests performed on the linear algebra algorithms. The findings from this project suggested
that tiling was the best strategy, followed by latency hiding and coalescing memory access,
when optimisation was the goal.
In addition to the points above, this list describes recommendations based on the testing:
• Avoid using structures as parameters in kernel definitions; use simple types or
pointers to them instead.
• Target the highest possible Compute Capability level. Among other things, the
precision of the instructions is better and the result will be more accurate.
• Unroll loops by making the threads fine-grained. Thread creation and scheduling
are cheap.
• Thread block size should be a multiple of the warp size (currently 32).
• Be aware of the overhead of invoking a kernel.
• Note that the default instructions deviate from IEEE 754; use the specific IEEE 754
functions for increased precision, at the cost of speed.
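The recommendation on thread block size can be illustrated with a small host-side helper. The following is a minimal sketch in C, not code from the project; the names WARP_SIZE, round_up_to_warp and blocks_for are hypothetical, and the warp size of 32 is taken from the list above.

```c
#include <stddef.h>

/* Assumed warp size; the list above gives 32 as the current value. */
#define WARP_SIZE 32u

/* Round a requested thread count up to the nearest multiple of the warp size. */
unsigned round_up_to_warp(unsigned threads)
{
    return ((threads + WARP_SIZE - 1u) / WARP_SIZE) * WARP_SIZE;
}

/* Number of thread blocks needed so that blocks * threads_per_block >= n
   (ceiling division; threads_per_block must be non-zero). */
size_t blocks_for(size_t n, unsigned threads_per_block)
{
    return (n + threads_per_block - 1u) / threads_per_block;
}
```

With these helpers, a request for 100 threads per block would be rounded up to 128, and 1000 elements at 128 threads per block would need 8 blocks.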
Besides the list and suggestions above, there were also methods with doubtful results:
• The Volkov suggestion yielded performance gains on some systems, but losses on
others. It can be useful for low-occupancy kernels, but should be tested and evaluated.
• Data prefetching can both increase and lower performance.
It was pointed out that the underlying hardware and its capabilities played an important role
in whether an optimisation technique affected performance. Some methods had a positive
effect on some GPUs and a negative effect on others. Analysis and testing should therefore
always be performed.
The purpose described in the problem definition was achieved, and the learning goals were
reached with satisfaction.
1 Introduction
This 30 ECTS thesis project was produced by Mikkel Bundgaard-Ovesen from 1st
February 2011 to 1st August 2011 at ITU Copenhagen. The project builds on the results
of the report “Documentation of the GPUs usability in advanced parallel calculations” [1],
and was supervised by Peter Sestoft.
1.1 Motivation
The speed of computers has increased over the years as a result of increased demand for
processing power. The CPU has from the beginning been the preferred architecture for
computing. But during the last decade, an additional computing architecture has evolved,
namely the graphics processing unit (GPU). A GPU, also called a massively parallel processor,
offers tremendous performance in gigaflops at a relatively low cost.
Different parallel computing architectures, such as Nvidia’s Compute Unified Device
Architecture (Cuda), Open Computing Language (OpenCL) and Microsoft’s DirectCompute have
been developed to serve as a platform for general purpose programming on the GPU (GPGPU),
to enable massively parallel programs.
Utilising the immense GPU power is not a trivial task. The execution model of the GPU is
becoming more and more flexible, but being a SIMD model puts restrictions on utilisation.
Data-parallel algorithms that have a simple execution path and high arithmetic intensity are
usually well suited for processing by the GPU architecture. However, there are indications that
other algorithms, which do not share these characteristics, can in fact be optimised in such a
way that they are accelerated by the GPU.
The huge performance offered at a relatively low cost makes it interesting to find out how
this power can be harnessed. In this project, I will look into how the linear algebra operation
matrix-multiplication and the decompositions LU and QR can be implemented and optimised
on the Nvidia Cuda architecture.
1.2 Reading guide
This report is addressed to readers with an interest in GPGPU. The report assumes that the
reader has knowledge of C and development experience. Knowledge of linear algebra and
the algorithms would be beneficial.
References in the report are shown in the text as [number], and the reference can be found
in the bibliography and references list.
The report is divided into 11 chapters.
The first chapter describes the purpose and the goals of the project.
The second chapter gives a short introduction to linear algebra and the history of GPU
computing; readers can skip this chapter.
The third chapter describes the parallel platforms Cuda and GPU.NET. Sections 3.1.4
and 3.1.5 are the most important.
Chapter 4 focuses on the hardware platform and its influence on performance. The different
development and test systems are described, as is the importance of the chipset’s North
Bridge. Section 4.2.2 describes the CGMA term, and an
analysis is performed.
Chapter 5 describes the development environment together with some design decisions. The
most important section is 5.3, which holds the optimisation strategy used later.
Chapter 6 analyses the matrix-multiplication algorithm, and describes its implementation
and optimisations. The results of the improvements are found throughout the chapter, and an
evaluation can be found in section 6.4.
Chapter 7 deals with LU-decomposition. The chapter describes the algorithm, its
implementation and optimisation, along with the test results. An evaluation can be found in
section 7.4.
Chapter 8 looks into QR-decomposition. The algorithm is described and analysed, after which
it is implemented, improved and tested. Results are found throughout the chapter, and an
evaluation can be found in section 8.5.
Chapter 9 tries to summarise results from all three algorithms, and compare them with the
initial optimisation strategy. This chapter is important and serves as an evaluation and
conclusion.
Chapter 10 looks at the work done so far, and discusses possible extensions to the project. A
broader perspective is also discussed, looking into Cuda, hardware and GPGPU in general.
Chapter 11 is solely the conclusion to the problem definition; for an evaluation of the
project’s results, please refer to chapter 9.
1.3 Problem definition
The purpose of this project changed during the project period. Initially the focus was
firstly to identify linear algebra algorithms suitable for implementation on the parallel Cuda
architecture, and secondly, in the process of implementing these algorithms, to understand
the Cuda architecture. In practice, my supervisor Peter Sestoft and I agreed to focus on three
frequently used linear algebra algorithms: matrix-multiplication, LU-decomposition and
QR-decomposition. By having these algorithms selected in advance, this project can focus on the
core objective, to uncover characteristics and features of Cuda.
The following statement and the elaborating points reflect this clarification:
Firstly, implement linear algebra algorithms for matrix-multiplication,
LU-decomposition and QR-decomposition and evaluate their performance on the
parallel GPU Cuda architecture. Secondly, analyse, test and describe
different optimisation techniques relevant to the Cuda implementations, and
furthermore describe how they may be used in general.
• Describe the linear algebra algorithms and their characteristics.
• Describe the Cuda architecture and development platform, uncovering
characteristics, features, problems and a future outlook.
• Analyse and implement the linear algebra algorithms on the Cuda platform. Describe
how an implementation can be performed including any benefits as well as
limitations.
• Analyse and document optimisation techniques for the algorithm implementations on
Cuda.
• Perform correctness and performance tests.
Learning goals
• Knowledge of linear algebra and linear algebra algorithms
• Understanding of the Cuda architecture and platform
• Obtaining skills in C/C++ and Cuda C
• Ability to implement linear algebra algorithms using C/C++ and C for Cuda
1.4 Method
1. Study literature on linear algebra, C, C++ and Cuda architecture development.
2. Implement basic versions of linear algebra algorithms in C/C++ and C for Cuda using
Visual Studio and Nvidia Nsight. Develop tests and benchmarks and compare results
with comparable CPU implementations.
3. Implement optimisations for the algebra algorithms and compare results with CPU
implementations.
As mentioned before, this thesis builds on the experiences and results of the project
“Documentation of the GPUs usability in advanced parallel calculations” [1]. One of the goals
of that project was to uncover how the GPU could be utilised from .NET. This is not a specific
goal for this thesis; however, I regard it as an important perspective.
During the thesis research period, I discovered GPU.NET by TidePowerd, a framework and tool
whose main feature is to bridge Cuda and .NET. In this project, GPU.NET will be used, where
it makes sense, to compare algorithm implementations and their performance with the pure
Cuda implementations.
It will be interesting to see how GPU.NET performs compared to pure Cuda C/C++, and
furthermore, whether GPU.NET is easier to use.
The correctness of the algorithms in both GPU.NET and Cuda will be tested by comparing their
results with results computed by the CPU.
1.5 Scope
Not all areas of this project can be analysed and documented; prioritising is important so that
the parts that are covered are treated in adequate depth.
1.5.1 Algorithms
This project is an empirical study that should document implementation, optimisation and
performance of existing linear algebra algorithms on the Cuda platform. It is not part of this
project to develop new algorithms, but merely to base the testing on existing ones. The algorithms
selected are designed for dense linear algebra, and are well known and well documented.
1.5.2 Numerical stability
A numerical stability analysis of the different algorithms is outside the scope of this project.
The algorithms are well known, and are well documented in terms of numerical stability.
That said, the applicability of an algorithm implementation obviously depends on it delivering
a correct result. All algorithms are implemented for both the GPU and CPU, and tests are
performed on both platforms to compare the results. The maximum difference between the results
indicates how precisely the GPU implementation performs compared to the CPU
implementation.
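The comparison described above can be sketched as a small helper. This is a minimal illustration in C, not the project's actual test code; the function name max_abs_diff and the flat array layout are assumptions.

```c
#include <stddef.h>

/* Largest absolute element-wise difference between a CPU reference result
   and a GPU result, both stored as flat arrays of n floats. */
float max_abs_diff(const float *cpu, const float *gpu, size_t n)
{
    float max = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float d = cpu[i] - gpu[i];
        if (d < 0.0f) {
            d = -d;          /* absolute value without needing math.h */
        }
        if (d > max) {
            max = d;
        }
    }
    return max;
}
```

A result of zero means the two implementations agree exactly; a small non-zero value is to be expected for single-precision arithmetic, and an acceptable threshold has to be chosen per algorithm and matrix size.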
1.5.3 IEEE 754 and double-precision
Devices with Cuda support double-precision floating point operations from Compute
Capability (CC) version 1.3 and higher. Support for double-precision operations is not a
common denominator of the development and test computers; furthermore, double-precision
operations impact performance significantly, which is why I chose to implement the algorithms
using single-precision floating point operations.
The IEEE 754 standard for floating point arithmetic is supported and followed by Cuda, but
there are documented deviations from the standard [2]. For instance, the FMAD (multiply-add)
instruction loses precision by combining two operations into one instruction, and there are
other deviations. These deviations will influence the correctness test, but because
performance is prioritised higher than exact precision, I will not take any specific
precautions, as they would impact performance. The maximum differences from the correctness
test will give an indication as to how these deviations from the IEEE 754 standard may affect
precision.
1.5.4 BLAS
Basic Linear Algebra Subprograms (BLAS) is an interface for linear algebra operations, and it
offers optimised operations for vectors and matrices. Many linear algebra algorithms are
designed on the basis of these operations, but the implementations in this project will not use
any BLAS API, even though Cublas would be the obvious choice.
To really uncover the architecture’s capabilities it is necessary to experiment with it directly.
For that reason I implement all algorithms without the use of such math libraries. This
means that the full performance potential of the algorithms will not be achieved, but it will
give better insight.
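For illustration, a direct CPU implementation of matrix-multiplication without any BLAS call could look as follows. This is a minimal sketch, not the project's code; the function name matmul_naive and the row-major storage of the matrices are assumptions made for the example.

```c
#include <stddef.h>

/* C = A * B for row-major matrices:
   A is m x k, B is k x n and C is m x n. */
void matmul_naive(const float *a, const float *b, float *c,
                  size_t m, size_t k, size_t n)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            /* Dot product of row i of A and column j of B. */
            for (size_t p = 0; p < k; ++p) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

A plain triple loop like this is far from what an optimised BLAS routine achieves, but it exposes every memory access and arithmetic operation, which is the point of working without a library.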
2 Background
This chapter serves as an introduction to the ideas that are used throughout the report.
First, linear algebra is briefly described, followed by the concept of parallel computing
in relation to the GPU.
2.1 Linear algebra
Hermann Grassmann is known as the inventor of linear algebra [3]. He did not invent and
describe the entire field, but recognised linear algebra as a formal theory. In his two editions
of “Ausdehnungslehre”, he describes some important ideas that helped define the basis of
linear algebra as it is known today.
Linear algebra is a term that covers several different topics and binds them together. Some of
these topics are systems of linear equations, linear transformations, vectors, matrices,
determinants and vector spaces.
Frequently used in linear algebra are matrix-multiplication and LU- and QR-decomposition,
each of which serves a purpose in solving either a system of linear equations or a linear least
squares problem. These problems are in turn often encountered in the fields of research,
engineering, physics, economics and statistics.
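As a brief reminder of how the two decompositions are typically used, the standard textbook formulation is given below. It is stated without pivoting and, for the least squares case, assumes that $A$ has full column rank; it is not taken from the report itself.

```latex
\begin{align*}
A = LU:&\quad Ax = b \;\Longrightarrow\; Ly = b,\quad Ux = y \\
A = QR:&\quad \min_x \lVert Ax - b \rVert_2 \;\Longrightarrow\; Rx = Q^{T} b
\end{align*}
```

Here $L$ is lower triangular, $U$ and $R$ are upper triangular, and $Q$ has orthonormal columns, so each system on the right is solved by forward or back substitution.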
2.2 GPU computing
The performance and capabilities of graphics processors have gone through an incredible
development from the beginning until today. From the command-line-based user
interfaces of the 1980s, to the more graphically driven interfaces of the 1990s and all the
way to today, graphics processing power has increased and has evolved to support 2D, 3D,
OpenGL, DirectX and more [4]. The release and popularity of 3D computer games further
accelerated the demand for and development of more and more powerful graphics processors.
On 31st August 1999, Nvidia released the GeForce 256, the world’s first GPU [5],
a chip that could perform graphical computations directly on the graphics processor. ATI, the
main competitor of Nvidia, soon followed by releasing their Radeon R100 chips with the same
capabilities.
But it was not until 2001 that the major breakthrough in relation to GPU computing came.
Nvidia released the GeForce 3, the first chip to support Microsoft’s DirectX 8 standard, which
required the chip to support programmable vertex and pixel shaders. ATI followed with the
release of their Radeon R300 chip in 2002.
Programmable vertex and pixel shaders were intended solely for graphics rendering; however,
they were actually small programs that performed a programmed computation on some input
and then returned the output. The computational power of the GPU, combined with the
programmable vertex and pixel shaders, made developers look into how the GPU
could solve problems other than graphics rendering.
3 Parallel platforms
The following chapter takes a deeper look at Cuda and GPU.NET, describing usage, features
and performance-limiting factors. GPU.NET uses Cuda, so most effort is spent on describing
and analysing Cuda.
3.1 Cuda
Cuda stands for Compute Unified Device Architecture and is a generic term covering the GPU
architecture of Nvidia’s graphic cards, development platform and tools. It can be described as
a parallel computing architecture and development platform that enables the GPU to solve
general purpose computational problems [4].
3.1.1 History
The first programmable GPU was released in 2001, and in the early stages the only way to access the GPU
was through a graphics API, such as OpenGL or DirectX. This meant that general use of the
GPU was difficult. Nvidia saw the potential of the GPU as another computing platform, and
initiated the development of a completely new architecture. This architecture was to
overcome the limitations of earlier GPUs by allowing General-Purpose computation on
Graphics Processing Units (GPGPU) without the need to use a graphics API.
In 2006 Nvidia released the GeForce 8800 GTX, the first GPU to support the Cuda architecture.
Later, in June 2007, the first version of the Cuda development toolkit was released. Over the
years the architecture and toolkit have undergone development and improvements, with the
latest toolkit released in May 2011.
3.1.2 Version
The latest version of Cuda when this project was initiated was version 3.2, released in
November 2010. Much has happened since, and the current toolkit version, as of May
2011, is version 4.0.
Some of the new features include “Share GPUs across multiple threads”, “Use all GPUs in the
system concurrently from a single host thread” and “No-copy pinning of system memory, a
faster alternative to cudaMallocHost()”. Even though the last feature is interesting, none of
these newly added features bring any major benefit to this project, so any upgrade during the
project phase was deemed unnecessary. Hence, version 3.2 is used throughout this project
and report.
3.1.3 Cuda program
A Cuda program is a hybrid between code processed by the CPU and code processed in
parallel by the GPU. The CPU is referred to as the host, and the CPU code is called host code.
The GPU is referred to as the device, and its code is, unsurprisingly, called device code.
A typical example of a Cuda program is shown in the following. The host code is written in
standard C or C++ as shown here:
1.  // Declare device pointers
2.  int *d_base, *d_n, *d_out;
3.  int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
4.
5.  // Allocate memory on device
6.  cudaMalloc( (void**)&d_base, N * sizeof(int) );
7.  cudaMalloc( (void**)&d_n, N * sizeof(int) );
8.  cudaMalloc( (void**)&d_out, N * sizeof(int) );
9.
10. // Copy data from host -> device
11. cudaMemcpy( d_base, base, N * sizeof(int), cudaMemcpyHostToDevice );
12. cudaMemcpy( d_n, n, N * sizeof(int), cudaMemcpyHostToDevice );
13.
14. // Execute kernel
15. power<<<blocks, THREADS_PER_BLOCK>>>( d_base, d_n, d_out, N );
16.
17. // Copy data from device -> host
18. cudaMemcpy( out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost );
19.
20. // Free memory on device
21. cudaFree( d_base );
22. cudaFree( d_n );
23. cudaFree( d_out );
24.
25. // Let the Cuda runtime know that we are finished
26. cudaThreadExit();
The device code is structured in a function called a kernel, as shown here:
// Device kernel called power
__global__ void power( int *base, int *n, int *output, int N ) {

    // The unique thread id
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Guard, only if thread has actual data to process
    if (tid < N) {

        // Initialise registers with values from the input arrays
        int p = 1, b = base[tid], num = n[tid];

        // Compute p = b^num
        for (int i = 1; i <= num; ++i)
            p = p * b;

        // Write result to output array
        output[tid] = p;
    }
}
The host sends commands and messages to the device by invoking functions. A sequence
diagram, based on the program illustrated above, is shown in Figure 1.
Figure 1 - Cuda program sequence diagram
Lines 6 to 8 in the host code allocate memory on the device, lines 11 and 12 copy data to the
device, and the kernel is invoked in line 15. When the kernel is done processing, the host
copies the data back from the device memory in line 18, after which the memory is released
again in lines 21 to 23.
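The launch configuration and the guard in the kernel can be checked on the host without a GPU. The following is a minimal sketch in plain C, not part of the original program: it computes the number of blocks with the same rounding-up division as line 3 of the host code, and then visits every (block, thread) pair serially in the way the kernel indexes them. The function names blocks_for and power_emulated are hypothetical.

```c
/* Number of blocks needed to cover count elements (rounding-up division). */
int blocks_for(int count, int threads_per_block)
{
    return (count + threads_per_block - 1) / threads_per_block;
}

/* Serial emulation of the power kernel: every (block, thread) pair is
 * visited in turn, and the guard skips threads that have no data. */
void power_emulated(const int *base, const int *n, int *output,
                    int count, int threads_per_block)
{
    int blocks = blocks_for(count, threads_per_block);

    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threads_per_block; ++thread) {
            /* Same index computation as threadIdx.x + blockIdx.x * blockDim.x */
            int tid = thread + block * threads_per_block;

            if (tid < count) {            /* same guard as in the kernel */
                int p = 1;
                for (int i = 1; i <= n[tid]; ++i)
                    p = p * base[tid];
                output[tid] = p;
            }
        }
    }
}
```

With 1025 elements and 256 threads per block, blocks_for returns 5, so the last block contains 255 threads that are stopped by the guard.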
3.1.4 Architecture
Cuda is an architecture consisting of both the physical layout of the GPU and the logical
structure of threads in the Cuda runtime. The exact physical layout and specifications differ
for different chip versions¹, and the capabilities of these chips are defined by the Compute
Capability version (CC). The first chips were released with CC 1.0, and the latest version is
2.1.
Physical layout
The GPU shown in Figure 2 is a simplified G86 GPU with CC version 1.1. It consists of four
streaming multiprocessors (SM) and each SM contains 8 streaming processors (SP) and two
special function units (SFU). In an SM, the SPs process normal instructions like add and
multiply, and in this case 8 SPs are able to process 8 normal instructions per clock cycle. The
SFU processes instructions related to square root, sine and cosine, logarithmic and
exponential, so a kernel with heavy usage of these instructions will be limited to only two
instructions per clock cycle.
A SM has, beside the SP and SFU, also access to different memory types. The register and
shared memory are limited in size, but very fast. They are in addition to this local to each SM.
The register is the fastest memory type, and local to a thread, and the number of 32-bit
registers of a SM with CC 1.1 is 8192, or 8K. The shared memory is a bit slower, but there is
more of it and it is shared between the threads in a block. The shared memory size of a SM
with CC 1.1 is 16KB.
All SMs of a GPU have read access to the constant memory, which for all current CC
versions is 64KB. Access to the constant memory is cached and generally faster than access to the
global memory, which is the device’s main memory. All SMs have shared access to it, and the
exact size and speed are device dependent.
¹ For instance, G80 allows 768 threads per multiprocessor, whereas GT200 allows 1024.
Figure 2 - Cuda architecture with four multiprocessors
Memory
As described above, there are different memory types, but the common denominator is that
they are all typically based on dynamic random access memory (DRAM). Accessing a single bit
in a DRAM cell is a slow process, and to improve performance DRAM controllers read several
consecutive bits in parallel [4]. This means that actual random access to DRAM memory will
yield a low performance. So to achieve the highest memory performance possible, the kernel
should access consecutive memory locations, as much as possible. This is also called
coalesced memory access.
Accessing memory in a coalesced manner is important for all memory types. This also holds for
shared memory even though it is on-chip and fast, and in addition to this, access to shared
memory should also minimise bank conflicts. Shared memory on CC version 1.1 has 16 banks,
and the bandwidth of each bank is 32 bits per two clock cycles. If two or more threads access
the same bank, the accesses are handled sequentially, which reduces performance. From
CC version 2.0 and higher, simultaneous access to the same bank has been optimised. Multiple
reads of the same address in a bank result in only a single read being performed, after which
the value is broadcast to all threads.
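To make the notion of a bank conflict concrete, the following is a small sketch in plain C of a simplified bank model. It assumes that consecutive 32-bit words are assigned to consecutive banks and that conflicts on CC 1.x are resolved per half-warp of 16 threads; neither detail is stated above, and the model deliberately ignores any broadcast mechanism. The names bank_of and conflict_degree are hypothetical.

```c
#define BANKS     16   /* shared memory banks on CC 1.1, as stated in the text */
#define HALF_WARP 16   /* assumption: conflicts are resolved per half-warp */

/* Bank that holds a given 32-bit word of shared memory
 * (assumption: consecutive words lie in consecutive banks). */
int bank_of(int word_index)
{
    return word_index % BANKS;
}

/* Worst-case number of threads in a half-warp that hit the same bank
 * when thread t accesses word t * stride. A result of 1 means the
 * access is conflict-free; a result of k means a k-way conflict. */
int conflict_degree(int stride)
{
    int hits[BANKS] = {0};
    int worst = 0;

    for (int t = 0; t < HALF_WARP; ++t) {
        int b = bank_of(t * stride);
        hits[b] += 1;
        if (hits[b] > worst)
            worst = hits[b];
    }
    return worst;
}
```

In this model a stride of 1 or 3 words is conflict-free, a stride of 2 gives a 2-way conflict, and a stride of 16 sends all 16 threads to the same bank.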
Cuda threads
Cuda threads are very different from CPU threads; the only similarity is the fact that they
process data in parallel. The GPU can be classified as SIMD, which makes its applicability
differ from that of a CPU. A task, or a set of instructions, can be performed on different data
in parallel, but SIMD means that two independent tasks cannot be performed in parallel
by the GPU. Furthermore, threads in Cuda are very lightweight compared to threads on the
CPU. A typical Cuda program uses several thousand Cuda threads, and thread generation and
scheduling should therefore not be considered a limitation.
Cuda threads are organised into a block, and the threads in a block can share memory and be
synchronised. The blocks are in turn organised into a grid. This logical thread structure
allows threads to be organised in several dimensions, so that the structure of the threads
can directly match a specific data structure; a matrix, for instance, is defined in two
dimensions.
For more details about how threads are organised and how this affects their use in a kernel, please
refer to appendix B.
Thread scheduling
The presumption is that the threads of a block are grouped and processed in parallel. This is
conceptually true, but it is not what actually happens. The current implementation of
thread scheduling in e.g. G80 and GT200 chips schedules threads in units called warps.
A warp is a bundle of 32 threads being executed in parallel, and a block with for instance 128
threads is partitioned into 4 warps.
The threads of a warp share a single instruction stream, hence Cuda is a SIMD architecture. This is a design
decision to reduce hardware cost and to enable optimisation techniques, and it is not
without relevance to the developer.
The size of a warp has direct impact on the recommended size of blocks. Consider the
example where a problem is organised into 20 blocks each with 10 threads, giving a total of
20 x 10 = 200 threads. Cuda executes 32 threads in a warp in parallel. In the example above,
only 10 threads are available per block. Cuda will in this case fill up the warp with 22 empty
threads, resulting in 20 x 22 = 440 empty threads being created. It is advisable to set the
block size to a multiple of the warp size, currently 32 [4].
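The arithmetic of the example above can be written as a short sketch in plain C. The function names warps_per_block and padding_threads are hypothetical; the only input taken from the text is the warp size of 32.

```c
#define WARP_SIZE 32   /* current warp size, as stated in the text */

/* Number of warps needed to hold one block (rounded up). */
int warps_per_block(int threads_per_block)
{
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Total number of empty threads created when every block is
 * padded up to a whole number of warps. */
int padding_threads(int blocks, int threads_per_block)
{
    int padded_block = warps_per_block(threads_per_block) * WARP_SIZE;
    return blocks * (padded_block - threads_per_block);
}
```

For 20 blocks of 10 threads, padding_threads returns 440, matching the example; for any block size that is a multiple of 32 it returns 0.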
Occupancy and latency
A SM with a CC of version 1.1 is able to handle 768 residing threads, and as the current warp
size is 32, the maximum number of residing warps per SM is 24. The actual number is
dependent on the kernel’s consumption of registers and shared memory. The Cuda occupancy
is, for a given kernel, the ratio of active warps to the maximum number of warps supported
by the SM. In other words, the occupancy indicates how many active warps and threads a SM
can hold.
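How register and shared memory consumption limit the occupancy can be estimated with a simplified model, sketched below in plain C. It uses the CC 1.1 limits given in this chapter (768 threads, 8192 registers and 16KB of shared memory per SM). The limit of 8 resident blocks per SM is an assumption not stated in the text, and the model ignores the allocation granularity of the real hardware, so it should be read as a rough estimate rather than a replacement for Nvidia's occupancy calculator. All names are hypothetical, and the block size is assumed to be between 1 and 768.

```c
#define WARP_SIZE           32
#define MAX_THREADS_PER_SM  768     /* CC 1.1, from the text */
#define REGISTERS_PER_SM    8192    /* CC 1.1, from the text */
#define SHARED_BYTES_PER_SM 16384   /* CC 1.1, from the text */
#define MAX_BLOCKS_PER_SM   8       /* assumption, not stated in the text */

static int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Estimated number of active warps per SM for a given kernel configuration. */
int active_warps(int threads_per_block, int regs_per_thread,
                 int shared_bytes_per_block)
{
    int warps_per_block = (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
    int padded_threads = warps_per_block * WARP_SIZE;

    /* Each resource allows a certain number of resident blocks;
     * the scarcest resource decides. */
    int blocks = MAX_BLOCKS_PER_SM;
    blocks = min_int(blocks, MAX_THREADS_PER_SM / padded_threads);
    if (regs_per_thread > 0)
        blocks = min_int(blocks,
                         REGISTERS_PER_SM / (regs_per_thread * padded_threads));
    if (shared_bytes_per_block > 0)
        blocks = min_int(blocks, SHARED_BYTES_PER_SM / shared_bytes_per_block);

    return blocks * warps_per_block;
}

/* Occupancy in percent: active warps relative to the 24 warps an SM can hold. */
int occupancy_percent(int threads_per_block, int regs_per_thread,
                      int shared_bytes_per_block)
{
    int max_warps = MAX_THREADS_PER_SM / WARP_SIZE;
    return 100 * active_warps(threads_per_block, regs_per_thread,
                              shared_bytes_per_block) / max_warps;
}
```

In this model a kernel with 256 threads per block and 10 registers per thread reaches 100% occupancy, whereas the same kernel with 16 registers per thread is limited to two resident blocks and drops to 66%.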
The number of clock cycles it takes for a warp to be ready to execute its next instruction is
called latency. Some instructions incur high latency, for instance global memory access,
which takes many clock cycles before the data is supplied. The execution of a warp does not halt
due to a memory access; execution continues until the data is actually needed, and if the data
is still unavailable the scheduler switches to another warp. Whenever a warp incurs latency,
the SM should switch to another warp and start processing to achieve full utilisation of the
SM. So the Cuda occupancy ratio can indicate if the performance of a kernel suffers from high
latency. Vasily Volkov [6][7] has, however, shown that high performance is not necessarily
equal to a high occupancy, so improvements based solely on the occupancy ratio should be
carefully evaluated.
Optimisation that aims at full utilisation of the SM is called latency hiding.
3.1.5 Limitations
Using Cuda can be advantageous, but there are also limitations that influence the
implementation and optimisation of algorithms. In the following I describe a couple that are
relevant to this project.
One should be aware that the Cuda architecture was developed for speed at the expense of
precision. There can, for that reason, be a higher numerical instability of an algorithm
implemented on the GPU, when compared to the same algorithm implemented on the CPU.
For example, the operations multiply and add can be contracted to a single FMAD (multiply-add)
instruction, which specifically deviates from the IEEE 754 standard. FMAD instructions
are for instance often used in linear algebra algorithms to calculate dot-products, vector
norms and more. Nvidia has been focusing on this, and later CC versions should comply
better with the IEEE 754 standard.
Latency, warps and memory access, described in the architecture chapter 3.1.4, are all factors
influencing the computational performance. One should therefore not expect to reach close
to the theoretical performance of a device, as these factors will limit the performance of an
algorithm. The theoretical properties can, on the other hand, be of assistance in the analysis
and optimisation of a kernel.
Cuda C is an extended version of ANSI C, and is the language in which the device code or
kernels are written. A kernel function is able to call other device functions, but recursion is
currently not supported.
Cuda is developed by Nvidia, and can only be used on Nvidia GPUs.
3.2 GPU.NET
The framework GPU.NET consists of a runtime and a compiler, which are integrated with
Visual Studio. This framework makes it possible to develop host and kernel code directly in
.NET with all the benefits that .NET and Visual Studio provide.
The version being used for this project is GPU.NET v1.0.3.5.
3.2.1 Overview
GPU.NET currently only supports Cuda, but is expected to support other parallel architectures in
the future. GPU.NET allows a developer to write host and device code directly in .NET using
the API from the provided assembly, thereby running computations on hardware-accelerated
architectures.
Accelerating .NET code is achieved in two steps; first the .NET code is written, decorated and
compiled, then the GPU.NET runtime accelerates the program during execution, as shown in
Figure 3.
Figure 3 - How GPU.NET works as described on TidePowerd.com
3.2.2 Development
Visual Studio 2010 is supported for development as well as .NET 4. The kernel is annotated
with a KernelAttribute that also holds the name of the CPU method to be used if no
acceleration hardware is present. ThreadIndex and BlockIndex hold the same values as when
used in Cuda directly.
[Kernel(CustomFallbackMethod = "MatrixMultiplication_CPU")]
private static void MatrixMultiplicationSimpleNS_GPU(float[] a, float[] b, float[] c, int aheight, int awidth, int bwidth)
{
    // Thread ID
    int tid = ThreadIndex.X + BlockIndex.X * BlockDimension.X;

    if (tid < aheight)
    {
        for (int j = 0; j < bwidth; ++j)
        {
            float sum = 0;

            for (int k = 0; k < awidth; ++k)
            {
                float av = a[tid * awidth + k];
                float bv = b[k * bwidth + j];
                sum += av * bv;
            }
            c[tid * bwidth + j] = sum;
        }
    }
}
The kernel returns void and is private, and can therefore not be called directly. Another
public and static method is created, which calls the kernel as shown in line 5.
1. public static float[] MatrixMultiplicationSimpleNS(float[] a, float[] b, int aheight, int awidth, int bwidth)
2. {
3.     var c = new float[aheight * bwidth];
4.
5.     MatrixMultiplicationSimpleNS_GPU(a, b, c, aheight, awidth, bwidth);
6.
7.     return c;
8. }
The .NET code is compiled to a normal assembly, in which the GPU.NET compiler then injects
calls to the GPU.NET runtime. The result is a modified .NET assembly where calls to any
kernel method are being redirected to the GPU.NET runtime, and hence the GPU.
3.2.3 Execution
When the program is being executed, the GPU.NET runtime detects the availability of any
supported hardware. When a call to a kernel is detected, the kernel code is then passed to
the correct vendor plug-in, which in turn JIT compiles the code to the hardware vendor’s
instruction set architecture. Lastly the runtime executes the compiled device code and
transfers any data back to the .NET runtime. If no hardware acceleration is present, then the
CPU version of the kernel is called.
3.2.4 Limitations and bugs
GPU.NET is a relatively young framework that contains obvious bugs and limitations.
Version 1.0.3.5 contains the following bugs and limitations:
• There is currently only support for Nvidia Cuda v3.0 and newer, but support for AMD
devices is under development.
• Local variables and parameters can only be of primitive types. Parameters can in
addition to that, also be an array of a primitive type.
• Shared memory can only hold fields which are primitive types or a single dimensional
array of primitive types.
• Kernels must be static and return void and cannot be recursive or call any other
methods.
I have experienced problems with casting variables in the kernel. Casting variables from
double to float, or even float to float resulted in compile errors. My conclusion is that casting
is not supported and should be avoided. So it is necessary to design an algorithm exclusively
for either single- or double precision.
The shared memory size of a kernel must be specified at compile time; in Cuda C it can be
specified dynamically at run time. Implementing kernels optimised for different data sizes is
therefore difficult, which has led me to set the size of the allocated shared memory high.
This makes sure that kernels will run with different data sizes, but it is not optimal: shared
memory is a scarce resource, and allocating more than needed lowers occupancy.
Occasionally when a GPU.NET application was executed for the first time, an exception was
thrown. Subsequent executions were processed with no problems. Furthermore, due to a
thread bug a GPU.NET program does not exit by itself. I have solved this by terminating the
process by calling:
System.Environment.Exit(0);
The last two bugs are expected to be fixed in newer releases.
3.2.5 Evaluation
GPU.NET can definitely be used for testing and playing with GPU acceleration of programs.
But one should, with the current version, expect bugs and minor problems; the framework is
far from mature at this point.
Furthermore, by using JIT compilation GPU.NET will incur a performance hit compared to
Cuda, as Cuda kernels are already compiled before run time. This is a design decision made by
TidePowerd, a decision that makes the framework flexible at the expense of performance.
GPU.NET does however cache the JIT compiled kernel in-memory, so subsequent calls to the
same kernel will not incur the same performance hit.
4 Hardware platform
The parallel architecture software has been described above, but there are also hardware
specifications that are important to the performance of Cuda. The following chapter will
analyse hardware specifications and perform two simple tests.
Cuda requires Nvidia GPUs, so all development and test computers must be equipped with an
Nvidia GPU. I used two development and two test computers for this project. All machines are
running Windows 7, and have Cuda v3.2 installed, including the matching drivers. The
following table shows selected hardware specifications for the platforms.
System | #1 | #2 | #3 | #4
Type | Development | Development | Testing | Testing
Graphics bus | 8.00 GB/s (PCI-E v2.0) | 4.00 GB/s (PCI-E v1.1) | 8.00 GB/s (PCI-E v2.0) | 4.00 GB/s (PCI-E v1.1)
Host memory | 16.60 GB/s (DDR3) | 6.23 GB/s (DDR2) | 16.60 GB/s (DDR3) | 6.25 GB/s (DDR2)
GPU | GeForce 9400m (G86) | GeForce 8800 GS (G92) | Tesla C1060 (GT200) | GeForce GT440 (GF108)
Cores | 16 | 96 | 240 | 96
Shader clock | 1100 MHz | 1250 MHz | 1300 MHz | 1645 MHz
Device memory | 8.00 GB/s² (DDR3) | 37.45 GB/s (GDDR3) | 102.40 GB/s (GDDR3) | 51.20 GB/s (GDDR5)
Processing power in gigaflops (MUL+ADD+SF) | 51.56 | 351.56 | 914.06 | 462.66
Processing power in gigaflops (MUL+ADD) | 34.38 | 234.38 | 609.38 | 308.44
Compute Capability | 1.1 | 1.1 | 1.3 | 2.1
Table 1 - Hardware specifications for the four platforms
² The actual memory speed is 16.60 GB/s, but system #1 has no dedicated device memory and uses host memory, so the device is limited by the speed of the graphics bus.
Two theoretical peak performance figures, in gigaflops, are stated for the GPU of each system.
The first is based on the GPU architecture design, in which each core is capable of
performing a Multiply-Add instruction dual-issued with a special function instruction per
operation cycle, giving three floating-point operations per cycle. The second is
based on a more realistic estimate, in which an operation cycle performs only a Multiply-Add
instruction, giving two floating-point operations per cycle.
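The two theoretical figures can be reproduced from the number of cores and the shader clock. The following plain C sketch is an illustration only: the function name peak_gigaflops is hypothetical, and the divisor of 1024 is not stated in the text but is inferred here because it reproduces the Table 1 values for system #3 (914.06 and 609.38 gigaflops).

```c
/* Theoretical peak in gigaflops: every core performs flops_per_cycle
 * floating-point operations per shader clock cycle.
 * Assumption: megaflops are converted to gigaflops by dividing by 1024,
 * which is inferred from the values in Table 1. */
double peak_gigaflops(int cores, double shader_clock_mhz, int flops_per_cycle)
{
    return cores * shader_clock_mhz * flops_per_cycle / 1024.0;
}
```

For system #3, peak_gigaflops(240, 1300.0, 3) gives about 914.06 for MUL+ADD+SF, and peak_gigaflops(240, 1300.0, 2) gives about 609.38 for MUL+ADD.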
For a detailed description of the different platforms, please refer to appendix C.
4.1 Analysis
This section digs a little deeper into the important hardware specifications and
describes the theoretical performance limits. When dealing with GPUs the important factors
are memory transfer rates and GPU processing power. Figure 4 shows a simplified chipset
diagram, which highlights the important elements, namely the processor, DDR RAM, GPU and
the chipset’s north bridge.
Figure 4 – Simplified diagram of a chipset
In Table 1, the graphics bus indicates the maximum transfer rate between the north bridge
and the GPU device. Host memory is the peak bandwidth between the DDR RAM and the north
bridge. A chain is no stronger than its weakest link, and the same holds for data transfer
between the host and device, and vice versa. Consider platform #1: the bandwidth of the host
memory is 16.60 GB/s, but the graphics bus is limited to 8.00 GB/s, which then is the
theoretical peak transfer rate between the GPU device and the host system.
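The weakest-link argument amounts to taking the minimum of the two bandwidths. A minimal sketch in plain C, with hypothetical function names, could look like this:

```c
/* Theoretical peak transfer rate between host and device in GB/s:
 * the slower of the graphics bus and the host memory decides. */
double host_device_peak(double graphics_bus_gbs, double host_memory_gbs)
{
    return graphics_bus_gbs < host_memory_gbs ? graphics_bus_gbs
                                              : host_memory_gbs;
}

/* Lower bound on the time in seconds needed to move a given amount
 * of data at a given bandwidth. */
double transfer_seconds(double gigabytes, double bandwidth_gbs)
{
    return gigabytes / bandwidth_gbs;
}
```

For platform #1, host_device_peak(8.00, 16.60) returns 8.00 GB/s, so moving 4 GB cannot take less than 0.5 seconds even in theory.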
4.2 Benchmarking
The specifications are theoretical, and to give a more realistic performance target I have
tested actual data transfer and processing power.
4.2.1 Memory performance
All systems have been tested with three types of memory transfers; from host to device,
device to host and device to device. To improve performance of memory transfer, host
memory can be defined as page-locked (pinned) and write-combined. Pinned and write-combined
memory is a scarce resource of the operating system. A Cuda program should
therefore consume this operating system memory resource with caution, as it could
impact overall system performance. Bandwidth measurements, shown in the following table,
are performed with both regularly paged memory, and with pinned and write-combined
memory (P+WC). For more details please refer to appendix B.
System | #1 | #2 | #3 | #4
Host -> Device | 1,584.5 MB/s | 1,434.7 MB/s | 4,233.7 MB/s | 1,578.6 MB/s
Host -> Device (P+WC) | 5,224.9 MB/s | 2,513.1 MB/s | 5,761.9 MB/s | 2,509.8 MB/s
Device -> Host | 1,365.9 MB/s | 1,178.0 MB/s | 3,864.3 MB/s | 1,235.1 MB/s
Device -> Host (P+WC) | 5,096.2 MB/s | 1,687.9 MB/s | 5,297.6 MB/s | 1,857.9 MB/s
Device -> Device | 6,935.4 MB/s³ | 28,525.7 MB/s | 73,463.8 MB/s | 21,338.1 MB/s
Device -> Device (P+WC) | 6,951.3 MB/s³ | 28,529.0 MB/s | 73,527.3 MB/s | 21,339.2 MB/s
Table 2 - Measured bandwidth of Cuda memory transfer operations
³ System #1 does not have any dedicated device memory, so the rates from host to device are the relevant ones when a kernel needs to access "device" memory.
The measured transfer rates between host and device are, as expected, faster when using
pinned and write-combined memory. For paged memory transfers the measured bandwidth
is between 17% and 40% of the theoretical bandwidth; the range
increases to between 47% and 72% when pinned and write-combined memory is used. The
results also show that, in general, device to host transfers are slower than their host
to device counterparts.
The memory speed for copying data from host to device and vice versa is mostly important for
hybrid algorithms, meaning an algorithm that solves a problem by using both the CPU and
GPU. An implementation that requires transfer between the host and the device should be
designed with caution, as this would put a restriction on performance. Based on this, the
strategy for the algorithm implementations of this project is to keep the data processing
solely on the GPU, and limit the number of data transfers between host and device. Consider
the worst case memory transfer scenario. System #3 has 4GB of device memory, and if this
GPU was installed in the slowest system, it would be possible to copy all 4GB in about 3.57
seconds.
Global memory access is limited by device memory bandwidth, so the device to device
memory transfer rate is interesting and relevant to the performance of a kernel. The results
are between 41% and 75% of the theoretical limit. Paged or pinned/write-combined memory
does not have any impact as this is an operating system resource, and hence only relevant for
the host memory.
The result of the device to device memory transfer for system #1 is a bit misleading. System
#1 does not have any dedicated device memory, and a device to device transfer rate then
only indicates the peak performance of the DDR3 ram on the host system. Before the data
could actually be processed by the GPU, it would have to pass the north bridge and graphics
bus, which is the same as the host to device memory transfers.
4.2.2 Arithmetic performance
To get a realistic performance target in gigaflops for the GPU, I have created two kernels
each consisting of three operations MUL+ADD+SF. The first kernel is normal in the sense it
reads data to process from global memory, by doing so the kernel is limited by global memory
performance. But as a normal kernel would make computations based on data from global
memory, the peak performance of this kernel can be considered a normal case scenario and
help define performance expectations.
The second kernel does not access global memory, but consists of just the three MUL+ADD+SF
operations. The peak performance in gigaflops of this kernel should be closer to the
theoretical maximum arithmetic performance.
System | #1 | #2 | #3 | #4
MUL+ADD+SF + global read (gigaflops) | 2.19 | 9.70 | 27.87 | 8.18
MUL+ADD+SF (gigaflops) | 15.41 | 60.09 | 62.29 | 29.08
Table 3 - Measured gigaflops performance of GPU
In reality, these kernels do not say anything about the maximum expected performance. An
implementation of an algorithm can be optimised in several ways, and so could these kernels.
However, they do suggest that global memory access is indeed a limiting factor. This factor is
referred to as Compute to Global Memory Access (CGMA).
The first kernel has a memory load from input and a write to output, so two global memory accesses. The number of
operations is three (multiply, add and power). Based on these numbers, the calculated
CGMA is 3/2 = 1.5. Consider system #3, where the measured device memory bandwidth is 73,463.8 MB/s. The
kernel uses single-precision floating-point values of 4 bytes each, so the system is able to transfer about
18,365.95 million single-precision values per second. With a CGMA of 1.5, the peak performance of this
kernel is about 27 gigaflops, which the result from Table 3 also shows. The memory transfer
rates, together with an estimated CGMA, can be an important tool when analysing a kernel for
optimisation.
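The estimate above can be expressed as a small calculation. The following plain C sketch uses hypothetical function names and assumes that one gigaflop equals 1000 megaflops; with that assumption the bound for system #3 comes out at roughly 27.5 gigaflops, in line with the "about 27" stated above.

```c
/* CGMA: floating-point operations per global memory access. */
double cgma(int operations, int memory_accesses)
{
    return (double)operations / memory_accesses;
}

/* Upper bound in gigaflops for a kernel limited by global memory:
 * values transferred per second multiplied by the CGMA ratio.
 * Assumption: 1 gigaflop = 1000 megaflops. */
double bandwidth_bound_gigaflops(double bandwidth_mb_s, int bytes_per_value,
                                 double cgma_ratio)
{
    double mega_values_per_second = bandwidth_mb_s / bytes_per_value;
    return mega_values_per_second * cgma_ratio / 1000.0;
}
```

A kernel that performs more operations per memory access raises the ratio and thereby the bound, which is why reuse of loaded data is a recurring theme in the optimisation of kernels.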
With these results in mind, how do they compare to the performance of a CPU? Consider
that a Pentium 4 3.06 GHz CPU computes a single-precision dot-product at
between 1.8 gigaflops (single thread) and 3.08 gigaflops (multiple threads) [8]. In that light,
even though the results from Table 3 are far lower than the theoretical processing power, it
is evident that even un-optimised kernels could have a similar peak performance or even
higher, when compared to the CPU.
5 Implementation
In the following chapter I describe the development environment and some design decisions,
but more importantly, I form an optimisation strategy used for the algorithms.
5.1 Development environment
The computers used in this project are based on Windows 7, and the Cuda toolkit version 3.2
was the latest release when this project was initiated. Cuda v3.2 natively supports Visual
Studio 2008 (VS2008). It is possible to enable development in Visual Studio 2010, but this has
proven difficult to set up.
Making VS2008 ready for Cuda development is not a trivial task. The compilation process
includes two compilers, the Nvidia Cuda Compiler (NVCC) and Microsoft’s Visual C++ compiler
(VCC). Configuring VS2008 properly and making NVCC, VCC and the linker work together
remained a challenge. Please refer to appendix D for a detailed description of the problems
involved, and for a development model solution.
5.2 Design decisions
The general rule is, the parts of the algorithm that exhibit little or no data parallelism should
be processed by the host, and the parts that exhibit a rich amount of data parallelism should be
processed by the device. Sometimes it is beneficial to process code on the device that cannot
exploit the parallelised architecture. The decisive factors in these situations are the size of
the data, and the time needed to transfer data between the host and the device. The
strategy I will follow is to limit transfer of data between host and device to a minimum [9],
by reducing these transfers to an initial and a final one, like this:
1. Copy data to device
2. Process data on device
3. Copy data to host
This means that these data transfers are not part of the actual algorithm, and when
measuring peak processing power in gigaflops, I only measure the core algorithm execution
time. This means that the initial configuration and data transfer, together with the release of
resources and retrieval of the output, are not measured. By exclusively measuring the
core algorithm, it is possible to directly compare the peak processing power of the GPU with
that of the CPU.
As mentioned earlier, support for double-precision operations is not a common denominator
for the development and test machines. Algorithms are therefore implemented using
single-precision floating point, which is supported by all GPU devices, the CPU and GPU.NET.
5.3 Optimisation
The aim is to implement and optimise three linear algebra algorithms for the Cuda
architecture. The method for doing so is composed of the following steps:
1. Use an existing, well-known and well-documented algorithm, for implementation in
C/C++ for CPU processing.
2. Analyse and port the CPU implementation to Cuda C, while making simple
improvements that exploit the parallelised architecture.
3. Test the implementations.
4. Optimise based on the test results, and test again.
5.3.1 Strategy
The Nvidia paper on “Analysis Driven Optimization” [10] identifies four categories of what can
limit a kernel’s performance: memory throughput, instruction throughput, latency or a
combination of the above. There are some methods that can be helpful in finding the limiting
factors of a Cuda program. To determine if memory throughput is a limiting factor, the CGMA
of a kernel can help determine the theoretical maximum performance of a kernel. When it
comes to instructions, the Nvidia profiler can give valuable information about undesirable
code.
Please refer to appendix E for a description of the Cuda profiler and CGMA.
There exist different optimisation techniques and methods, and some have already been
described in the chapters 3.1.4 and 3.1.5. In the following I will describe methods that form
the optimisation strategy.
Algorithms that process data rely on memory to perform well. Coalescing memory access is
important for all memory types, and in addition to this, shared memory should avoid bank
conflicts as much as possible.
A loop structure in a kernel adds extra control flow instructions, which will consume
arithmetic resources. The organisation of threads in several dimensions can enable the
unrolling of a loop by increasing thread granularity. The compiler already unrolls small loop
structures, but doing it manually can help make a kernel run faster.
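Manual unrolling itself is independent of Cuda and can be illustrated in plain C. The sketch below is not taken from the project code: it shows a dot-product loop and a version unrolled by a factor of four, with a remainder loop for sizes that are not a multiple of four. The function names are hypothetical.

```c
/* Straightforward loop: one comparison and one increment per element. */
float dot_rolled(const float *x, const float *y, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}

/* Unrolled by four: one comparison and one increment per four elements. */
float dot_unrolled4(const float *x, const float *y, int n)
{
    float sum = 0.0f;
    int i = 0;

    for (; i + 3 < n; i += 4) {
        sum += x[i]     * y[i];
        sum += x[i + 1] * y[i + 1];
        sum += x[i + 2] * y[i + 2];
        sum += x[i + 3] * y[i + 3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; ++i)
        sum += x[i] * y[i];

    return sum;
}
```

Both functions add the products in the same order, so they return the same result; only the number of control flow instructions differs.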
Whether the block size should be high or low depends on the kernel, but it should, where
possible, be a multiple of the warp size (currently 32) to avoid empty threads.
Hiding latency of slow instructions can be achieved by reorganising the kernel, exploiting data
prefetching or making sure that the kernel Cuda occupancy is high. Notice that a high
occupancy is not equal to high performance [6][7].
Vasily Volkov has shown that a kernel can increase performance by computing
several results per thread instead of a single one. It has also been shown that, using
this method, high performance can be achieved with a low occupancy.
Another technique for increasing performance focuses on using fast memory, such as
registers or shared memory. Updating an implementation so that it divides data into smaller
pieces, called tiles, that fit into caches or shared memory can be very effective. The cost of
copying data is amortised, and the kernel will process the cached data.
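The loop structure behind tiling can be shown with a CPU sketch in plain C. This is an illustration under stated assumptions and not the project's Cuda implementation: on the GPU each tile would be copied into shared memory before use, whereas here the tile is simply a small index range that is expected to stay in cache. The tile width of 2 is hypothetical and deliberately tiny so that small inputs exercise several tiles; the names matmul_tiled and tile_end are hypothetical as well. Matrices are stored row by row in one-dimensional arrays.

```c
#define TILE 2   /* hypothetical tile width, kept tiny for illustration */

/* End index (exclusive) of the tile starting at start, clipped to limit. */
static int tile_end(int start, int limit)
{
    return (start + TILE < limit) ? start + TILE : limit;
}

/* c = a * b, where a is aheight x awidth, b is awidth x bwidth
 * and c is aheight x bwidth. */
void matmul_tiled(const float *a, const float *b, float *c,
                  int aheight, int awidth, int bwidth)
{
    for (int i = 0; i < aheight * bwidth; ++i)
        c[i] = 0.0f;

    /* The three outer loops select a tile of c and a matching pair of
     * tiles from a and b; the inner loops accumulate their contribution. */
    for (int i0 = 0; i0 < aheight; i0 += TILE)
        for (int j0 = 0; j0 < bwidth; j0 += TILE)
            for (int k0 = 0; k0 < awidth; k0 += TILE)
                for (int i = i0; i < tile_end(i0, aheight); ++i)
                    for (int j = j0; j < tile_end(j0, bwidth); ++j) {
                        float sum = c[i * bwidth + j];
                        for (int k = k0; k < tile_end(k0, awidth); ++k)
                            sum += a[i * awidth + k] * b[k * bwidth + j];
                        c[i * bwidth + j] = sum;
                    }
}
```

Each element of a tile is reused TILE times once it has been loaded, which is the amortisation of the copying cost referred to above.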
Some algorithms are not designed for parallel processing, and the performance they can
deliver on a parallel architecture, is not very high. Instead, for some algorithms, a block
version has been designed. A block algorithm usually has three advantages over normal
algorithms. Firstly, it is able to solve much larger problems, by dividing the problem into
smaller pieces and solving them independently. Secondly, dividing a problem into smaller
sizes is the core of the tiling implementation strategy, so using a block algorithm
automatically enables tiling. Lastly, block algorithms sometimes rely on other linear
algebra operations that are highly parallel, for instance matrix-multiplication.
So to clarify, tiling refers to a specific implementation that exploits a faster memory type,
whereas block refers to the algorithm. For some algorithms block and tiling are almost the
same (e.g. matrix-multiplication), for others they are not.
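The relation between a block algorithm and tiling can be illustrated with a small host-side sketch in C. It is not taken from the report: the function name matmul_blocked and the tile width TILE_WIDTH are hypothetical, and the matrices are assumed to be stored row-major in flat arrays. The three outer loops walk over tiles, the three inner loops work inside one tile, so each piece of A and B is reused while it is still in fast memory.

```c
#include <stddef.h>

/* Hypothetical tile width; on a GPU it would be chosen so that a tile
   fits into shared memory. */
#define TILE_WIDTH 4

/* C = A * B for row-major matrices: A is n x m, B is m x p, C is n x p. */
void matmul_blocked(const float *A, const float *B, float *C,
                    size_t n, size_t m, size_t p)
{
    for (size_t i = 0; i < n * p; ++i)
        C[i] = 0.0f;

    /* Outer three loops: step from tile to tile. */
    for (size_t ii = 0; ii < n; ii += TILE_WIDTH)
        for (size_t kk = 0; kk < m; kk += TILE_WIDTH)
            for (size_t jj = 0; jj < p; jj += TILE_WIDTH)
                /* Inner three loops: work inside one tile. The bounds are
                   clamped so sizes need not be multiples of TILE_WIDTH. */
                for (size_t i = ii; i < n && i < ii + TILE_WIDTH; ++i)
                    for (size_t k = kk; k < m && k < kk + TILE_WIDTH; ++k) {
                        float a = A[i * m + k];
                        for (size_t j = jj; j < p && j < jj + TILE_WIDTH; ++j)
                            C[i * p + j] += a * B[k * p + j];
                    }
}
```

The result is the same as for the simple triple loop; only the order in which the partial products are accumulated changes.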
6 Matrix-multiplication
A matrix is essentially a rectangular array of numbers, and is often denoted with a capital
letter. Here the matrix A with two rows and three columns is shown.
\[ A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix} \]
The numbers or values in a matrix are called elements, and are by convention denoted $a_{rc}$, where r is the row index and c is the column index. The row index indicates in which row the
element lies, whereas the column index indicates the column in which the element lies.
Matrix-multiplication, also called matrix product, is a linear algebra matrix operation
consisting of the operations multiplication and addition. Elements in the respective matrices
are aligned, multiplied, added, and then the grand sum is placed into the resulting matrix.
Performing matrix-multiplication on two matrices is only possible if their
dimensions conform for multiplication, meaning the number of columns of the first matrix
must be equal to the number of rows of the second. The resulting matrix will be an $n \times p$ matrix, where $n$ is the number of rows of the first matrix and $p$ is the number of columns of the
second.
Figure 5 - Matrix-multiplication process depicted
It should be further noted that this operation is not commutative, hence $A \times B \neq B \times A$, except for special cases where matrix-multiplication actually is commutative. These cases
are, however, not described in any further detail, as they are outside the scope of this report.
The naive process of matrix-multiplication is rather simple. The data size of two square
matrices is $2 * n^2$ and the running time is $O(n^3)$, where n is the width and height of the
matrices. This shows that the running time grows faster than the data size.
The simple or naive implementation will be discussed and shown later, but there exist other
algorithms which are more efficient, for instance Strassen's algorithm or the Coppersmith–Winograd
algorithm [11]. However, these algorithms add complexity to the implementation, and
require extra attention to numerical stability issues.
The focus of this project's matrix-multiplication is on optimisations for the GPU platform,
and not on the algorithm itself. For that reason, only the simple implementation will serve
as a base for analysis, implementation and testing, and not the other algorithms mentioned.
Optimisations applied to the simple algorithm will focus on capabilities and properties of the
GPU platform, and the essence of the original matrix-multiplication algorithm will be kept.
Furthermore, Cuda is the focus of this project, meaning the implementation and optimisations
will be focussed on Cuda and the C/C++ implementations. GPU.NET will be used in the result
section as a perspective and for comparison.
In the following, the matrix named A will always reference the first matrix of the
matrix-multiplication process. The second matrix will be named B and the resulting matrix C, like so:
\[ A * B = C \]
6.1 Analysis
Parallel processing on a GPU platform is stream based, and supports the parallelisation of
data very well. The simple nature of the matrix-multiplication algorithm makes the
implementation for processing on a GPU platform straightforward. The fact that the
algorithm has a running time of $O(n^3)$ makes optimisations and performance gains easier to
test and time on different platforms, and with different data sizes.
6.1.1 The sequential algorithm
As mentioned earlier, the focus will be on the simple matrix-multiplication algorithm. The
sequential algorithm consists of the following steps:
for (int i = 0; i < A.rowCount; ++i) {
    for (int j = 0; j < B.columnCount; ++j) {
        double sum = 0;
        for (int k = 0; k < A.columnCount; ++k) {
            double a = A[i][k];
            double b = B[k][j];
            sum += a * b;
        }
        C[i][j] = (float)sum;
    }
}
This algorithm consists of three loops, and the inner loop has an addition and a multiplication
operation. This also shows that the running time is $O(n * m * p)$, where n is the number of rows
in A, m the number of columns of A and rows in B, and lastly p is the number of columns in B.
For square matrices where $n = m = p$ the running time is $O(n^3)$.
The inner loop computes the dot-product of the vectors of A and B. The two outer loops are
responsible for iterating through the rows of A and columns of B, and their mutual order does
not influence the running time.
6.1.2 Parallelism
The simple matrix-multiplication algorithm consists of three loops. One adjustment to induce
concurrency is to perform the outer loop in parallel; another to calculate each value of the
resulting matrix in parallel.
In any case, there is not just one single solution to making matrix-multiplication work in
parallel, but multiple. These different approaches will be discussed in the following.
The outer loop
A simple approach to make the outer loop work in parallel is to make threads handle each
row in A. This is possible as there are no synchronization issues to handle, but it means that
the total number of required threads will be equal to the number of rows in A.
This adjustment is doable; however, any performance gains or losses are dependent on the
data size. Consider the case where A is a column matrix and B a row matrix: this would mean
many threads that each do little work.
Resulting matrix values
Calculating the different values of the resulting matrix is another way of making the
algorithm work in parallel. By using this method the required number of threads will be equal
to the size of the resulting matrix C, which is:
\[ C_{size} = A_{rows} * B_{columns} \]
This shows that this approach is also prone to the problem where A is a column matrix and B
a row matrix. Again, many threads do little work. For simplicity, I will not handle this case
specifically, but will later present an algorithm that performs better with different data
sizes.
6.2 Simple algorithm
Initially a simple and straightforward matrix-multiplication algorithm is implemented and
tested. Based on the results and the properties and capabilities of the Cuda architecture,
different optimisation techniques are implemented, tested and then evaluated.
6.2.1 The algorithm
The sequential algorithm described in the analysis is used to calculate the resulting matrix on
the CPU. This reference matrix will be used as a comparison to the GPU calculated result
matrix, and as such function as the correctness test.
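A correctness test of this kind can be sketched in C as follows. This is only an illustration under stated assumptions: the names matmul_reference and max_abs_diff are hypothetical, the matrices are assumed to be row-major flat float arrays, and the sum is accumulated in double and stored as float, as in the sequential algorithm from the analysis.

```c
#include <stddef.h>

/* CPU reference: C = A * B, row-major, A is n x m, B is m x p. */
void matmul_reference(const float *A, const float *B, float *C,
                      size_t n, size_t m, size_t p)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < p; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < m; ++k)
                sum += (double)A[i * m + k] * (double)B[k * p + j];
            C[i * p + j] = (float)sum;
        }
}

/* Largest absolute element-wise difference between two result matrices,
   e.g. the CPU reference and the matrix copied back from the GPU. */
float max_abs_diff(const float *x, const float *y, size_t count)
{
    float worst = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        float d = x[i] - y[i];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

Comparing against a small tolerance rather than exact equality is one possible design choice, since single-precision results may differ slightly between devices.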
As described in the analysis, the algorithm can be made parallel by making the outer loop or
the calculation of the resulting matrix values processed in parallel. These methods are
straightforward but have some drawbacks, meaning the performance is dependent on the
data size. The solution is to find the best balance between threads and their workload, and
the means is segmentation of data in blocks.
The kernel to the first solution looks like this:
__global__ void matrixMultiplicationSimple(matrix *a, matrix *b, matrix *c) {

    // Thread ID
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    double sum, av, bv;

    if (tid < a->height) {

        for (unsigned int j = 0; j < b->width; ++j) {

            sum = 0;

            for (unsigned int k = 0; k < a->width; ++k) {
                av = a->n[tid * a->width + k];
                bv = b->n[k * b->width + j];
                sum += av * bv;
            }
            c->n[tid * b->width + j] = (float)sum;
        }
    }
}
What is important to note is that the kernel has a loop in a loop, making the running time of a
single thread $O(n * m)$, where n is the number of columns in B and m the number of columns in A.
The kernel that calculates the values of the resulting matrix does not have this double loop.
__global__ void matrixMultiplicationRM(matrix *a, matrix *b, matrix *c) {

    // Matrix C coordinates
    int c_column = blockIdx.x * blockDim.x + threadIdx.x;
    int c_row = blockIdx.y * blockDim.y + threadIdx.y;
    double sum, av, bv;

    // Make sure not to exceed C boundaries
    if (c_row < c->height && c_column < c->width) {

        sum = 0;

        for (int i = 0; i < a->width; i++) {

            av = a->n[c_row * a->width + i];
            bv = b->n[i * b->width + c_column];
            sum += av * bv;
        }

        c->n[c_row * b->width + c_column] = (float)sum;
    }
}
The two kernels differ in that each thread of the first does more work than each thread of the
last. In the last kernel, a loop structure was unrolled. Firstly, this makes the threads more
fine-grained, which gives a higher parallel potential. Secondly, the control flow instructions
from the loop are not performed, releasing more resources for the kernel.
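The difference in thread granularity can be made concrete with a small host-side calculation. The sketch below is an assumption-laden illustration, not code from the report: the helper names are hypothetical, and it only counts how many threads a launch would create when the extents are rounded up to whole blocks.

```c
/* Number of blocks needed to cover `extent` elements with blocks of
   `block` threads along one axis (ceiling division). */
unsigned int blocks_needed(unsigned int extent, unsigned int block)
{
    return (extent + block - 1) / block;
}

/* Threads launched for the outer-loop kernel: one per row of A,
   rounded up to whole blocks. */
unsigned int threads_outer_loop(unsigned int a_rows, unsigned int block)
{
    return blocks_needed(a_rows, block) * block;
}

/* Threads launched for the resulting-matrix kernel: one per element of C,
   rounded up to whole blocks in both dimensions. */
unsigned int threads_result_matrix(unsigned int c_rows, unsigned int c_cols,
                                   unsigned int block_x, unsigned int block_y)
{
    return blocks_needed(c_cols, block_x) * block_x
         * blocks_needed(c_rows, block_y) * block_y;
}
```

For a matrix A with 200 rows and a hypothetical block of 256 threads, threads_outer_loop(200, 256) gives 256 threads, whereas a 200 × 800 result matrix with hypothetical 16 × 16 blocks gives threads_result_matrix(200, 800, 16, 16) = 166,400 threads. The boundary checks in the kernels are what make the rounded-up threads harmless.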
6.2.2 Test and results
Testing is performed on different platforms. To dedicate as much of the GPU's performance
as possible to the algorithm, rather than to rendering of the results, the tests are executed
using a console program.
Figure 6 - The output of the console testing program
Matrix-multiplication is tested where matrix A is 200 × 400 and B is 400 × 800. The resulting
matrix C will be 200 × 800.
Outer loop
The Cuda occupancy calculator showed that the simple outer loop implementation would
have a multiprocessor occupancy of 83%. The outer loop implementation was tested with the use
of the matrix structure as a parameter for the kernel.
The kernel running time is, for Cuda, the direct GPU calculation time; hence it is exclusive of
the time to perform data transfer. The Cuda kernel running time does not say anything about
whether it is feasible to perform matrix-multiplication on the GPU compared to the CPU, but
only whether the GPU calculates the resulting matrix faster than the CPU. By just
measuring the calculation time, it is easy to directly compare the computation time on the
GPU with that of the CPU.
Platform   Kernel running time   Operations/ms   Gigaflops/sec   CPU running time
Cuda       349.15 ms             366,609         0.37            250.90 ms
Table 4 - Test result of outer loops matrix-multiplication on platform #1
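The report does not spell out how the Operations/ms and Gigaflops/sec columns are derived. A plausible reading, stated here as an assumption, is that every inner-loop iteration counts as two operations (one multiplication and one addition), giving 2 · n · m · p operations in total. The helper names in the following C sketch are hypothetical.

```c
/* Floating-point operations in an (n x m) by (m x p) product, assuming one
   multiplication and one addition per inner-loop iteration. */
double matmul_flops(double n, double m, double p)
{
    return 2.0 * n * m * p;
}

/* Operations per millisecond for a measured running time in ms. */
double ops_per_ms(double flops, double time_ms)
{
    return flops / time_ms;
}

/* Gigaflops (1e9 operations per second) for a running time in ms. */
double gigaflops(double flops, double time_ms)
{
    return flops / (time_ms * 1.0e-3) / 1.0e9;
}
```

For the tested sizes (200 × 400 times 400 × 800) this gives 128,000,000 operations, and with the measured 349.15 ms roughly 366,600 operations/ms or 0.37 gigaflops, which is close to the tabulated values.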
Some interesting results have emerged from this initial test. First of all, there is a difference
between the result calculated on the GPU and that calculated on the CPU. The maximum
difference in the resulting values is 0.010742 on platform #1. The GPU
architecture was initially designed for increased speed at the cost of precision, which partly
explains the difference in the resulting values. Newer architectures implement an instruction
set with increased precision. This kernel has been tested on an architecture with compute
capability v2.0, where the difference between GPU and CPU results was 0.0.
Another surprise from the test is that the GPU calculation has a peak performance of only
0.3666 gigaflops, and takes longer than on the CPU. What causes such bad performance,
one might ask?
Outer loop without structure
Global reads are expensive, and coalesced memory reads should be achieved to optimise
performance. Structures can, if not aligned, produce non-coalesced memory access.
Whether using structures as parameters for the kernel had any impact on performance would
be interesting to test. So, minor adjustments to the code were made to eliminate structures
as parameters, and the updated kernel function definition now looked like this:
__global__ void
matrixMultiplicationSimpleNS(float *a, float *b, float *c, int aheight,
                             int awidth, int bwidth)
The adjustment meant that the occupancy of the multiprocessor rose from 83% to 100%, an
increase which suggested that better performance could be expected. But the kernel
calculation running time was measured at 357.88 ms, a running time that is approximately the
same as when using structure parameters.
Even though a higher occupancy suggested increased performance, no performance gain was
achieved. This result is in line with Vasily Volkov's tests in “Better performance at lower
Occupancy” [6].
So the first optimisation will look into reorganising the threads, to test whether Cuda performs
better when more threads perform less than when few threads perform more. But before doing so,
testing and comparing performance with GPU.NET would indicate whether the GPU.NET API
performs on the same level as using Cuda directly.
GPU.NET
The GPU.NET platform has different limitations; one is that kernel methods only support
primitive types as parameters. For that reason, testing matrix-multiplication with the Matrix
structure is not possible. So the outer loop matrix-multiplication method was implemented
without structures, and can therefore be directly compared to the similar Cuda
implementation.
It is furthermore not possible, on the GPU.NET platform, to measure solely the direct
calculation time; the kernel running time is inclusive of data transfer and JIT compilation.
Platform   Kernel running time   Operations/ms   Gigaflops/sec   CPU running time
Cuda       357.88 ms             357,661         0.36            381.23 ms
GPU.NET    409.00 ms*            312,958         0.31            387.00 ms
Table 5 - Test result of outer loops matrix-multiplication no structure on platform #1
* Inclusive of data transfer and JIT compilation
Taking data transfer and JIT compilation into account, the performance of GPU.NET and Cuda
is almost identical. It will be interesting to see whether this is also the case when different
optimisation techniques and features are exploited.
6.3 Optimisation
The Cuda architecture has different characteristics and capabilities, and to optimise
performance different features and techniques can be utilised. First the unrolling of a loop
will be tried, after which the tiling and other methods from the strategy will be applied.
6.3.1 Unroll loop with threads
The simple implementation was designed so that fewer threads performed more work. The first
optimisation will try to uncover whether more threads performing less, by unrolling a loop,
actually is better. This can be achieved by modifying the algorithm so that each thread calculates
one value in the resulting matrix.
The modified kernel is shown in the following:
__global__
void matrixMultiplicationRM(matrix *a, matrix *b, matrix *c) {

    // Matrix C coordinates
    int c_column = blockIdx.x * blockDim.x + threadIdx.x;
    int c_row = blockIdx.y * blockDim.y + threadIdx.y;
    double sum, av, bv;

    // Make sure not to exceed C boundaries
    if (c_row < c->height && c_column < c->width) {

        sum = 0;

        for (int i = 0; i < a->width; i++) {

            av = a->n[c_row * a->width + i];
            bv = b->n[i * b->width + c_column];
            sum += av * bv;
        }

        c->n[c_row * b->width + c_column] = (float)sum;
    }
}
Test and result
A similar modification was made to the kernel in GPU.NET and the running times are shown in
the following table.
Platform            Kernel running time   Operations/ms   Gigaflops/sec   CPU running time
Cuda (outer loop)   349.15 ms             366,609         0.37            250.90 ms
Cuda                53.04 ms              2,413,297       2.41            253.14 ms
GPU.NET             143.00 ms*            895,104         0.90            372.00 ms
Table 6 - Test result of matrix-multiplication for resulting matrix on platform #1
* Inclusive of data transfer and JIT compilation
The running time of both the GPU.NET and Cuda implementation has decreased. GPU.NET
performs about 2.86 times better than the GPU.NET outer loop implementation; however, this
performance increase is disappointing when compared to the performance gain when purely
using Cuda. Looking solely at the Cuda implementation, the performance is
almost 6.6 times better than with the outer loop approach.
So even though the GPU.NET performance, compared to Cuda, is disappointing, the
performance gains are significant, and indicate indeed that more threads doing less by
unrolling a loop is a reasonable approach.
When programming for parallel execution on the CPU platform, it is important to use the
correct number of threads to solve the problem optimally, and as spawning a thread is
expensive, not too many threads should be used. The results from these tests show that this
rule of thumb does not apply to the GPU platform. The overhead of creating a thread in
Cuda is far less than that of creating a thread on the CPU.
Another factor is the amount of global memory reads. Reading from global memory is
expensive and very slow [5]. When looking at the code of the kernel, it shows that the inner
loop makes two global memory reads and one multiplication and addition operation. This
equals a CGMA ratio of approximately 1.0.
On platform #1 the global memory has a peak bandwidth of 16.6 GB/sec. With 4
bytes in each single-precision floating-point value, at most 4.15 (16.6/4) giga single-precision
values can be loaded per second. With a CGMA ratio of 1.0, this kernel will execute at no more
than 4.15 gigaflops [4].
So in short, this kernel is memory-bound, and to optimise a memory-bound kernel the focus
should be on global memory access. One method for doing this is to use tiling.
6.3.2 Tiling v1
One of the fastest memory types on a Cuda device is the shared memory. The shared
memory is on-chip and very fast, but also limited. Shared memory is accessible to and shared by
all threads in a block, so it is an obvious choice for a block cache.
One strategy for reducing global memory traffic is to partition data into tiles that will fit into
the shared memory. A tile of data is then loaded from device memory into shared memory,
processed, and lastly the results are written back to device memory [2]. One important
criterion is that the computations on these tiles of data can be performed
independently of each other.
This requires the threads in a block to be synchronised, as shown in the following kernel code:
__global__
void matrixMultiplicationTILINGns(float* a, float* b, float* c, int aWidth,
                                  int bWidth) {

    // blockDim.x = TILING_DIM (last is defined and hence faster)
    // blockDim.y = TILING_DIM (last is defined and hence faster)

    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Matrix C coordinates
    int c_column = bx * TILING_DIM + tx;
    int c_row = by * TILING_DIM + ty;

    // Calculate the first and the last index of the row in a for the
    // current thread
    int aIdxBegin = c_row * aWidth + tx;
    int aIdxEnd = aIdxBegin + aWidth - 1;
    int bIdxBegin = c_column + bWidth * ty;

    float sum = 0.0;

    for (int aIdx = aIdxBegin,
             bIdx = bIdxBegin; aIdx <= aIdxEnd;) {

        __shared__ float ac[TILING_DIM][TILING_DIM]; // A cache
        __shared__ float bc[TILING_DIM][TILING_DIM]; // B cache

        // Load values to cache
        ac[tx][ty] = a[aIdx];
        bc[tx][ty] = b[bIdx];

        // Synchronise to make sure all threads in block have saved
        // values to the shared memory for this phase
        __syncthreads();

        for (int i = 0; i < TILING_DIM; ++i) {
            sum += ac[i][ty] * bc[tx][i];
        }

        // Synchronise to make sure that computations are done
        __syncthreads();

        aIdx += TILING_DIM;          // Add index by phase dimension
        bIdx += TILING_DIM * bWidth; // Add index by phase dimension and
                                     // b width
    }

    // Insert dot-product in resulting matrix
    c[c_row * bWidth + c_column] = sum;
}
Looking at the Cuda kernel the CGMA is calculated by:
\[ BDS * (1\,\text{add} + 1\,\text{multiplication}) : 2\,GMR \]
Where $BDS$ stands for block dimension size and is the axis size of one dimension in the block. $GMR$ stands for global memory read and is the number of accesses to global memory.
The block dimension was set to 20, giving a CGMA of 20. With 4.15 giga single-precision
values per second for platform #1, the immediate kernel peak performance is calculated to be
83 gigaflops. This is an impressive theoretical peak performance for this kernel, taking into
account that the global memory has a bandwidth of 16.6 GB/sec. However, the GPU of
platform #1 has a peak performance of 34.38 gigaflops and the kernel is limited by that, so
the theoretical maximum performance of this kernel on this platform is 34.38 gigaflops.
That the kernel on platform #1 is limited to 34.38 gigaflops shows that the tile-strategy
kernel algorithm is no longer memory-bound, but actually arithmetic-bound. This is
theoretically true, but the picture might be different when the test has been performed and
the result is ready.
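The two bounds used in this reasoning can be written down as a short C sketch. The function names are hypothetical; the inputs are the figures quoted for platform #1 (16.6 GB/sec bandwidth, 4 bytes per value, an arithmetic peak of 34.38 gigaflops), and the model deliberately ignores everything except bandwidth, CGMA and arithmetic peak.

```c
/* Upper bound in gigaflops for a kernel limited only by global memory:
   giga values delivered per second (bandwidth / bytes per value) times
   the CGMA ratio (operations per global memory access). */
double memory_bound_gflops(double bandwidth_gb_s, double bytes_per_value,
                           double cgma)
{
    return bandwidth_gb_s / bytes_per_value * cgma;
}

/* The attainable peak is the smaller of the memory bound and the
   arithmetic peak of the device. */
double attainable_gflops(double memory_bound, double arithmetic_peak)
{
    return memory_bound < arithmetic_peak ? memory_bound : arithmetic_peak;
}
```

With a CGMA of 1.0 the memory bound is 4.15 gigaflops and the kernel is memory-bound; with a CGMA of 20 the memory bound rises to 83 gigaflops, so the arithmetic peak of 34.38 gigaflops becomes the limit.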
Test and result
To test whether using the matrix structure as parameter had any impact on performance in
Cuda, I implemented two tiled kernels. The first one used the matrix structure as parameter
and the second used pointer arrays.
GPU.NET supports shared memory as well; however, only arrays with one dimension are
supported. So the shared memory indexes in the source code were adjusted to lay out the
arrays sequentially. Besides this minor adjustment, the Cuda kernel was easy to port to
GPU.NET, and the test results are shown in the following table.
Platform           Kernel running time   Operations/ms   Gigaflops/sec   CPU running time
Cuda               39.52 ms              3,238,368       3.24            257.92 ms
Cuda (No struct)   35.96 ms              3,559,595       3.56            254.48 ms
GPU.NET            76.00 ms*             1,684,210       1.68            357.00 ms
Table 7 - Test result of matrix-multiplication for tiling strategy on platform #1
* Inclusive of data transfer and JIT compilation
The block has two dimensions with a length of 20, which gives 20 ∗ 20 = 400 threads per block. The normal recommendation is to make the block size
divisible by the warp size, currently 32. These tests were also performed with a
block size of 16 ∗ 16 = 256 threads per block, but the peak performance for the Cuda kernel
was then only about 1.6 gigaflops. The conclusion is that it sometimes pays off not to follow the
recommendation. In this case, the overhead of filling the warp with empty padded threads is
insignificant compared to the larger number of coalesced memory reads that the larger
block size results in.
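How much padding a block size causes can be estimated with a few lines of C. This is a simplified sketch with hypothetical function names; it assumes that the threads of a block are packed into consecutive warps and that only the last warp can be partially filled.

```c
/* Warps needed to hold a block of `threads` threads. */
unsigned int warps_per_block(unsigned int threads, unsigned int warp_size)
{
    return (threads + warp_size - 1) / warp_size;
}

/* Padded (idle) thread slots in the last, partially filled warp. */
unsigned int idle_lanes(unsigned int threads, unsigned int warp_size)
{
    return warps_per_block(threads, warp_size) * warp_size - threads;
}
```

A block of 400 threads occupies 13 warps and leaves 16 of 416 slots idle (just under 4%), while a block of 256 threads fills 8 warps exactly.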
By using the tile strategy and shared memory, it was possible to perform matrix-
multiplication even faster than the resulting matrix algorithm. Cuda was about 1.47 times
faster and GPU.NET was faster by a factor of 1.88.
Looking at the peak performance, the result indicates that even though the algorithms are
the same, GPU.NET has no chance of performing on the same level as when Cuda is
used directly. This is most likely due to the fact that GPU.NET JIT compiles the device code.
The Cuda kernel has a peak performance of 3.56 gigaflops, which is remarkably slow
compared to the theoretical 34.38 gigaflops. This gives an actual performance that is just
10.35% of the theoretically possible. And even though the kernel algorithm is theoretically
arithmetic-bound, this significantly slower performance suggests that the performance limiting
factors can in fact be both arithmetic and memory.
6.3.3 Tiling v2 with latency hiding
When a kernel does not reach the expected performance level, a good place to start is to
analyse the kernel with focus on coalesced memory access. Note that even though shared
memory is fast, access should still be optimised with regards to coalescing access. The
following kernel is the result of such an analysis:
1.  __global__ void matrixMultiplicationTILINGns_v2(float* a, float* b, float* c, int aWidth, int bWidth) {
2.
3.      // Declare cache
4.      __shared__ float ac[TILING_DIM][TILING_DIM];
5.      __shared__ float bc[TILING_DIM][TILING_DIM];
6.
7.      // Calculate Matrix C coordinates
8.      const int c_column = blockIdx.x * TILING_DIM + threadIdx.x;
9.      const int c_row = blockIdx.y * TILING_DIM + threadIdx.y;
10.     const int cidx = c_row * bWidth + c_column;
11.
12.     // Calculate the first index in of row in a, and the last for the
13.     // current thread
14.     const int aIdxBegin = c_row * aWidth + threadIdx.x;
15.     const int aIdxEnd = aIdxBegin + aWidth - 1;
16.
17.     float sum = 0.0;
18.
19.     for (int aIdx = aIdxBegin,
20.              bIdx = c_column + bWidth * threadIdx.y; aIdx <= aIdxEnd;) {
21.         ac[threadIdx.y][threadIdx.x] = a[aIdx];
22.         aIdx += TILING_DIM; // Increase a index
23.
24.         bc[threadIdx.y][threadIdx.x] = b[bIdx];
25.         bIdx += TILING_DIM*bWidth; // Increase b index
26.
27.         // Synchronze to make sure all threads in block have saved
28.         // values to the shared memory for this phase
29.         __syncthreads();
30.
31.         // Compute dot-product
32.         for (int i=0; i < TILING_DIM; ++i) {
33.             sum += ac[threadIdx.y][i]*bc[i][threadIdx.x];
34.         }
35.
36.         // Synchronise to make sure that computation are done
37.         __syncthreads();
38.     }
39.
40.     // Insert dot-product in resulting matrix
41.     c[cidx] = sum;
42. }
To optimise this v2 kernel, four register variables have been removed to increase the
number of possible active warps. Furthermore, the shared memory access in lines 21 and
24 has been optimised for coalescing. This now yields a peak performance of 5.028 gigaflops
on platform #1.
Nvidia provides a code example implementation of matrix-multiplication using the
tile-strategy. This kernel has a peak performance of 4.91 gigaflops, so with these minor updates it
is possible to get a kernel to perform better.
6.3.4 Tiling v3 with prefetching
Access to global memory is limited by bandwidth and high latency. The high latency of access
to global memory makes kernel execution halt until the data is served. Organising the code
such that other instructions can be executed while a thread is waiting for data is called
latency hiding.
One way of hiding latency from memory access is to exploit data prefetching. This basically
works by pre-fetching data while the current data is being processed. The following pseudo
code shows the steps of using pre-fetching for matrix-multiplication:
__global__ void mm_prefecth(float* a, float* b, float* c, int aWidth, int bWidth) {

    // Load data from global memory to register variables

    while (data to process) {

        // Insert register values to shared memory

        // Synchronise threads

        // Prefetch next values to register

        // Calculate the dot-product

        // Synchronise to make sure that computations are done
    }

    // Insert dot-product in resulting matrix
}
By exploiting prefetching the peak performance of matrix-multiplication increased for system
#1 and #4. 5.68 gigaflops was reached for system #1, while system #3 surprisingly incurred a
performance loss of 2.2 gigaflops.
6.3.5 Tiling v4 and v5 with more output per thread
Vasily Volkov from UC Berkeley has looked deeper into “Better performance at Lower
Occupancy” [6]. He has shown that it is possible to increase performance of
matrix-multiplication by making a thread compute several dot-products instead of one.
Volkov points out two things in particular. By making a thread compute more output, a thread
can reuse values in registers for several computations. Registers are faster than both
shared and global memory, and by grouping the work of several threads together, it is possible
to use registers for data sharing. The hypothesis is that for memory-heavy kernels, it is
beneficial to let fewer threads carry a higher workload, to exploit the fast registers for data
sharing.
In the matrix-multiplication kernel this has another advantage: as the dot-product is
calculated between a single column in B and several rows in A, the memory read access to
this specific column in B is reduced by 3/4. Consider the following inner loop, which computes
the dot-product for the different rows. The read of bc[i][threadIdx.x] is the same in all four
statements and is hence automatically cached.
// Calculate the dot-product
for (int i=0; i < TILING_DIM; ++i) {
    sum[0] += ac[threadIdx.y][i] * bc[i][threadIdx.x];
    sum[1] += ac[threadIdx.y+5][i] * bc[i][threadIdx.x];
    sum[2] += ac[threadIdx.y+10][i] * bc[i][threadIdx.x];
    sum[3] += ac[threadIdx.y+15][i] * bc[i][threadIdx.x];
}
I made two tests, one where a single thread computes two dot-products, and another where a
thread computes four dot-products. The results of platform #1 were a bit surprising and have
therefore also been tested on platform #4.
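The reuse of the B value can be made explicit in a host-side C sketch of one thread's share of the work. This is an illustration only: the function accumulate_four is hypothetical, and the tile width of 20 and the row stride of 5 are assumptions inferred from the offsets 5, 10 and 15 in the inner loop. The value from the B tile is read once per iteration into a local variable and then used for all four sums, which is the effect the kernel relies on.

```c
#define TILE_DIM 20 /* assumed tile width */
#define OUTPUTS 4   /* dot-products computed per thread */
#define STRIDE 5    /* row offset between outputs, as in the inner loop */

/* One thread's share of a tile phase: four rows of the A tile against one
   column of the B tile. `ty` must be in the range 0..STRIDE-1. */
void accumulate_four(float ac[TILE_DIM][TILE_DIM],
                     float bc[TILE_DIM][TILE_DIM],
                     int ty, int tx, float sum[OUTPUTS])
{
    for (int i = 0; i < TILE_DIM; ++i) {
        float b = bc[i][tx]; /* single read, shared by the four products */
        for (int r = 0; r < OUTPUTS; ++r)
            sum[r] += ac[ty + r * STRIDE][i] * b;
    }
}
```

Per iteration this performs one read from the B tile instead of four, which corresponds to the 3/4 reduction mentioned in the text.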
Kernel/System                #1 (CC v1.1)                       #3 (CC v1.3)       #4 (CC v2.0)
Tiling v3, prefetching       5.68 gigaflops (Occupancy: 54%)    113.52 gigaflops   34.57 gigaflops (Occupancy: 81%)
Tiling v4, 2 outputs/thread  4.45 gigaflops (Occupancy: 58%)    96.51 gigaflops    36.89 gigaflops (Occupancy: 88%)
Tiling v5, 4 outputs/thread  5.89 gigaflops (Occupancy: 67%)    113.83 gigaflops   37.13 gigaflops (Occupancy: 67%)
Table 8 - Tiling with 2 and 4 outputs per thread comparison for different platforms
The performance expectations set by the Volkov slides say that several outputs/thread could
perform better than 1 output/thread. The results from platform #4 are the only ones following
this pattern, ending with a peak performance of 37.13 gigaflops. This performance is
achieved with an occupancy level of 67%, confirming that higher performance can be
achieved with a lower occupancy rate.
But it is evident that following Volkov’s suggestion is no guarantee of success. It is
interesting to see that the performance of the 2 outputs/thread method actually was lower
than expected for system #1 and #3, and that the 4 outputs/thread method on these systems
yielded results similar to those obtained when not using Volkov’s suggestions at all.
The major difference between system #1 and #3 on one side and #4 on the other is the
compute capability level of the GPU devices. System #1 and #3 belong to the 1.x generation,
whereas #4 belongs to the 2.x generation. This might be the reason that the algorithm performs
relatively differently on different architectures. In the next section I will look into whether
the CC level has any influence on the performance of an algorithm.
6.3.6 Cuda compute capability
The features, capabilities and instruction set of a GPU are specified by its compute
capability. The first Nvidia graphic cards were released with a compute capability level from
v1.0 to v1.1; the newest cards today are released with a CC level of 2.1. A higher level defines
a superset of the features of those at a lower level [4], and a higher level also indicates a newer
generation GPU.
The newer CC levels both add new features and improve on existing ones. One of the important
factors in performance is the performance impact of non-coalescing memory access. This has
been improved in many ways for CC 2.0 and 2.1, which means that the developer does not
need to use as much energy on designing an algorithm with memory coalescing in mind, making
it easier to port existing algorithms.
The following figure shows the peak performance in gigaflops of different kernels targeting
different CC levels. The results can also be found in appendix F.
Figure 7 - Performance of kernels executed for different CC levels on platform #4
(Chart; vertical axis: gigaflops, 0 to 40; series: CC 1.1, CC 1.3 and CC 2.0 in gigaflops.)
46
It was expected that the kernels would perform best on higher CC levels, and the rule
of thumb is to target the highest CC level possible when compiling kernels, to take advantage
of the newest optimisations and features. This is true except for the last kernel, where CC
levels 1.1 and 1.3 have a peak performance of 39.47 gigaflops, which is faster than the 37.06
gigaflops that CC 2.0 delivers.
Kernels executed on CC level 2.0 are generally between 5% and 10% faster than on lower
levels, except in the last case described above, where CC 2.0 is 6.11% slower. I have not been
able to find any explanation of why the rule of thumb is not valid for this specific case, but
even though this exception breaks the rule, I still recommend compiling for the highest CC
level possible.
6.4 Evaluation
The key to a well-performing kernel is memory coalescing and latency hiding. The first step
should be to structure the algorithm so that as much memory coalescing as possible is achieved.
Tiling has proven a very good strategy for increasing memory coalescing. Part of the memory
access limitation is due to the DRAM design, so memory coalescing optimisation techniques
should also be applied when using shared memory.
Using matrix structures as parameters was initially thought of as a good abstraction, but tests
showed that performance losses were the result. It is therefore recommended to use
primitive variables or pointers in the kernel function definitions.
Data prefetching combined with reordering of operations in the kernel was used to hide latency,
and gave diverging results. On three systems a performance gain of about 1 gigaflop was
achieved, but on the Tesla C1060 device a loss of 2.2 gigaflops was the result. Maybe the
hardware of the Tesla card is already optimised with regard to hiding this type of latency, so
that trying to handle it in the kernel counteracts these hardware optimisations. In any case,
this shows that data prefetching should be applied to a kernel with care.
Volkov presented the idea that latency hiding and the exploitation of registers can achieve a
higher peak performance. This proved to be true, and by making each thread do more work it
was possible to make a kernel perform even better. This optimisation technique furthermore
showed that the occupancy rate should not necessarily be relied on as an optimisation
strategy.
Control flow is another factor to keep in mind when designing a kernel. The Cuda
architecture is SIMT, so if a kernel has a complex control flow, several passes by the
warp scheduler can be necessary to complete the warp. This can result in an unnecessarily
long computing time.
The same operations processed on a GPU and a CPU do not always yield the same numerical
result. This is especially true for older architectures with compute capability levels between
v1.0 and v1.3. Newer architectures support the IEEE 754 standard better and in many cases
yield a result with better precision. In the tests of matrix-multiplication the maximum
difference in values was 0.013. To minimise this inaccuracy, special and slower intrinsic
functions can be used in the kernel. These functions deviate less from IEEE 754 and force the
compiler not to use FMAD instructions, which are fast but imprecise multiply-add instructions.
7 LU decomposition
LU-decomposition, also called LU-factorisation, is a linear algebra decomposition of a
matrix A in the form:
$$A = LU$$
where L and U are lower and upper triangular matrices [17]. If the LU factorisation is known,
it can be used to solve the linear system $Ax = b$ in two steps:
$$\text{Step 1: } Ly = b \qquad \text{Step 2: } Ux = y$$
Decomposing matrix A into a product of L and U can be achieved by using an enhanced version
of Gauss elimination. Only a square matrix can be decomposed, and the L and U matrices are
of the same size, as shown here:
$$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ l_{21} & 1 & 0 \\ l_{31} & l_{32} & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix}$$
Note that L is a unit lower triangular matrix, meaning the diagonal elements $l_{ii}$ of L are
all one. Stewart provides a sequential algorithm that builds on Gauss elimination, which also
creates a unit lower triangular matrix.
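The two solve steps can be sketched in C as follows. This is a minimal illustration rather than code from the implementation: the function name lu_solve and the packed storage layout (L below the diagonal with its unit diagonal implied, U on and above the diagonal, row-major) are assumptions chosen to match the in-place algorithm discussed in this chapter, and pivoting is left out.

```c
#include <stddef.h>

/* Solve A x = b given the packed LU factors of A (no pivoting).
 * lu is an n-by-n row-major array: the strict lower triangle holds L
 * (its unit diagonal is implied), the upper triangle holds U.
 * On entry x holds b; on exit it holds the solution. */
void lu_solve(const float *lu, size_t n, float *x)
{
    /* Step 1: forward substitution, L y = b (y overwrites x). */
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < i; j++)
            x[i] -= lu[i * n + j] * x[j];
    }

    /* Step 2: back substitution, U x = y. */
    for (size_t i = n; i-- > 0; ) {
        for (size_t j = i + 1; j < n; j++)
            x[i] -= lu[i * n + j] * x[j];
        x[i] /= lu[i * n + i];
    }
}
```

Both steps cost on the order of $n^2$ operations, so once the factors are known, each additional right-hand side is cheap compared with the factorisation itself.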
7.1 Analysis
Stewart designs an LU-decomposition algorithm and provides code that overwrites the
matrix A with its LU factorisation [17].
Pivoting, or row interchanges, may be required for two reasons: firstly to ensure the existence
of an LU factorisation, and secondly to increase the numerical stability of the Gaussian
elimination algorithm [18]. For simplicity, algorithms without pivoting will be analysed
initially, but when testing, only algorithms that implement partial pivoting will be used. This
makes sure that the performance and correctness of the individual algorithms can be
compared.
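To make the role of partial pivoting concrete, the following C sketch swaps in the row with the largest absolute value in the current column before eliminating. It is an illustrative assumption of how such a routine could look, not the code used in the tests; the names lu_decompose_pivot, absf and perm are hypothetical.

```c
#include <stddef.h>

static float absf(float v) { return v < 0.0f ? -v : v; }

/* In-place LU factorisation with partial pivoting of an n-by-n row-major
 * matrix m. perm[i] records which original row ended up in row i.
 * Returns 0 on success and -1 if no non-zero pivot exists. */
int lu_decompose_pivot(float *m, size_t n, size_t *perm)
{
    for (size_t i = 0; i < n; i++)
        perm[i] = i;

    for (size_t k = 0; k < n; k++) {
        /* Pick the row with the largest absolute value in column k. */
        size_t p = k;
        for (size_t i = k + 1; i < n; i++)
            if (absf(m[i * n + k]) > absf(m[p * n + k]))
                p = i;

        if (m[p * n + k] == 0.0f)
            return -1; /* no usable pivot: the matrix is singular */

        if (p != k) {
            /* Swap the full rows, including the multipliers already stored. */
            for (size_t c = 0; c < n; c++) {
                float tmp = m[k * n + c];
                m[k * n + c] = m[p * n + c];
                m[p * n + c] = tmp;
            }
            size_t tmp_row = perm[k];
            perm[k] = perm[p];
            perm[p] = tmp_row;
        }

        /* Eliminate below the pivot, storing the multipliers in place. */
        for (size_t i = k + 1; i < n; i++) {
            float rik = (m[i * n + k] /= m[k * n + k]);
            for (size_t c = k + 1; c < n; c++)
                m[i * n + c] -= rik * m[k * n + c];
        }
    }
    return 0;
}
```

A matrix such as [[0, 1], [1, 1]] has no LU factorisation without row interchanges because its first pivot is zero; with the swap above it factorises without difficulty.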
7.1.1 The sequential algorithm
The sequential algorithm, without pivoting, consists of three loops that overwrite the existing
matrix m. The matrix is vectorised and the index of an element is found by $r \cdot w + c$, where r
is the row index, w the width of the matrix and c the column index.
// Core algorithm for LU Decomposition
for (int k = 0; k < n; k++)
{
    for (int i = k + 1; i < n; i++)
    {
        // Compute scale factor Rik
        float Rik = (m[i * mWidth + k] /= m[k * mWidth + k]);

        // Subtract row k elements from row i elements with the
        // Rik scale factor
        for (int c = k + 1; c < n; c++)
        {
            m[i * mWidth + c] -= Rik * m[k * mWidth + c];
        }
    }
}
The code shown above omits pivoting for simplicity. The sequential implementation is simple,
consisting of three loops using the operations division, multiplication and addition. The data
size of the square matrix is $n^2$, and the algorithm requires approximately $\frac{n^3}{3}$
additions and multiplications and $\frac{n^2}{2}$ divisions to complete. Ignoring the lower
order term, the running time is $O(n^3)$, where n is both the width and height of the
matrix [19]. This shows that the running time grows faster than the data size.
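The operation counts can be checked empirically with an instrumented copy of the loops. The sketch below is an assumption made for illustration: the struct lu_op_count and the function lu_decompose_counted are hypothetical names, and the matrix is taken to be stored contiguously with width n.

```c
#include <stddef.h>

/* Counters for the two kinds of work done by the sequential algorithm. */
typedef struct {
    unsigned long divisions;
    unsigned long multiply_subtracts;
} lu_op_count;

/* In-place LU factorisation without pivoting of an n-by-n row-major matrix,
 * returning how many divisions and multiply-subtract pairs were executed. */
lu_op_count lu_decompose_counted(float *m, size_t n)
{
    lu_op_count ops = {0, 0};

    for (size_t k = 0; k < n; k++) {
        for (size_t i = k + 1; i < n; i++) {
            /* One division per row below the pivot. */
            float rik = (m[i * n + k] /= m[k * n + k]);
            ops.divisions++;

            /* One multiply-subtract pair per remaining element of the row. */
            for (size_t c = k + 1; c < n; c++) {
                m[i * n + c] -= rik * m[k * n + c];
                ops.multiply_subtracts++;
            }
        }
    }
    return ops;
}
```

For a 3 × 3 matrix this performs 3 divisions and 5 multiply-subtract pairs; as n grows, the counts approach the $\frac{n^2}{2}$ and $\frac{n^3}{3}$ estimates.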
7.1.2 Parallelism
Matrix-multiplication and the characteristics of its data access meant that inducing
concurrency and exploiting data-parallelism was straightforward. The same cannot be said
about LU-decomposition, in which data dependencies between the loops make parallelising
more complicated.
The sequential algorithm consists of three loops, and the operations performed by the two
inner loops result in an asymptotic running time as shown:
$$T_{\text{runtime}} = \frac{n^2}{2} + \frac{n^3}{3} = O(n^3)$$
where n is the width or height of the matrix. The outer loop cannot be directly performed in
parallel, as iteration $k_i$ depends on the results from the preceding iterations
$k_{i-1}, k_{i-2}, \ldots$
50
Parallelism is not impossible, but the order of the outer loop is vital. Taking the outer loop
into account, the operations part of the algorithm can be written as:
$$\sum_{k=1}^{n} \left( \frac{n}{2} + \frac{n^2}{3} \right)$$
This equation says that in each step $\frac{n}{2} + \frac{n^2}{3}$ operations are performed, and
these operations can be performed in parallel. The required number of operations when taking
parallelisation into account is:
$$\frac{n \left( \frac{n}{2} + \frac{n^2}{3} \right)}{p}$$
where p is the number of processes. The optimal execution performs all possible tasks in
parallel; for this algorithm the optimal number of processes is $p = n - 1$. But for simplicity,
let us set it to $p = n$:
$$\text{Optimal parallel runtime} = \frac{n \left( \frac{n}{2} + \frac{n^2}{3} \right)}{n} = O(n^2)$$
So this algorithm does have parallel potential. I will now look into whether this potential
can be utilised by the Cuda architecture.
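The estimates can be cross-checked against exact per-step counts. The derivation below is a sketch under stated assumptions: steps are numbered $k = 1, \ldots, n-1$, one process is assigned to each row below the pivot, and each process handles its own row sequentially.

```latex
\begin{align*}
\text{divisions in step } k &= n - k, &
\sum_{k=1}^{n-1} (n - k) &= \frac{n(n-1)}{2} \approx \frac{n^2}{2} \\
\text{multiply-subtract pairs in step } k &= (n - k)^2, &
\sum_{k=1}^{n-1} (n - k)^2 &= \frac{(n-1)\,n\,(2n-1)}{6} \approx \frac{n^3}{3} \\
\text{parallel time with } p = n - 1 &: &
\sum_{k=1}^{n-1} \bigl( 1 + (n - k) \bigr) &= (n-1) + \frac{n(n-1)}{2} = O(n^2)
\end{align*}
```

With one process per row, each step costs one division plus one multiply-subtract per remaining column, which is where the quadratic parallel running time comes from.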
Multipliers and row operations
The parts of the algorithm that exhibit little or no parallelism should be processed on the
CPU. This means the outer loop is processed by the CPU and the inner parts that exhibit
parallelism will be processed by the GPU. So the LU-decomposition implementation should be
processed by the CPU and GPU in cooperation. The interesting point will be to see whether it
is possible, when data transfer is also considered, to make the GPU assist the CPU in
accelerating the execution of the LU-decomposition algorithm.
To make the initial implementation simple, I have divided the calculation of multipliers and
the row operations into separate tasks. For each step of the outer loop, the multipliers are
calculated for the current column (lines 2 to 4), after which the multipliers are used to
calculate the elements in the upper triangular matrix (lines 5 to 9). These are individual tasks
that can be performed in parallel, as shown in the following pseudo code.
1. For k from 1 To n-1
2.   For i in { k+1, ...,n }
3.     LU[i][k] = LU[i][k] / LU[k][k]
4.   End
5.   For j in { k+1, ...,n }
6.     For i from k+1 To n
7.       LU[i][j] = LU[i][j] - LU[i][k] * LU[k][j]
8.     End
9.   End
10. End
This algorithm is not the only method for creating the LU-factorisation. There are other
algorithms that structure the operations in the outer loop differently, and the main
difference is their memory access patterns. The performance of different memory
access patterns may vary depending on the memory types used in the algorithm;
another factor is whether the tasks are fine-grained or coarse-grained. For now I
recognise the existence of other algorithms, but use the one described for the simple
implementation.
7.2 Simple algorithm
A simple version of the LU-decomposition algorithm was implemented and tested. The
optimisation steps uncovered by the matrix-multiplication implementations on Cuda will
be used to optimise the LU-decomposition implementation. Performance and correctness
tests will be performed and compared with different CPU algorithms, both with and
without pivoting.
7.2.1 The algorithm
The sequential algorithm described in the analysis is used to calculate the LU matrices on the
CPU, and serves as a comparison for the GPU computed result. This algorithm does not allow
the same level of parallelism as matrix-multiplication, but parts of the algorithm can be
parallelised. The following sample shows the code that performs the outer loop on the host
and makes calls to kernels processed by the device. Lines 7-9 show an optional call to a
device pivoting kernel with a running time of O(n). This composition ensures that the order of
the outer loop is maintained, and the remaining tasks are performed in parallel.
1. for (int k = 0; k < (int)a->width; k++) {
2.
3.     // setup execution parameters, for (int i = k + 1; i < n; i++)
4.     int threads = a->width - k;
5.     int gridX = (threads + THREADS_PER_BLOCK-1) / THREADS_PER_BLOCK;
6.
7.     if (pivot) {
8.         lud_simple_pivot<<< 1, 1 >>>( d_lu, lu->width, lu->height, k);
9.     }
10.
11.     // Calculate scale factors for column k
12.     lud_simple_calc_scale_factor<<< gridX, THREADS_PER_BLOCK >>>( d_lu, lu->width, lu->height, k);
13.
14.     // Calculate new column values with scale factor
15.     lud_simple_compute_row<<< gridX, THREADS_PER_BLOCK >>>( d_lu, lu->width, lu->height, k);
16.
17. }
The function call in line 12 calculates the multipliers of a given column on the device. Line
15 performs the row operations with the multipliers. The kernels being called and their logic
are shown here:
__global__ void lud_simple_calc_scale_factor(float *lu, int luWidth, int luHeight, int k) {

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int i = k + 1 + tid;

    if (i < luHeight)
    {
        // Calculate rik scale factor and insert to Lower triangle
        lu[i * luWidth + k] /= lu[k * luWidth + k];
    }
}

__global__ void lud_simple_compute_row(float *lu, int luWidth, int luHeight, int k) {

    // Id of the row
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int i = k + 1 + tid;

    if (i < luHeight) {

        // Load rik scale factor, can be cached in shared memory
        float rik = lu[i * luWidth + k];

        // Subtract row k elements from row i elements with the Rik scale factor
        for (int c = k + 1; c < luWidth; c++)
        {
            lu[i * luWidth + c] -= rik * lu[k * luWidth + c];
        }
    }
}
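The launch geometry used by the host loop and the two kernels can be tried out on the CPU. The following C sketch is an assumption made for illustration and not part of the implementation: the loops over b and t stand in for blockIdx.x and threadIdx.x, THREADS_PER_BLOCK is set to a deliberately small hypothetical value, and the names grid_size, emulate_step and emulate_lud_simple are invented here.

```c
#define THREADS_PER_BLOCK 4 /* hypothetical; kept small so several blocks are needed */

/* Blocks needed so that blocks * threads_per_block >= threads (ceiling division). */
int grid_size(int threads, int threads_per_block)
{
    return (threads + threads_per_block - 1) / threads_per_block;
}

/* CPU stand-in for one outer-loop step: the block and thread loops replace
 * blockIdx.x and threadIdx.x, and the i < height guard discards surplus threads. */
void emulate_step(float *lu, int width, int height, int k)
{
    int blocks = grid_size(width - k, THREADS_PER_BLOCK);

    /* First "kernel": scale factors for column k. */
    for (int b = 0; b < blocks; b++)
        for (int t = 0; t < THREADS_PER_BLOCK; t++) {
            int i = k + 1 + t + b * THREADS_PER_BLOCK;
            if (i < height)
                lu[i * width + k] /= lu[k * width + k];
        }

    /* Second "kernel": row operations, started only after the first has finished. */
    for (int b = 0; b < blocks; b++)
        for (int t = 0; t < THREADS_PER_BLOCK; t++) {
            int i = k + 1 + t + b * THREADS_PER_BLOCK;
            if (i < height) {
                float rik = lu[i * width + k];
                for (int c = k + 1; c < width; c++)
                    lu[i * width + c] -= rik * lu[k * width + c];
            }
        }
}

/* Host loop: the order of the outer loop is preserved, one step at a time. */
void emulate_lud_simple(float *lu, int n)
{
    for (int k = 0; k < n; k++)
        emulate_step(lu, n, n, k);
}
```

The ceiling division usually launches more threads than there are rows left, which is why both kernels need the bounds check before touching memory.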
7.2.2 Test and results
Tests were performed with the data sizes 400, 2000, 4000, 6000 and 10,000, where the data
size is both the width and height of the matrix being decomposed. The element values are
randomly generated, which means that, although unlikely, an LU-factorisation may not exist.
Pivoting, meaning row interchanges, is applied both to ensure numerical stability and
to ensure that an LU-factorisation does exist.
Multipliers and row operations
The test was performed on the four different platforms, as shown in the following graph.
Figure 8 - Performance of simple LU-decomposition on different platforms.
The graph indicates that the peak performance measured in gigaflops depends on the data
size. The performance increases for matrices of increasing sizes up to about 2000 × 2000,
after which it levels off for all platforms. The peak GPU performance of the fastest platform
was about 0.5 gigaflops, which is slow compared to a peak CPU performance of 2.44
gigaflops.
[Figure 8: y-axis gigaflops (0.00 to 0.60); x-axis data size (0 to 10,000); series: Platform #1, Platform #2, Platform #3, Platform #4]
Kernel invocation overhead
If pivoting is used, three kernels are invoked for each step $k_i$. For a 10,000 × 10,000
matrix this amounts to 30,000 kernel invocations. Naturally, kernel invocations incur
overhead, but how much will 30,000 invocations influence the total result? Vasily Volkov et
al. have measured the kernel launch overhead for various systems and GPUs [20]. For
synchronised kernel invocations the times measured were between 10 and 14 µs; for
asynchronous kernel invocations the timings were 3 to 7 µs. The following table shows the
kernel invocation overhead as a ratio of the fastest running times on systems #3 and #4.
System                            #3              #4
Fastest result (n=10,000)         15,736.55 ms    32,784.64 ms
Asynchronous (low=900 ms)         5.72%           2.75%
Asynchronous (high=2,100 ms)      13.34%          6.41%
Synchronous (low=3,000 ms)        19.06%          9.15%
Synchronous (high=4,200 ms)       26.69%          12.81%
Table 9 - Kernel invocation overhead ratio of total running time
This table shows that kernel invocations should not be disregarded when implementing an
algorithm, because their contribution to the total running time can be relatively high. This
obviously depends on the algorithm, but for this LU-decomposition implementation, the
contribution is as high as 26.69%.
It is evident that asynchronous invocations are faster than synchronous ones, and hence
represent a lower percentage of the total running time. So where possible, asynchronous
functions should be used.
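The arithmetic behind such an overhead estimate can be written down in a few lines of C. This is a hedged sketch with hypothetical function names (launch_overhead_ms and overhead_percent). Note that 30,000 launches at 3 to 14 µs each come to 90 to 420 ms, whereas the totals listed in Table 9 are ten times larger; the helpers therefore take both the per-launch cost and the overhead total as parameters, so that either reading can be checked.

```c
/* Total launch overhead in milliseconds for `launches` kernel invocations
 * that each cost `us_per_launch` microseconds. */
double launch_overhead_ms(long launches, double us_per_launch)
{
    return (double)launches * us_per_launch / 1000.0;
}

/* Overhead as a percentage of the total running time. */
double overhead_percent(double overhead_ms, double total_ms)
{
    return 100.0 * overhead_ms / total_ms;
}
```

For example, overhead_percent(900.0, 15736.55) reproduces the 5.72% entry of Table 9 for system #3, while launch_overhead_ms(30000, 3.0) evaluates to 90 ms.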
GPU.NET
The LU-decomposition implementation requires a matrix to be copied to device memory
initially; several kernel calls then compute the result and update the data, before the
matrix is copied back to the host. GPU.NET does not currently allow data on the device to be
modified by multiple kernel calls, so testing LU-decomposition through GPU.NET would not be
relevant, as data transfers would severely impact performance.
7.3 Block LU-decomposition
One of the first optimisation strategies suggested by the matrix-multiplication chapter was
to divide the problem into smaller pieces that fit into cache memory. Jack Dongarra et al.
have developed a block LU algorithm, called the right-looking algorithm, that inherently
supports tiling. Please note that the term block in a block algorithm is not the same as the
blocks that are part of a Cuda grid and are used to define thread granularity. A Cuda block
will from here on be referred to as a thread block.
7.3.1 The block algorithm
By partitioning an $M \times M$ matrix A, the factorisation LU may be partitioned as shown [22]:
$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix}$$
The usual rules of matrix-multiplication hold for block matrices, so we can write:
1. $A_{11} = L_{11} U_{11}$
2. $A_{21} = L_{21} U_{11}$
3. $A_{12} = L_{11} U_{12}$
4. $A_{22} = L_{21} U_{12} + L_{22} U_{22}$
where $r \times r$ is the block size, $A_{11}$ is $r \times r$, $A_{12}$ is $r \times (M - r)$, $A_{21}$ is
$(M - r) \times r$ and $A_{22}$ is $(M - r) \times (M - r)$. The first step is based on lemmas 1 and 2:
by performing a normal LU-decomposition on $A_{11}$ and $A_{21}$ combined, $L_{11}$, $U_{11}$ and
$L_{21}$ become known.
Step 2 uses lemma 3 and a triangular solve method, which results in the matrix $U_{12}$. In
step 3, rearranging lemma 4 gives $L_{22} U_{22} = A_{22} - L_{21} U_{12} = A'_{22}$, which shows
that $L_{22}$ and $U_{22}$ can be found by LU-decomposing $A'_{22}$. This can be achieved by
applying the above steps to $A'_{22}$.
After $M/r$ steps the matrix A has been decomposed by using a block LU-decomposition,
as depicted here. The white parts have already been solved.
Figure 9 – Matrix A being decomposed by block LU-decomposition in steps.
Figure 9 shows that the height and width of the matrix is $M$, $k$ is the current step, and the
block width is the dimension of the current block being processed (here the green
sub-matrix). Step 1 solves the green and purple sub-matrices by regular LU-decomposition.
As the second step, the lower triangular matrix of the block (L of the green block) is used to
triangular solve the cyan sub-matrix. In the third step, the blue sub-matrix is found by regular
matrix-multiplication of the purple and cyan sub-matrices, subtracting the resulting values from
the current elements in the blue sub-matrix. The steps are then repeated for the remaining
parts until the whole $M \times M$ matrix is processed.
As this algorithm makes it possible to partition large matrices and solve smaller parts, and
therefore exploit shared memory, this algorithm will be implemented using Cuda and used for
testing.
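The three steps can be sketched as a plain CPU routine in C. This is an illustrative assumption rather than the Cuda implementation: the function name lu_block and the helper min_sz are hypothetical, pivoting is omitted, r denotes the block width, and the matrix is stored row-major in a single array.

```c
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked right-looking LU factorisation without pivoting.
 * a is an M-by-M row-major matrix that is overwritten by its LU factors;
 * r is the block width. */
void lu_block(float *a, size_t M, size_t r)
{
    for (size_t k = 0; k < M; k += r) {
        size_t kb = min_sz(r, M - k); /* width of the current block */
        size_t e = k + kb;            /* first row/column after the block */

        /* Step 1: regular LU of the panel (diagonal block plus the block below it). */
        for (size_t j = k; j < e; j++) {
            for (size_t i = j + 1; i < M; i++) {
                float l = (a[i * M + j] /= a[j * M + j]);
                for (size_t c = j + 1; c < e; c++)
                    a[i * M + c] -= l * a[j * M + c];
            }
        }

        /* Step 2: triangular solve for the block to the right of the diagonal
         * block, one column at a time (L of the diagonal block has a unit diagonal). */
        for (size_t c = e; c < M; c++)
            for (size_t i = k + 1; i < e; i++)
                for (size_t j = k; j < i; j++)
                    a[i * M + c] -= a[i * M + j] * a[j * M + c];

        /* Step 3: trailing update, subtracting the product of the lower-left
         * and upper-right blocks from the remaining sub-matrix. */
        for (size_t i = e; i < M; i++)
            for (size_t c = e; c < M; c++)
                for (size_t j = k; j < e; j++)
                    a[i * M + c] -= a[i * M + j] * a[j * M + c];
    }
}
```

Each column in step 2 and each element in step 3 is independent of the others, which is what makes these two steps natural candidates for GPU kernels. In exact arithmetic the result does not depend on the choice of r.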
7.3.2 Implementation
This algorithm, as shown above, consists of three steps, which the implementation must also
follow. The first step, to LU-decompose the $M \times r$ panel (with $r$ the block width), is covered
by an optimised kernel of the simple algorithm (lud_block_scale). This part also includes the
optimised pivoting kernels (lud_block_pivot, lud_block_pivot_L2 and lud_block_swap).
The second step requires a triangular solving kernel (lud_block_triangular_solve), and the
last step is a regular matrix-multiplication kernel (lud_block_matrixMultiplication), which
has already been implemented and optimised in chapter 6 from page 29.
Pivoting
In LU-decomposition, pivoting is performed for each column. Instead of the simple algorithm,
which had a running time proportional to n, the parallel nature of Cuda can be exploited to
implement a reduction pivoting algorithm with a running time of O(log(n)). This required two
pivoting kernels; the first reduces the current column of the matrix and
saves the result to a temporary pivoting array on the device. The second kernel does the
same, but works on the temporary pivot array instead of the matrix.
The first kernel is shown here, and has already been optimised with a focus on memory
coalescing and a form of tiling. In lines 20 and 21 the individual threads load a value
from global memory to shared memory. This data is then processed in lines 27 to 37, while
the threads synchronise data access for consistency. In line 42, the first thread of each
thread block in the grid writes the pivoting index to the temporary array.
1. __global__ void lud_block_pivot(int *out, float *a, int M, int k, int max)
2. {
3.     extern __shared__ float shared[];
4.     float* max_cache = (float*)shared;
5.     int* idx_cache = (int*)&shared[blockDim.x];
6.
7.     unsigned int tx = threadIdx.x;
8.     unsigned int i = blockIdx.x * blockDim.x + tx + k; // Get row index
9.
10.     unsigned int idx = i * M;
11.
12.     // Clear cache for threads that exceeds max + they should not
13.     //influence result
14.     max_cache[tx] = 0;
15.     idx_cache[tx] = -1;
16.
17.     if (i < M)
18.     {
19.         // Read value + set row index
20.         max_cache[tx] = abs(a[idx + k]);
21.         idx_cache[tx] = i;
22.
23.         // Sync threads to make sure all other also have loaded values
24.         __syncthreads();
25.
26.         // Do the actual pivot finding
27.         for(unsigned int stride = blockDim.x/2; stride>0; stride>>=1)
28.         {
29.             if (tx < stride && (stride+tx+k) < M && max_cache[tx] < max_cache[tx + stride])
30.             {
31.                 max_cache[tx] = max_cache[tx + stride]; // Update value
32.                 idx_cache[tx] = idx_cache[tx + stride]; // Update index
33.             }
34.
35.             // Sync threads
36.             __syncthreads();
37.         }
38.
39.         // The first thread should write result from block to output
40.         if (tx == 0)
41.         {
42.             out[blockIdx.x] = idx_cache[0]; // Load index to output
43.         }
44.     }
45. }
Swapping rows
If a pivoting row has been identified, the indices of the two rows are then transferred to the
device, by calling a kernel for swapping rows. By swapping the rows on the device, a transfer
of the matrix to and from the host is avoided. Several threads can be used to swap the rows,
and by aligning the memory access correctly, the memory access is coalesced.
LU-factorisation
The algorithm described by Stewart [17] is parallelised by making threads process individual
rows. To optimise the performance, the $k$th row is loaded to shared memory and accessed by
all threads, as the grid size is 1 × 1. This means that the $k$th row is only loaded to shared
memory once. With just one block, the disadvantage is that Cuda is not able to
hide memory access latency by switching to other active blocks. I will later test whether this
approach performs well, or whether another approach with several thread blocks performs better.
Triangular solving
Triangular solving for $U_{12}$ can be performed row-wise or column-wise. I have chosen
column-wise, as the memory access of the threads in a block will be coalesced. This part is
well suited for GPU processing, because each column can be processed independently.
1. __global__ void lud_block_triangular_solve(float *a, int M, int k, int LU_BlockDim)
2. {
3.     extern __shared__ float y[];
4.
5.     int tx = threadIdx.x;
6.     int tid = blockIdx.x * blockDim.x + tx;
7.     int column = tid + k + LU_BlockDim;
8.
9.     if (column < M)
10.     {
11.         for (int r = 0; r < LU_BlockDim; r++) // For each row in block
12.         {
13.             float res = a[(r+k) * M + column];
14.             for (int c = 0; c < r; c++) // 0<=c<r, so below diagonal
15.                 res -= a[(r+k) * M + c + k] * y[tx * LU_BlockDim + c];
16.             y[tx * LU_BlockDim + r] = res;
17.         }
18.
19.         for (int r = 0; r < LU_BlockDim; r++)
20.             a[(r+k) * M + column] = y[tx * LU_BlockDim + r];
21.     }
22. }
Each thread uses shared memory to calculate the resulting column values (lines 3, 15, 16 and
20). The size of the thread blocks, and thereby the needed shared memory, is not known at
compile time. But as shared memory can be dynamically allocated, this is not a problem.
Shared memory is fast, but registers are even faster. If the thread block size were known at
compile time, it could prove beneficial to use registers instead of shared memory.
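Because the kernel indexes y[tx * LU_BlockDim + r], each thread block needs room for blockDim.x × LU_BlockDim floats. The following C sketch computes that byte count on the host; the function names are hypothetical, the block size of 64 threads in the usage note is only an example, and the 16 KB limit refers to the shared memory per multiprocessor on compute capability 1.x devices.

```c
#include <stddef.h>

/* Bytes of dynamic shared memory needed by one thread block of the
 * triangular solve kernel: one float per thread and per row of the block. */
size_t triangular_solve_shared_bytes(size_t threads_per_block, size_t lu_block_dim)
{
    return threads_per_block * lu_block_dim * sizeof(float);
}

/* Returns 1 if the request fits within limit_bytes of shared memory, else 0. */
int fits_in_shared(size_t bytes, size_t limit_bytes)
{
    return bytes <= limit_bytes;
}
```

With 64 threads and a block dimension of 20 the request is 5,120 bytes, well below 16 KB, whereas 256 threads would need 20,480 bytes and would not fit on a 1.x device. The computed value is what would be passed as the dynamic shared memory size in the kernel launch configuration.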
Matrix-multiplication
The kernel for performing matrix-multiplication is based on tiling v3, which includes tiling,
pre-fetching and memory coalescing optimisations. For any details about this kernel please
turn to paragraph 6.3.4 on page 42.
7.3.3 Test and results
The initial block algorithm was tested on the four different platforms with data sizes
ranging from 400 × 400 to 10,000 × 10,000 matrices. For comparison the same data sizes were
tested on the CPU.
The test with the largest matrix was not performed on platform #1, due to memory
limitations, nor on the CPU, as the running time would be too high. For
comparison I have added a qualified projection to the graph, which shows that platform #3
had a peak performance of 14.37 gigaflops for a 10,000 × 10,000 matrix.
Figure 10 - Performance of block LU-decomposition v1 on different platforms.
The graph shows two important things. Firstly, the GPU architecture is in fact able to perform
LU-decomposition faster than the CPU for larger matrices. The specific speed of the platform
determines for which data sizes the GPU is faster, but looking at the graph, the crossover
happens somewhere between 1,000 and 3,000. Secondly, the peak performance of the
algorithm is almost proportional to the data size for these tests. Obviously the peak
performance cannot keep increasing proportionally to the data size; there must be an upper
limit. But it makes sense that when n increases, even more operations can be performed in
parallel.
Several of the tests were also performed by the CPU as a comparison. The average difference
was 0.104211 and the maximum and minimum differences were respectively 0.332855 and
0.0. The 0.0 differences were only achieved on platform #4 with a compute capability of 2.1.
Profiling
The Nvidia Compute Visual Profiler is a tool that allows profiling of a Cuda program. The GPU
time summary plot indicates which part of an algorithm could be optimised to the greatest
effect. The following figure shows how much computing time each kernel uses.
[Plot data for Figure 10: y-axis gigaflops (0.0 to 16.0); x-axis data size (0 to 10,000); series: Platform #1, Platform #2, Platform #3, Platform #4, CPU]
Figure 11 - Computing time of each kernel in block LU-decomposition v1 on platform #4.
Almost 90% of the time is spent in the regular LU-decomposing kernel, so optimising this part
should have the best effect on the total running time.
7.3.4 Optimising round 1
Having determined that the kernel lud_block_scale has the greatest potential for performance
optimisation, I will now analyse the source code to identify performance limiting factors.
The kernel is, for each $k$th iteration, called with one thread block and the LU block size
in threads, which normally is 20. This means that only 20 threads are running in parallel at
any given time, but as the warp size for the G80 and GT200 architectures is 32, the active
warp is padded with 12 empty threads that do not process any data.
1. __global__ void lud_block_scale(float *a, int M, int k)
2. {
3.     extern __shared__ float ac[];
4.
5.     int aWidth = M;
6.     int tx = threadIdx.x;
7.     int end = min( blockDim.x, M-k );
8.
9.     ac[tx] = a[k * aWidth + k + tx]; // Load k row to shared memory, as it is used across threads
10.
11.     // Sync threads to make sure all other also have loaded values
12.     __syncthreads();
13.
14.     for(int i = k+1 + tx; i < M; i+=blockDim.x) { // Foreach row
15.
16.         // Compute scale factor Rik, 1 operation=divide
17.         float rik = (a[i * aWidth + k] /= ac[0]);
18.
19.         for (int c = 1; c < end; c++) // Foreach column value in row
20.             a[i * aWidth + k + c] -= rik * ac[c];
21.     }
22. }
Another factor in this kernel is its dependency on global memory. All threads load a value
from the $k$th row into shared memory in line 9. Both the read from global memory
and the write to shared memory are coalesced, which is good. The cached values are then
used to calculate the upper and lower triangular matrices in lines 14 to 20. These loops rely
heavily on global memory access, and have only a few operations to hide latency.
CGMA
The core parts of the kernel are lines 17 and 20. Consider line 17: a memory load and a write,
combined with a single divide operation, give a CGMA of 0.5. Line 20 has a memory load and
a write, together with an addition and a multiplication, which gives a CGMA of 1.0. On
platform #1 the global memory has a peak bandwidth of 16.6 GB/sec. This corresponds to
4.15 giga single-precision values per second (16.6/4). Line 20 is the dominant part,
but line 17 cannot be ignored, so taking the CGMA ratios of 1.0 and 0.5 into account, this
kernel will execute at no more than between 2.075 and 4.15 gigaflops on platform #1 [4].
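The bound follows from a one-line calculation, shown here as a small C helper. The function name memory_bound_gflops is hypothetical, and the sketch assumes 4-byte single-precision values and a kernel that is limited purely by global memory bandwidth.

```c
/* Upper bound on single-precision gigaflops for a kernel limited by global
 * memory bandwidth: values transferred per second times the CGMA ratio
 * (compute operations per global memory access). */
double memory_bound_gflops(double bandwidth_gb_per_s, double cgma)
{
    const double bytes_per_float = 4.0;
    return bandwidth_gb_per_s / bytes_per_float * cgma;
}
```

With the 16.6 GB/sec of platform #1, a CGMA of 1.0 gives 4.15 gigaflops and a CGMA of 0.5 gives 2.075 gigaflops, matching the range stated above.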
Hiding latency
The low CGMA suggests that this kernel is limited by memory, but in its current form it is
possible to improve on latency hiding. One way is to assign more work to the streaming
processors and let them continue working on another warp while the first warp waits for
data. The updated thread block size is 64, and the needed number of blocks would be
$(M - k)/64$, where M is the height of the matrix and k the current iteration. The kernel looks
like this:
1. __global__ void lud_block_scale_v2(float *a, int M, int k, int end)
2. {
3.     extern __shared__ float ac[];
4.
5.     int aWidth = M;
6.     int tid = blockIdx.x * blockDim.x + threadIdx.x;
7.
8.     // Load k row to shared memory, as it is used across threads
9.     ac[threadIdx.x] = a[k * aWidth + k + threadIdx.x];
10.
11.     // Sync threads to make sure all other also have loaded values
12.     __syncthreads();
13.
14.     int i = k+1 + tid; // Row index
15.     if (i < M)
16.     {
17.         // Compute scale factor Rik, 1 operation=divide
18.         float rik = (a[i * aWidth + k] /= ac[0]);
19.
20.         for (int c = 1; c < end-k; c++) // Foreach column value in row
21.             a[i * aWidth + k + c] -= rik * ac[c];
22.     }
23. }
But this is not the only benefit of this update. A for loop is a control flow element that
is often part of a kernel. When doing operation counting analysis of a kernel, the operations
contributed by for loops are often overlooked. Consider line 20 in the kernel above: for every
iteration the c++ operation and the c < end-k comparison are performed. Unrolling loops is
another way of increasing the performance of a kernel, and this is what has been
achieved with this kernel compared to the former version, whose row loop in line 14 has been
replaced by one thread per row.
7.3.5 Test and results
This updated block algorithm was tested again on the four different platforms, with data
sizes ranging from 400 × 400 to 10,000 × 10,000 matrices. For comparison the same data sizes
were also tested on the CPU.
Figure 12- Performance of block LU-decomposition v2 on different platforms.
[Figure 12: y-axis gigaflops (0 to 35); x-axis data size (0 to 10,000); series: Platform #1, Platform #2, Platform #3, Platform #4, CPU]
This optimised algorithm is faster than the former one: the peak performance of platform #3
is now 31.51 gigaflops, which is 2.19 times as fast. This also shows that even smaller matrices
can benefit from being processed by the GPU architecture.
Several of the tests were also performed by the CPU as a comparison. The average difference
was 0.103360 and the maximum and minimum differences were respectively 0.314285 and
0.002296. The Cuda architectures with higher compute capability do not seem to have smaller
deviations from the CPU reference result.
Profiling
Focus should be on hiding latency, which can be achieved by increasing the number of active
blocks or by data pre-fetching. The following graph shows the updated computing time, when
the number of threads and blocks has been adjusted.
Figure 13 - Computing time of each kernel in block LU-decomposition v2 on platform #4.
This profiling result indicates that optimisation of the lud_block_scale and the
lud_block_matrixMultiplication kernels could have the highest performance effect. So this
is what I will look into as next step.
7.3.6 Optimising round 2
The matrix-multiplication kernel used has already been well optimised, as it is based on the
work and results from the matrix-multiplication analysis in chapter 6 (from page 29). But for
simplicity, this kernel was not optimised with Volkov's suggestion of several outputs per
thread. This will be the next step to test.
The details of how this was implemented have already been described, so please refer to
the chapter about matrix-multiplication for any details about this optimisation.
The result of the optimised algorithm is shown below. The peak performance for platform
#3 is now 42.89 gigaflops for a 10,000 × 10,000 matrix, which is an increase of 1.32 times.
Figure 14 - Performance of block LU-decomposition v3 on different platforms.
[Figure 14: y-axis gigaflops (0 to 45); x-axis data size (400 to 10,000); series: Platform #1, Platform #2, Platform #3, Platform #4, CPU]
7.3.7 Further optimisation
The algorithm consists of 6 kernels covering pivoting and the three steps in block
LU-decomposition. Pivoting, regular LU-decomposition and matrix-multiplication have been
optimised; the only kernel left in its original version is lud_block_triangular_solve, which
will be one of two parts for an optimisation attempt. The other part is the optimised kernel
lud_block_scale, which still accounts for the highest GPU time consumption.
Triangular solve
The first version of the triangular solve implementation was well balanced with regard to grid
and thread block size. Even though, according to the profiling result above, an
optimisation would only affect about 3% of the GPU running time, I chose to try to optimise
this kernel further. I was fully aware that any optimisation would have a limited impact on
peak performance, but this would still be a good exercise and give insight into the analysis
and improvement of a kernel.
The focus was on unrolling a loop (line 20 was performed in a separate loop in the former
version) and coalescing memory access (lines 17 and 19 are now coalesced), and as expected the
optimisation did not yield any significant change in peak performance.
1. __global__ void lud_block_triangular_solve_v2(float *a, int M, int k,
                                                 int LU_BlockDim)
2. {
3.     extern __shared__ float y[];
4.
5.     int tid = blockIdx.x * blockDim.x + threadIdx.x;
6.     int column = tid + k + LU_BlockDim;
7.
8.     if (column < M)
9.     {
10.         for (int r = 0; r < LU_BlockDim; r++) // For each row in block
11.         {
12.             int rkM = (r + k) * M; // Offset of row r + k in the row-major matrix
13.             float res = a[rkM + column];
14.
15.             for (int c = 0; c < r; c++) // 0<=c<r, so below diagonal
16.                 res -= a[rkM + c + k] *
17.                        y[c * LU_BlockDim + threadIdx.x];
18.
19.             y[r * LU_BlockDim + threadIdx.x] = res;
20.             a[rkM + column] = res;
21.         }
22.     }
23. }
The result was limited as expected, but if I were to improve this kernel further, I would
focus on testing whether Volkov’s suggestion (1 thread = 2 outputs) would have any positive
effect. Another improvement would focus on the global memory access in lines 13 and 16.
The elements of the lower triangular matrix of the current block being processed (L of the
orange sub-matrix) could be copied to shared memory, so that the reads in lines 13 and 16
are served from the faster shared memory instead of global memory.
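As a sequential reference for what the kernel above computes, the forward substitution can be sketched on the host in C++. This is an illustration only: the function name triangular_solve_reference, the square row-major layout and the unit lower triangular diagonal block are assumptions, and a local vector stands in for the shared memory array y.

```cpp
#include <cassert>
#include <vector>

// Host-side sketch of the work done by the triangular solve kernel.
// Assumptions: 'a' is an M x M row-major matrix, the diagonal block starting
// at (k, k) holds a unit lower triangular L, and every column to the right of
// the block is solved independently (one GPU thread per column in the kernel).
void triangular_solve_reference(std::vector<float>& a, int M, int k, int blockSize)
{
    assert(static_cast<int>(a.size()) == M * M);
    assert(k >= 0 && blockSize > 0 && k + blockSize <= M);

    std::vector<float> y(blockSize);              // plays the role of shared memory

    for (int column = k + blockSize; column < M; ++column)
    {
        for (int r = 0; r < blockSize; ++r)       // for each row in the block
        {
            const int rowOffset = (r + k) * M;
            float res = a[rowOffset + column];
            for (int c = 0; c < r; ++c)           // entries below the diagonal of L
                res -= a[rowOffset + c + k] * y[c];
            y[r] = res;
            a[rowOffset + column] = res;
        }
    }
}
```

Comparing the output of such a reference with the GPU result for a small matrix is one way to check a kernel change before measuring its performance.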
Figure 15 – Showing the sub-matrix part of the triangular solve method.
Regular LU-decomposition
Focusing on the kernel that accounts for the highest GPU consumption time makes sense if
this part of the algorithm can be further optimised. The former version of the kernel focused
on hiding latency by increasing the number of thread blocks. Hiding latency can also be
achieved by using data prefetching and by minimising the need for global memory access.
An attempt was made to improve the performance of the kernel lud_block_scale by using
registers to hold indices computed several times, by using data prefetching and by applying
the Volkov suggestion (1 thread = 2 outputs). Unfortunately no performance gains were
achieved compared to the former versions; in fact, a minor performance loss was the
result.
Figure 16 – A 10,000 × 10,000 matrix LU-decomposed on platform #3 and #4 (gigaflops for kernel editions v4, v5 and v6).
So based on these results, applying an optimisation method can sometimes lead to a
performance loss. The reason is that these methods add restrictions to the kernel, which
results in extra boundary checks and thereby more complex control flow and more
operations.
An improvement should therefore always be implemented with care and followed by testing, to
determine whether it actually yields better performance.
Correctness
Several tests with different data sizes were also performed by the CPU as a reference for the
GPU-computed results. The average deviation was 0.062963 and the maximum and minimum were
respectively 0.309113 and 0.0. The 0.0 was only achieved by platform #4, with a Cuda
compute capability of 2.1. The deviations increased proportionally to the data size, which
makes good sense: any inaccuracy affects the result of every subsequent iteration, and the
larger the matrix, the more iterations are needed.
7.3.8 Large matrices
The test results have so far indicated that the peak performance of the block
LU-decomposition increases with the data size, but there must be a maximum where the
architecture limits performance.
My curiosity drove me to find this limit, so I updated the program to support very large
matrices with sizes from 400 × 400 up to 20,000 × 20,000. A single-precision matrix of this
size requires about 1525 MB (20,000 × 20,000 × 4 bytes) of both host and device memory,
which only platform #3, with 4 GB, can provide.
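The memory requirement quoted above follows directly from the matrix dimensions. A minimal sketch of the calculation, where the helper name matrix_megabytes is hypothetical and a 4-byte float is assumed:

```cpp
#include <cassert>
#include <cstddef>

// Storage needed for an n x n single-precision matrix, in MB (2^20 bytes).
// Assumes sizeof(float) == 4, as on the platforms used in this report.
double matrix_megabytes(std::size_t n)
{
    assert(n > 0);
    const double bytes = static_cast<double>(n) * static_cast<double>(n) * sizeof(float);
    return bytes / (1024.0 * 1024.0);
}
```

For n = 20,000 this gives roughly 1526 MB, in line with the figure above. The same amount is needed once on the host and once on the device.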
Figure 17 - Peak performance of LU-decomposition v3 on platform #3
The graph above shows the peak performance in gigaflops for different data sizes. Based on
the results I have added a projection for matrices up to 35,000 × 35,000 in size. The results
and the projection show that the v3 algorithm should reach a peak performance of about
51-52 gigaflops on system #3.
It is difficult to calculate the theoretical performance of the block LU-decomposition
implementation, because the computations are divided into 6 different kernels. Each kernel
has its own CGMA and its own share in solving the full problem, but I will try to approximate it.
The kernels with a CGMA between 0.5 and 1.0 take up 48% of the running time, and the
matrix-multiplication kernel has a CGMA of 20 and makes up about 17%. The remaining kernels
have a CGMA of about 1.0. These numbers have been retrieved from the Nvidia profiler shown in
Figure 13, and the approximated range is found by these two equations.
\[ \mathrm{CGMA}_{low} = 0.5 \cdot 48\% + 20.0 \cdot 17\% + 1.0 \cdot 35\% = 3.99 \]
\[ \mathrm{CGMA}_{high} = 1.0 \cdot 48\% + 20.0 \cdot 17\% + 1.0 \cdot 35\% = 4.23 \approx 4.0 \]
So the approximated CGMA is 4.0. The peak global memory bandwidth for platform #3 is
102.4 GB/sec, which corresponds to 25.6 giga single-precision values per second (102.4 / 4
bytes). With a CGMA of 4.0, the theoretical peak performance of this algorithm on this
platform is 25.6 × 4.0 = 102.4 gigaflops, for fully coalesced memory access. The actual
performance is about half, namely 52 gigaflops.
There are several factors influencing this result: one is that not all memory loads and
writes are coalesced, another is the extra instructions processed due to control flow
complexity. But the result indicates that memory access is not the limiting factor on
performance for these kernels, but something else is.
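The estimate above can be reproduced with a few lines of C++. The sketch below is illustrative: the type KernelShare and both function names are hypothetical, and the time shares are the rounded percentages quoted in the text.

```cpp
#include <cassert>
#include <vector>

// One group of kernels: its CGMA ratio and its share of the total running time.
struct KernelShare
{
    double cgma;
    double timeShare;   // fraction of the running time, e.g. 0.48 for 48%
};

// Time-weighted average CGMA over all kernel groups.
double weighted_cgma(const std::vector<KernelShare>& kernels)
{
    double total = 0.0;
    for (const KernelShare& k : kernels)
        total += k.cgma * k.timeShare;
    return total;
}

// Upper bound in gigaflops when global memory bandwidth is the only limit:
// (GB/s divided by bytes per float) gives giga values per second, and each
// value loaded is used in 'cgma' floating point operations.
double bandwidth_bound_gflops(double bandwidthGBs, double cgma)
{
    assert(bandwidthGBs > 0.0 && cgma > 0.0);
    return bandwidthGBs / sizeof(float) * cgma;
}
```

With the shares above, the weighted CGMA lands between 3.99 and 4.23, and a bandwidth of 102.4 GB/sec with a CGMA of 4.0 gives the 102.4 gigaflops bound.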
7.4 Evaluation
The LU-decomposition algorithms have given valuable insight into some optimisation methods
that work and some that do not.
Reducing the number of kernel invocations, or using asynchronous functions, can reduce the
total running time. With 30,000 kernel calls, the total invocation time could be reduced from
3-4.2 seconds to 0.9-2.1 seconds. I do not think that reducing kernel calls should be a primary
focus, but just something that a developer should be aware of when implementing an
algorithm for Cuda.
To base the implementation on a block algorithm increased the performance for two reasons.
First, the problem size is reduced to pieces that can exploit faster memory types, and second
the operation matrix-multiplication is highly parallel, and had already been optimised for the
Cuda architecture.
These tests also showed that when a kernel was invoked, using several thread blocks is better
than just using one. One reason could be that the warp scheduler can utilise multiple SMs for
solving the problem.
Unrolling a loop together with Volkov’s suggestion for matrix-multiplication also helped
increase performance. The last part is a bit surprising, because Volkov’s suggestion on
matrix-multiplication actually led to a performance decrease on system #3.
Other tests revealed that data prefetching and the tiling strategy did not actually increase
performance, but left it without any major change. This fact promotes the notion described
above, that memory is not the limiting factor.
The correctness tests also confirmed that instructions have a higher degree of precision on CC
v2.0 than on earlier versions.
If I were to optimise LU-decomposition even further, I would focus on arithmetic
optimisations. This could be achieved by, among other things, unrolling loops, minimising
control flow complexity and removing unnecessary synchronisation points.
8 QR decomposition
QR-decomposition, also known as QR-factorisation, is a decomposition of the $m \times n$ matrix A,
with $m \geq n$, in the form:
\[ A = QR \]
where Q is an $m \times m$ orthogonal matrix and R is an $m \times n$ upper triangular matrix. An
orthogonal matrix satisfies
\[ Q^T Q = I \]
which implies
\[ Q^{-1} = Q^T \]
There are different methods for calculating the QR factorisation, which can be used to solve
linear systems and least squares problems [23][24].
8.1 Analysis
The different methods for decomposing matrix A into a QR factorisation include Gram-
Schmidt, Householder reflections and Givens rotations.
The classic Gram-Schmidt process is considered to be subject to numerical instability. The
modified Gram-Schmidt algorithm overcomes this numerical instability, but at the expense of
adding extra operations [23][25]. For these reasons I will consider neither the classic nor
the modified version.
Operations count analysis of both Householder reflections and Givens rotations show that
Givens rotations require about 50% more operations than Householder transformation [26].
Besides that, Givens rotations rely heavily on sine and cosine instructions, which will be
processed by the limited SFU. I have therefore decided to base the QR-decomposition
algorithm on Householder transformations.
But there are also other advantages. Firstly, the parallelisation is similar to LU-decomposition,
so I expect to draw on the parallel optimisation experiences from the LU-decomposition chapter
[24]. Secondly, Householder QR can use a compressed data storage form, by using the original
matrix A and an additional array for the diagonal values of R [23]. Consider the matrix A in
Figure 18: the nonzero parts of the vectors $v_i$ are stored in A along with the upper triangular
matrix R.
Figure 18 - Storage strategy for the compressed Householder QR-factorisation
The diagonal of R is stored in an extra vector. If the actual $Q$ or $Q^T$ is ever needed, they
can be computed from this compressed representation [25].
8.1.1 The sequential algorithm
The sequential algorithm consists of 6 loops that overwrite the existing matrix with the
Householder vectors and the upper triangular matrix R. The diagonal elements of R are stored
in the array d. The elements of the matrix are stored vectorised in the array qr. Pivoting can
be used to ensure numerical stability, but has been left out for simplicity reasons.
1. // Core algorithm for QR Decomposition (Householder transformation)
2. for (int k = 0; k < n; k++) // For each column
3. {
4.     // Compute 2-norm of k-th column
5.     float sum = 0.0;
6.     for (int r = k; r < m; r++)
7.         sum += qr[r * n + k] * qr[r * n + k];
8.
9.     float nrm = sqrtf(sum);
10.
11.     if (nrm != 0.0)
12.     {
13.         // Compute the kth Householder vector.
14.         if (qr[k * n + k] < 0)
15.         {
16.             nrm = -nrm;
17.         }
18.         for (int i = k; i < m; i++)
19.         {
20.             qr[i * n + k] /= nrm;
21.         }
22.         qr[k * n + k] += 1.0;
23.
24.         // Apply transformation to remaining columns.
25.         for (int j = k + 1; j < n; j++)
26.         {
27.             float s = 0.0;
28.             for (int i = k; i < m; i++)
29.             {
30.                 s += qr[i * n + k] * qr[i * n + j];
31.             }
32.             s = (-s) / qr[k * n + k];
33.             for (int i = k; i < m; i++)
34.             {
35.                 qr[i * n + j] += s * qr[i * n + k];
36.             }
37.         }
38.     }
39.     d[k] = -nrm;
40. }
This implementation is based on the Householder QR factorisation algorithm, which in the
central part has the following operation count per iteration [26]:
Dot products (Lines 7 and 30): $2(m - k)(n - k)$
Outer product (Lines 14-22): $(m - k)(n - k)$
Subtraction (Line 35): $(m - k)(n - k)$
Including the outer loop, the total running time is:
\[ \sum_{k=1}^{n} 4(m - k)(n - k) \sim 2mn^2 - \frac{2n^3}{3} \]
This shows that the running time is $O(n^3)$, and that the running time grows faster than the
data size.
Pivoting can be used to increase numerical stability, but for simplicity, this has not been
included in this implementation.
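For experimentation, the core loop can be wrapped in a self-contained C++ function. The sketch below is a direct transcription under a few assumptions that are not part of the listing: the name householder_qr, the use of std::vector for the arrays qr and d, and the added size checks.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Householder QR of an m x n row-major matrix stored in 'qr' (m >= n).
// On return, the part of 'qr' above the diagonal holds R, the diagonal and
// the part below it hold the Householder vectors, and 'd' holds the diagonal of R.
void householder_qr(std::vector<float>& qr, std::vector<float>& d, int m, int n)
{
    assert(m >= n);
    assert(static_cast<int>(qr.size()) == m * n);
    assert(static_cast<int>(d.size()) == n);

    for (int k = 0; k < n; k++)                 // for each column
    {
        float sum = 0.0f;                       // squared 2-norm of column k
        for (int r = k; r < m; r++)
            sum += qr[r * n + k] * qr[r * n + k];
        float nrm = std::sqrt(sum);

        if (nrm != 0.0f)
        {
            if (qr[k * n + k] < 0.0f)           // choose the sign that avoids cancellation
                nrm = -nrm;
            for (int i = k; i < m; i++)         // k-th Householder vector
                qr[i * n + k] /= nrm;
            qr[k * n + k] += 1.0f;

            for (int j = k + 1; j < n; j++)     // apply to the remaining columns
            {
                float s = 0.0f;
                for (int i = k; i < m; i++)
                    s += qr[i * n + k] * qr[i * n + j];
                s = -s / qr[k * n + k];
                for (int i = k; i < m; i++)
                    qr[i * n + j] += s * qr[i * n + k];
            }
        }
        d[k] = -nrm;
    }
}
```

For the 2 × 2 matrix with rows (3, 1) and (4, 2), this should give a diagonal of about (-5, -0.4) and an off-diagonal entry of about -2.2, so that $R^T R$ equals $A^T A$ up to rounding.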
8.1.2 Parallelism
QR-decomposition shares the structure of the outer loop with LU-decomposition, meaning that
the order of the outer loop is important and requires sequential processing. The algorithm can
be divided into the following tasks:
// Tasks in the algorithm for QR Decomposition
for (int k = 0; k < n; k++) // For each column
{
    // Task 1: Compute 2-norm of k-th column

    // Task 2: Compute the kth Householder vector.

    // Task 3: Apply transformation to remaining columns.
}
Each of the tasks above can be performed with a varying degree of parallelism. So there is a
parallel potential for this algorithm, and the details will be described later.
8.2 Simple algorithm
A simple version of QR-decomposition was implemented and tested. The goal was to port the
algorithm to the Cuda architecture as quickly as possible. Later, once the implementation is
functional, I will look at how to increase performance.
8.2.1 The algorithm
The sequential algorithm shown in paragraph 8.1.1 was implemented using regular C++, to
target the CPU architecture. This implementation was used in the correctness test.
The GPU accelerated simple version was implemented based on the analysis from paragraph
8.1.2, using the three identified tasks. To make the implementation of the three tasks as
easy as possible, the same procedure is used for all tasks: each task is handled by a
single thread block that holds 128 threads. The drawback is that when the problem size
becomes smaller than the number of threads, which happens when $k > m - 128$ for tasks 1 and
2, a number of threads are left idle. This only has an effect when
the last rows and columns are being processed, and for large matrices this constitutes a
relatively small part of the total running time.
Task 1 - Two-norm
The data size for the $k$-th step is $d = m - k$, which is the number of remaining rows. In the
sequential implementation, the running time is proportional to $d$. In this version, one thread
block with 128 threads processes any size $d$, meaning the running time is proportional
to $d/128$.
Task 2 - Householder vector
Each element in the Householder vector can be calculated independently, and this task
processes the remaining rows. The data size for the $k$-th step is $d = n - k$ (which equals
the number of remaining rows for the square matrices used in the tests). The sequential
implementation has a running time proportional to $d$, while this version, like task 1, has a
running time proportional to $d/128$.
Task 3 – Transform columns
When tasks 1 and 2 have been performed, the rest of the matrix must be updated, which is the
remaining columns and rows. Each column can be updated independently, and in the $k$-th step
each of the 128 threads processes $(n - k)/128$ columns.
8.2.2 Test and results
Tests were performed with data sizes 400, 2000, 4000 and 6000, where the data size is both
the height and the width of the matrix being decomposed. Matrix elements are randomly generated.
All systems were used in the tests and the CPU was used to calculate a reference result. The
CPU results indicate that the CPU is slower when the matrix gets bigger. This makes good
sense, as the CPU is not able to exploit its caches for large problem sizes. The performance of
the GPU gets better when the matrix size increases up to about 2000, then the
performance evens out. System #3 is the only architecture that performs better than
the CPU, so this specific implementation does not yield an acceptable performance on the
GPU.
The maximum difference from the CPU reference result was 0.000049, so the implementation
is considered acceptably accurate.
[Figure: performance in gigaflops against data size (0 to 6000) for System #1, System #2, System #3, System #4 and the CPU.]
GPU.NET
QR-decomposition is not tested with GPU.NET for the same reasons as described in the LU-
decomposition chapter.
8.3 Optimisation
Task 1 - Two-norm
This task can be improved by using a parallel reduction algorithm. Doing so makes it possible
to decrease the asymptotic running time to $O(\log(d))$.
Task 2 - Householder vector
Each $k$-th step has a data size of $d = n - k$, to which the running time of the sequential
implementation is proportional. Each element in the Householder vector can be calculated
independently, making the asymptotic parallel running time about $d/p$, where $p$ is the
number of processors. The maximum possible number of processes is $d$, making the
asymptotic parallel running time $O(1)$ if a minimum of $n - 1$ processors is available; if not,
the asymptotic running time is $O(d)$.
Task 3 – Transform columns
The number of operations required to update the remaining columns and rows of the matrix
equals:
\[ 2 \cdot d \cdot (m - k) \]
where $d$ represents the number of columns $n - k$ that can be processed in parallel. So the
parallel running time is:
\[ 2 \cdot \frac{d}{p} \cdot (m - k) \]
where $p$ is the number of processors. If a minimum of $n$ processors are available, the running
time for each step is proportional to $2 \cdot (m - k)$, where $m - k$ is the number of rows.
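The parallel reduction suggested for task 1 can be illustrated on the host. The sketch below emulates the tree-shaped reduction sequentially: every pass of the outer loop corresponds to one parallel step in which all additions are independent. The function name sum_of_squares_tree and the padding to a power of two are assumptions made for the illustration, not the kernel used in the project.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of squares of 'x' using a tree-shaped reduction.
// 'steps' receives the number of reduction rounds, which is ceil(log2(d))
// for d elements. On a GPU, the inner loop would be executed by 'stride'
// threads at once, with a synchronisation point between rounds.
float sum_of_squares_tree(const std::vector<float>& x, int& steps)
{
    std::size_t size = 1;
    while (size < x.size())                 // pad to the next power of two
        size *= 2;

    std::vector<float> buf(size, 0.0f);
    for (std::size_t i = 0; i < x.size(); ++i)
        buf[i] = x[i] * x[i];

    steps = 0;
    for (std::size_t stride = size / 2; stride > 0; stride /= 2)
    {
        for (std::size_t i = 0; i < stride; ++i)
            buf[i] += buf[i + stride];      // independent additions within a round
        ++steps;
    }
    return buf[0];
}
```

For five elements the buffer is padded to eight entries and the sum is complete after three rounds, compared with five sequential additions.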
8.3.1 Test and results
With these optimisations implemented, the tests were performed again for 400 × 400,
2000 × 2000, 4000 × 4000 and 6000 × 6000 matrices.
The optimisations have increased performance on all systems. Matrices larger than
approximately 1000 × 1000 now benefit from being computed using the Cuda architecture.
The peak performance for system #3 reached 1.42 gigaflops, not as impressive as the results
achieved by matrix-multiplication and LU-decomposition, but still about 5 times as fast as the
CPU. One of the reasons for this low performance is that the algorithm is not that well suited
for a parallel architecture.
8.4 Block QR-decomposition
The algorithm used so far relied heavily on vector operations. Matrix operations are much
better at exploiting a parallel architecture, and such an algorithm has been designed by Susan
Ostrouchov et al. [27].
8.4.1 The block algorithm
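By partitioning an $M \times N$ matrix A, the factorisation QR may be partitioned as shown [27]:
\[ A = \begin{pmatrix} A_1 & A_2 \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \]
where $b \times b$ is the block size, $A_1$ is an $M \times b$ matrix containing the first $b$ columns, and
$A_2$ is an $M \times (N - b)$ matrix containing the remaining columns. $A_{11}$ is $b \times b$, $A_{12}$ is
$b \times (N - b)$, $A_{21}$ is $(M - b) \times b$ and $A_{22}$ is $(M - b) \times (N - b)$.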
[Figure: performance of the optimised QR-decomposition (section 8.3.1) in gigaflops against data size (0 to 6000) for System #1, System #2, System #3, System #4 and the CPU.]
\[ A_1 = \begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} \quad \text{and} \quad A_2 = \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} \]
The first step is to perform a regular QR factorisation on $A_1$ using a series of Householder
transformations of the form:
\[ H_i = I - \tau_i v_i v_i^T \]
where $i = 1, \ldots, b$. The vector $v_i$ is of length $M$, where the first $i - 1$ elements are 0 and
the $i$-th element is 1.
\[ \tau_i = \frac{2}{v_i^T v_i} \]
It can be shown that:
\[ Q = H_1 H_2 \cdots H_b = I - V T V^T \]
So in step 2, the triangular factor $T$ of the block reflector $Q$ is calculated. The triangular
factor is used together with the transformation above to update the remaining matrix.
In $M/b$ steps the matrix A is decomposed using block QR-decomposition, as depicted
here. The white parts have already been solved.
Figure 19 - Matrix A being decomposed by block QR-decomposition in steps.
Figure 19 shows the $M \times N$ matrix; $k$ is the current step, and the block width is the
dimension of the current block being processed (here the green and purple sub-matrix). Step
1 QR-decomposes the green and purple sub-matrix by regular QR-decomposition, then the
triangular factor is used to transform the remaining columns in the cyan and blue
sub-matrix. The steps are then continued for the remaining parts until the whole matrix is
processed.
This algorithm makes it possible to partition large matrices and solve smaller parts.
Furthermore, it relies on matrix operations, which can utilise the parallel Cuda architecture.
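One common way to form the triangular factor of a block reflector is a column-by-column recurrence: if $Q_j = I - V_j T_j V_j^T$ covers the first $j$ reflectors, then appending $H_{j+1}$ adds the column $-\tau_{j+1} T_j V_j^T v_{j+1}$ above the new diagonal entry $\tau_{j+1}$. The C++ sketch below follows that recurrence. It is an illustration under stated assumptions (row-major storage, explicitly stored vectors, the hypothetical name build_t_factor) and is not the implementation attempted in this project.

```cpp
#include <cassert>
#include <vector>

// Builds the b x b upper triangular factor T such that
//   H_1 * H_2 * ... * H_b = I - V * T * V^T,  with  H_i = I - tau[i] * v_i * v_i^T.
// V is M x b, row-major, and column i holds the full vector v_i.
// The returned T is b x b, row-major.
std::vector<float> build_t_factor(const std::vector<float>& V,
                                  const std::vector<float>& tau,
                                  int M, int b)
{
    assert(static_cast<int>(V.size()) == M * b);
    assert(static_cast<int>(tau.size()) == b);

    std::vector<float> T(b * b, 0.0f);
    for (int i = 0; i < b; ++i)
    {
        // w = V(:, 0..i-1)^T * v_i
        std::vector<float> w(i, 0.0f);
        for (int j = 0; j < i; ++j)
            for (int r = 0; r < M; ++r)
                w[j] += V[r * b + j] * V[r * b + i];

        // New column above the diagonal: -tau_i * T(0..i-1, 0..i-1) * w
        for (int r = 0; r < i; ++r)
        {
            float z = 0.0f;
            for (int c = r; c < i; ++c)     // T is upper triangular
                z += T[r * b + c] * w[c];
            T[r * b + i] = -tau[i] * z;
        }
        T[i * b + i] = tau[i];
    }
    return T;
}
```

A small case that can be checked by hand: with $v_1 = (1, 1)$, $v_2 = (0, 1)$, $\tau_1 = 1$ and $\tau_2 = 2$, the factor has first row (1, -2) and second row (0, 2), and $I - V T V^T$ equals the product $H_1 H_2$.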
8.4.2 Implementation
Implementation of this block algorithm has been challenging. The structure of the algorithm
resembles the block LU-decomposition, which was implemented and performed well. This
made me hope that the block QR algorithm also could be implemented and perform well.
Unfortunately this has not been the case.
Implementation of the algorithm was initially attempted for CPU processing. Thorough
debugging and testing have revealed that most of the algorithm works and generates the
expected result. Regrettably, not all parts work as hoped. The problematic part is related to
this transformation:
\[ Q = H_1 H_2 \cdots H_b = I - V T V^T \]
This transformation can be divided into three steps. Using the explanation and figures from
above:
\[ W \leftarrow V^T A_2 \]
This transformation is a matrix-multiplication between the transposed block of Householder
vectors and the sub-matrix $A_2$, consisting of the remaining columns in the matrix A. The result
is then written to the $b \times (N - b)$ matrix $W$.
\[ W \leftarrow T^T W \]
Then the triangular factor $T$ of the block reflector $Q$ should be computed, after which its
transpose should be matrix-multiplied with the existing matrix $W$.
\[ \tilde{A}_2 \leftarrow \begin{pmatrix} \tilde{R}_{12} \\ \tilde{R}_{22} \end{pmatrix} = A_2 - V W \]
When the final matrix $W$ is computed, it is used in a matrix-multiplication with the block
Householder vectors. The elements are then subtracted from the sub-matrix $A_2$, which should
give $\tilde{A}_2$.
These steps should then be repeated until the complete matrix has been decomposed.
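The three steps can be written as three small matrix products. The following C++ sketch assumes that the block of Householder vectors and the triangular factor are already available as explicit row-major arrays; the function name apply_block_reflector and the parameter names are hypothetical, and the code is a plain reference rather than an optimised or verified part of this project.

```cpp
#include <cassert>
#include <vector>

// Updates A2 (M x cols, row-major) in place with the three steps
//   W <- V^T * A2,   W <- T^T * W,   A2 <- A2 - V * W,
// where V is M x b (row-major) and T is b x b upper triangular (row-major).
void apply_block_reflector(const std::vector<float>& V,
                           const std::vector<float>& T,
                           std::vector<float>& A2,
                           int M, int b, int cols)
{
    assert(static_cast<int>(V.size()) == M * b);
    assert(static_cast<int>(T.size()) == b * b);
    assert(static_cast<int>(A2.size()) == M * cols);

    // Step 1: W = V^T * A2, a b x cols matrix.
    std::vector<float> W(b * cols, 0.0f);
    for (int i = 0; i < b; ++i)
        for (int j = 0; j < cols; ++j)
            for (int r = 0; r < M; ++r)
                W[i * cols + j] += V[r * b + i] * A2[r * cols + j];

    // Step 2: TW = T^T * W. Entry (i, c) of T^T is T(c, i), nonzero only for c <= i.
    std::vector<float> TW(b * cols, 0.0f);
    for (int i = 0; i < b; ++i)
        for (int j = 0; j < cols; ++j)
            for (int c = 0; c <= i; ++c)
                TW[i * cols + j] += T[c * b + i] * W[c * cols + j];

    // Step 3: A2 = A2 - V * TW.
    for (int r = 0; r < M; ++r)
        for (int j = 0; j < cols; ++j)
            for (int i = 0; i < b; ++i)
                A2[r * cols + j] -= V[r * b + i] * TW[i * cols + j];
}
```

All three steps are matrix-multiplications, which is what makes the block algorithm attractive for the Cuda architecture. On a small example the result can be compared against applying the individual Householder transformations one at a time.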
Numerous attempts at calculating the triangular factor $T$ have failed, however, and the
resulting matrix and array contain wrong values.
8.5 Evaluation
The algorithms for LU- and QR-decomposition have a similar structure, so ideas from the LU
implementation were also applied to the QR implementations.
Optimising the running time of the different tasks increased performance by a
factor of 3.84.
Unfortunately, due to the incomplete implementation of the block QR algorithm, no further tests
were performed. This means that the full potential of QR on the Cuda architecture is still
to be unfolded.
Jack Dongarra, Susan Ostrouchov and others have designed this block QR algorithm. They are
highly competent people who have made contributions to Eispack, Linpack, BLAS, Lapack and
ScaLapack. The challenge with the $T$ factor is more than likely related to my implementation
rather than the algorithm.
9 Evaluation
The optimisation strategy described some methods and techniques that could be applied
when improving the implementation of the linear algebra algorithms. This evaluation
chapter will summarise the findings and evaluate the strategy.
9.1 Cuda
The Nvidia profiler can show relevant counters for both arithmetic and memory performance.
CGMA source code analysis can give valuable information about memory bandwidth as a
limiting factor.
The results from the tests suggest that a block linear algorithm is best suited for the Cuda
architecture. Such an algorithm is designed to divide data into sizes that fit into caches, such
as shared memory.
When implementations are to be optimised, the findings from this project suggest that tiling
is the best strategy, followed by latency hiding and coalescing memory access.
With regards to coalescing memory access, it should be mentioned that GPU architecture
designers are aware of the importance of this limiting factor, so newer GPUs are designed
with built-in optimised memory access. The impact of non-coalesced memory access should
therefore be of less importance in the future, and hence make porting of existing algorithms
easier.
In addition to the points above, here is a list with recommendations based on the findings of
the tests performed in this project:
• Avoid using structures as parameters in the kernel definitions, use instead simple
types or pointers thereof.
• Target the highest possible Compute Capability level. Among other things, the
precision of instructions is better and the result will be more accurate.
• Unroll loops, by making the threads fine-grained. Generation and thread scheduling
are cheap.
• Thread block size should be a multiple of the warp size (Currently 32).
• Be aware of the overhead for invoking a kernel.
• Note that default instructions deviate from IEEE 754, use specific IEEE 754 functions
for increased precision, but at the cost of speed.
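The recommendation about thread block size can be turned into two small helper functions for choosing launch dimensions. This is a sketch with hypothetical names (round_up_to_warp, blocks_needed); the warp size of 32 is the value stated in the list above.

```cpp
#include <cassert>

const int kWarpSize = 32;   // warp size stated in the recommendations above

// Rounds a desired number of threads per block up to a multiple of the warp size.
int round_up_to_warp(int threads)
{
    assert(threads > 0);
    return ((threads + kWarpSize - 1) / kWarpSize) * kWarpSize;
}

// Number of thread blocks needed so that every element gets one thread.
int blocks_needed(int elements, int threadsPerBlock)
{
    assert(elements > 0 && threadsPerBlock > 0);
    return (elements + threadsPerBlock - 1) / threadsPerBlock;
}
```

For example, a request for 100 threads per block becomes 128, and 1000 elements with 128 threads per block need 8 blocks. The kernel must then check its index against the data size, since the last block contains threads without an element.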
Besides the list and suggestions above, there were also methods with doubtful results:
• The Volkov suggestion yielded performance gains on some systems, but lower on
others. Can be useful for low occupancy kernels, but should be tested and evaluated.
• Data prefetching can both increase and lower performance.
The underlying hardware and its capabilities play an important role in whether an optimisation
technique affects performance. Some methods have a positive effect on some GPUs, and a
negative effect on others. Analysing and testing should therefore always be performed.
9.2 GPU.NET
GPU.NET v1.0.3.5 was not mature and suffered from several bugs. The number of problems
means it cannot be recommended for production environments. However, the latest release is
v2.0.14, which solves many of the bugs and problems I encountered.
The JIT compilation of kernels is a design decision that applies to all current versions of
GPU.NET. A JIT compilation is cached in-memory, and subsequent calls from the same process
will be served from this cache. It is therefore recommended, when using GPU.NET for large or
numerous problems, to warm-up both Cuda and GPU.NET. Do this by calling the kernel with a
small data size, subsequent calls will then be served faster.
10 Discussion and future work
This chapter begins with a discussion of the work and results in this project, and which fields
could be further researched. Then, the future of Cuda is discussed in comparison with GPU
code generation tools. After which a broader perspective on hardware development and
GPGPU in general, is discussed.
10.1 Project
A more thorough correctness test and analysis could further clarify the numerical stability of
the implementations used in this project, for example by comparing this project's results with
results from the widely recognised Matlab.
For this project I insisted on implementing all parts of the algorithms. A lot of work and
research has gone into the development of standard math libraries supporting the BLAS
interface. Implementing all parts gave valuable insight into the inner workings of the
algorithms, but possibly at the expense of performance. Using these libraries (e.g. Cublas or
Cula) could reveal the full performance potential of the different algorithms on the Cuda
architecture.
Testing the performance of other linear algebra algorithms could serve as a frame of reference.
For example, how would Givens rotations affect the performance of QR-decomposition
compared with the Householder transformation method chosen? A more thorough analysis and
testing of the QR block algorithm would also be beneficial.
The optimisation strategy and the optimisation experiences could be applied on several other
linear algebra algorithms. An obvious extension would be the Singular Value Decomposition
(SVD).
10.2 Cuda
Cuda C and GPU.NET currently represent two different directions for utilising the Cuda
architecture. Cuda C is C/C++ and complex, whereas GPU.NET is .NET, uses code generation,
and is easier to use. You might say that GPU.NET is for developers who, without too much
trouble, want to accelerate their applications using parallel architectures. Cuda C, on the
other hand, is for developers who are not intimidated by C/C++ and tweaking.
Cuda C offers more flexibility, which enables better optimisation and higher performance,
but it does not have to be a choice between the two sets of advantages. Cudafy.NET is a set of
libraries and tools supporting both directions.
Cudafy.NET can be used in the same way as GPU.NET, using full code generation. But it can
also just work as a bridge from .NET to Cuda C kernels. Cuda C optimisations are then
possible, while the invocation is carried out by the .NET runtime.
Uncovering the performance characteristics of Cudafy.NET e.g. using the linear algebra
algorithms from this project could be another valuable next step.
It is expected that Cuda will continuously be improved, e.g. by making the NVCC support C++
language features in kernels, allow better debugging in Nsight, and increase the language
support features in IDEs, to make development smoother.
With Cuda v4.0 the tools and drivers have been updated, and now enable a grid of machines
and GPUs to work together to solve large problems. This makes Cuda able to solve even larger
problems than with former versions.
10.3 Hardware
The newer Cuda GPUs are becoming increasingly accurate, meaning the instructions are
performed with better numerical precision, at even faster speeds. Double-precision
instructions have been supported from Compute Capability 1.3, and it is expected that these
as well will become more precise and faster.
The future will surely also bring GPUs with even more cores and faster memory. Currently the
architecture of the Nvidia GF100 chips supports up to 512 cores, but the dedicated GPU
computing system Tesla S2050 has 4 GPUs with a total of 1,792 cores. Nvidia is not the only
player when it comes to GPGPU. AMD has the FireStream architecture, and the top model
FireStream 9370 has 1,600 cores delivering 2,640 gigaflops.
Looking at the latest “TOP500 supercomputers list”, three of the top five are using Nvidia
GPUs. So Nvidia is a strong player, and I expect Nvidia and Cuda to play an important role in
the GPGPU field in the future.
10.4 Future of GPGPU
GPGPU development has for a long time been limited to first movers that saw a potential in
the high processing power that GPUs offer. Currently GPGPU is often used, where many
computations are needed. For example simulations of fluid or weather forecasting, or the
prediction of protein folding used by the pharmaceutical industry. But another and more
subtle application is slowly emerging.
A graphics card with a high performing GPU is a relatively cheap commodity, and many
regular computer systems are today equipped with a high performing GPU. Some application
developers have spotted this opportunity and now allow their application to be optionally
accelerated by the GPU. This is often completely transparent to the end user, but delivers
improved application responsiveness, which gives the user a better experience.
Applications that currently exploit this possibility include, but are not limited to, different
browsers, such as Internet Explorer, Chrome and Firefox, and different video editing
applications.
11 Conclusion
Three frequently used linear algebra algorithms, matrix-multiplication, LU-decomposition and
QR-decomposition, were chosen for this project. They were described, analysed, and then
initially implemented using C/C++ for the CPU architecture.
The Cuda architecture and development platform were subsequently analysed and described.
Important features, characteristics and limitations were uncovered and an optimisation
strategy was formed.
Based on the analysis of the linear algebra algorithms and Cuda, implementation procedures
were designed. Then the algorithms were implemented targeting the Cuda architecture and
using C/C++ and Cuda C, after which they were tested. During this process different findings
were made, which were subsequently used in combination with the Cuda optimisation
strategy to improve performance.
GPU.NET was used, where applicable, as a perspective on how to use Cuda from .NET.
Correctness tests were performed by comparing the results from the CPU with the results
from the GPU. The maximum differences documented the accuracy of the different
algorithms processed on various systems and GPUs.
The learning goals have all been achieved and the complete process has been documented in
this report.
Bibliography and references
1. Mikkel Bundgaard-Ovesen, Documentation of the GPUs usability in advanced parallel
calculations, 15th December 2010
2. Nvidia, Nvidia Cuda C Programming Guide, 9th November 2010
3. Desmond Fearnley-Sander, Hermann Grassmann and the creation of linear algebra,
December 1979
4. David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors, 2010
5. Jason Sanders and Edward Kandrot, Cuda by example – an introduction to General-
Purpose GPU Programming, 2011
6. Vasily Volkov, Better Performance at Lower Occupancy (slides), 22nd September 2010
7. Vasily Volkov, Use registers and multiple outputs per thread on GPU (slides), 30th
June 2010
8. Geekbench, Performance of an Intel Pentium 4 3.06 GHz running Linux, Downloaded
3rd June 2011 (http://browse.geekbench.ca/geekbench2/view/209683)
9. Nvidia, Cuda C Best practices Guide, 20th September 2010
10. Paulius Micikevicius (Nvidia), Analysis-Driven Optimization (slides), 14th November
2010
11. Sara Robinson, Toward an Optimal Algorithm for Matrix Multiplication, November
2005
12. Ananth Grama et al., Introduction to Parallel Computing, 2nd edition, 26th January
2003
13. Mary Jane Sterling, Linear Algebra for Dummies, 2009
14. Brian W. Kernighan and Dennis M. Ritchie, C Programming Language, 2nd edition, 1st
April 1988
15. John J. Barton and Lee R. Nackman, Scientific and Engineering C++: An Introduction
With Advanced Techniques and Examples, 19th August 1994
16. Jens Eising, Lineær Algebra, 1999
17. G. W. Stewart, Afternotes on Numerical Analysis, 1996
18. E. E. Santos and M. Muraleetharan, Analysis and Implementation of Parallel LU-
Decomposition with Different Data Layouts, June 2000
19. Prof. Michael T. Heath, Parallel Numeric Algorithms: LU-Decomposition (slides), 2010
20. Vasily Volkov and James W. Demmel, Benchmarking GPUs to Tune Dense Linear
Algebra, November 2008
21. Vasily Volkov and James W. Demmel, LU, QR and Cholesky Factorisations using Vector
Capabilities of GPUs, 2008
22. Jack Dongarra et al., Derivation of a Block Algorithm for LU Factorization, 9th
February 1997
23. Peter J. Olver, Orthogonal Bases and the QR Algorithm, 5th June 2010
24. Prof. Michael T. Heath, Parallel Numerical Algorithms: QR-Factorization (slides),
2010
25. Walter Gander, Algorithms for the QR-Decomposition, April 1980
26. Radu Trîmbitas, Householder Reflectors and Givens Rotations: Why orthogonality is
fine, 11th March 2009
27. Susan Ostrouchov, QR Factorization (a block algorithm), 28th April 1995
Appendix A – Project evaluation
The initial problem definition about linear algebra algorithms was updated during the project
period. In agreement with my supervisor Peter Sestoft, we decided to focus on linear algebra
algorithms for matrix-multiplication, and LU- and QR-decomposition. This clarification
enabled me to focus on analysing Cuda features and limitations. My assessment is that it
made it possible to uncover Cuda characteristics in a level of detail that would otherwise
not have been possible.
I am satisfied with the result of the project. The problem definition and learning goals were
all fully met, and the process and findings are all described in the report. Not
everything went without challenges, however, so let me elaborate.
To implement an algorithm, a full comprehension of the algorithm and its inner
workings is necessary. This proved to be severely complicated with regard to LU- and
QR-decomposition. On top of the linear algebra complications came the difficulty of using a
new development architecture and programming language.
I can best describe this by comparing it to building a house of cards. The implementation
phase is represented by the last third of the house. So before being able to build the top of
the house, one needs to build the first 2/3, and before that, one needs to determine where
the house should be based.
My initial lack of knowledge of linear algebra, C and C++ meant that many resources were
invested in learning and gaining abilities. In spite of the initial research phase, I did
encounter situations during the project period, where my knowledge still did not suffice. As
mentioned above, this applied specifically to LU- and QR-decomposition. With QR it was
specifically the block algorithm that was difficult to comprehend.
During the 6 months that the project period lasted, I learned a great deal. Learning goals
covering linear algebra and algorithms, together with Cuda, C and C++, defined the areas in
which I wanted to extend my knowledge. As mentioned, I had only minor experience and no
qualifications in these fields prior to this project. So the learning requirement was high,
and the learning curve was steep, but I am satisfied with the result, and the knowledge I
have gained will be beneficial in the future.
Appendix B – Implementation considerations
When the Cuda platform is utilised for processing, a new computing environment is
introduced into development and runtime. The host refers to the code and memory of the
CPU and the device refers to the code and memory of the GPU. Code and functionality
that exhibit little or no data parallelism are implemented in host code. Code and
functionality that exhibit a rich amount of data parallelism are implemented in device code
[4].
The host and device are two runtime environments that work independently. Communication
between the host and device is necessary, as the CPU would otherwise not be able to
harness the GPU power of the Cuda architecture. In Cuda, the host is responsible for this
communication, which includes structuring data, allocating and releasing memory on the
device, copying data to and from the device, and invoking the device kernel.
In addition to this, the host is responsible for configuring the device execution
environment, which basically means specifying the number of threads the architecture should
spawn to solve a problem. The Cuda architecture allows, as shown in the following figure,
threads to be organised in blocks, and blocks to be organised in a grid.
Figure 20 - Cuda thread organisation [4]
Cuda thread organisation
A kernel is mapped to a grid, which is organised by blocks in two dimensions, and a block can
hold threads in three dimensions. In the device kernel, blocks and threads are identified by
the following built-in variables:
Variable       Description
gridDim.x      Number of blocks in the first dimension of the grid. Valid range: 1-65535.
gridDim.y      Number of blocks in the second dimension of the grid. Valid range: 1-65535.
blockDim.x     Number of threads in the first dimension of the block. Valid range: 1-512.
blockDim.y     Number of threads in the second dimension of the block. Valid range: 1-512.
blockDim.z     Number of threads in the third dimension of the block. Valid range: 1-64.
blockIdx.x     Position of the current block in the first dimension of the grid. Valid range: 0 to gridDim.x - 1.
blockIdx.y     Position of the current block in the second dimension of the grid. Valid range: 0 to gridDim.y - 1.
threadIdx.x    Position of the current thread in the first dimension of the block. Valid range: 0 to blockDim.x - 1.
threadIdx.y    Position of the current thread in the second dimension of the block. Valid range: 0 to blockDim.y - 1.
threadIdx.z    Position of the current thread in the third dimension of the block. Valid range: 0 to blockDim.z - 1.
Table 10 - Cuda built-in variables
Why has Nvidia designed a thread structure in up to five dimensions? Would it not be easier to
just use a single dimension?
For simple algorithms that only require a thread structure in one dimension, a single
dimension can be used. But there exist problems that naturally belong to a space of two
dimensions or more, e.g. a matrix. The structure is optional, meaning that the developer,
within some hardware limitations, decides how many dimensions are used.
The total number of threads is a result of the following:
$$\text{Threads} = \text{gridDim.x} \cdot \text{gridDim.y} \cdot \text{blockDim.x} \cdot \text{blockDim.y} \cdot \text{blockDim.z}$$
where $\text{blockDim.x} \cdot \text{blockDim.y} \cdot \text{blockDim.z}$ cannot exceed the GPU's limit on the total
number of threads per block. For most GPUs this limit is 512 [5].
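The thread count calculation can be sketched on the host side in plain C++. This is a minimal sketch and not code from the project: the `Dim3` struct is a hypothetical stand-in for Cuda's `dim3` type, the function names are made up for illustration, and the default limit of 512 threads per block is simply the value quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for Cuda's dim3 launch dimensions.
struct Dim3 {
    unsigned int x, y, z;
    Dim3(unsigned int x_ = 1, unsigned int y_ = 1, unsigned int z_ = 1)
        : x(x_), y(y_), z(z_) {}
};

// Threads in one block: blockDim.x * blockDim.y * blockDim.z.
inline std::uint64_t threadsInBlock(const Dim3& block) {
    return static_cast<std::uint64_t>(block.x) * block.y * block.z;
}

// Total threads of a launch: gridDim.x * gridDim.y blocks, each holding
// threadsInBlock(block) threads.
inline std::uint64_t totalThreads(const Dim3& grid, const Dim3& block) {
    assert(grid.z == 1);  // the text describes a two-dimensional grid
    return static_cast<std::uint64_t>(grid.x) * grid.y * threadsInBlock(block);
}

// True when the block respects the per-block thread limit
// (assumed to be 512, the value quoted for most GPUs in the text).
inline bool blockFits(const Dim3& block, std::uint64_t maxThreadsPerBlock = 512) {
    return threadsInBlock(block) <= maxThreadsPerBlock;
}
```

For example, a grid of 4 x 2 blocks with 16 x 16 x 2 threads each gives 8 blocks of 512 threads, which is exactly at the assumed limit.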
The size of the grid and blocks is often defined directly in the source code, but the optimal
size is in many cases directly dependent on the data size. This is not very flexible, as it means
that grid and block size would have to be adjusted, in the source code, for different data
sizes, and afterwards recompiled before execution.
There are different solutions to this. One way is to set the number high to cover most cases.
In the kernel one would have to check if the current thread actually has data to process like
so in line 6:
1.  __global__ void kernel(float *data, int dataSize) {
2.
3.      // Thread ID
4.      int tid = threadIdx.x + blockIdx.x * blockDim.x;
5.
6.      if (tid < dataSize) {
7.
8.          // Process data
9.      }
10. }
This is inefficient as many threads will be spawned but without any actual data to process.
Another way is to define the number of threads per block (e.g. 128), and then calculate the
number of required blocks from the data size. This makes sure that at most (threadsPerBlock-
1) threads are created without any data to process.
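The second approach amounts to a ceiling division on the host. The sketch below is an illustration only; the function names `blocksFor` and `idleThreads` are hypothetical and do not come from the project code.

```cpp
#include <cassert>

// Number of blocks needed so that blocks * threadsPerBlock >= dataSize.
// Written as quotient plus remainder check to avoid overflow for large sizes.
inline unsigned int blocksFor(unsigned int dataSize, unsigned int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return dataSize / threadsPerBlock + (dataSize % threadsPerBlock != 0 ? 1u : 0u);
}

// Threads that are launched without any data to process.
// This is always at most threadsPerBlock - 1.
inline unsigned int idleThreads(unsigned int dataSize, unsigned int threadsPerBlock) {
    return blocksFor(dataSize, threadsPerBlock) * threadsPerBlock - dataSize;
}
```

With 1000 data elements and 128 threads per block, this gives 8 blocks and 24 idle threads. The bounds check in the kernel is still needed for those 24 threads.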
A third way is to calculate the grid and block size dynamically from the data size; this is
however difficult as the optimal setting is influenced by both the data size and the structure
of the algorithm.
Either the second or the third method can prove feasible. Both have pros and cons, and
which method to use should be determined on a case-by-case basis.
SIMT and warp size
As mentioned earlier, threads are organised in blocks. But this is not the only organisation;
each block is partitioned into warps. A warp is a bundle of 32 threads that are executed in
parallel.
The threads of a warp execute the same instruction at the same time, hence Cuda is a Single
Instruction, Multiple Threads (SIMT) architecture. This is a design decision to reduce hardware
cost and to enable optimisation techniques, and it is not without relevance to the developer.
The SIMT architecture has some implications that will be discussed later.
The size of the warp is another important aspect to take into account when defining the grid
and block size. Consider the example where a problem is organised into 20 blocks each with
10 threads, giving a total of 20 x 10 = 200 threads. Cuda executes 32 threads in a warp in
parallel. In the example above, only 10 threads are available per block. Cuda will in this case
fill up the warp with 22 empty threads, resulting in 20 x 22 = 440 empty threads being
created. The block size should therefore ideally be a multiple of 32 [4].
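The arithmetic of the example can be reproduced with a small host-side helper. This is a sketch under the assumption of a fixed warp size of 32, as stated in the text; the names `kWarpSize`, `warpsPerBlock` and `paddedThreads` are hypothetical.

```cpp
#include <cassert>

const unsigned int kWarpSize = 32;  // warp size as stated in the text

// Warps needed to hold one block; the last warp may be only partially filled.
inline unsigned int warpsPerBlock(unsigned int threadsPerBlock) {
    assert(threadsPerBlock > 0);  // a block holds at least one thread
    return threadsPerBlock / kWarpSize + (threadsPerBlock % kWarpSize != 0 ? 1u : 0u);
}

// Empty threads added across the whole launch to fill up partial warps.
inline unsigned int paddedThreads(unsigned int blocks, unsigned int threadsPerBlock) {
    const unsigned int slots = warpsPerBlock(threadsPerBlock) * kWarpSize;
    return blocks * (slots - threadsPerBlock);
}
```

Calling `paddedThreads(20, 10)` yields the 440 empty threads from the example, while a block size of 32 yields none.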
Elapsed time
Measuring elapsed time is essential to measuring performance. Normal event timing in C and
C++ is CPU based, which is insufficient when dealing with the GPU. The GPU and CPU are
physically two independent processors, which run in parallel. The Cuda toolkit provides an API
for measuring GPU events and elapsed time.
The Cuda API will be used to measure memory allocation, copy of data from host to device,
the kernel execution time, copy of data from device to host and the release of memory.
These timers will not just give the elapsed times of the different operations, but also
valuable insight into the GPU performance. It will for instance be possible to calculate
memory transfer rates as well as the actual performance of the kernel in gigaflops.
In addition, the timings can be used to measure relative performance gains
or losses when certain properties or capabilities of the Cuda architecture have been applied
to the algorithms. The GPU timing will also serve as
a base for comparison with the similar linear algebra processes on GPU.NET and the CPU.
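How an elapsed time turns into a transfer rate or a relative gain can be sketched as follows. This is an illustration only: the function names are hypothetical, the elapsed times are assumed to be given in milliseconds, and 1 GB is assumed to be 1024^3 bytes.

```cpp
#include <cassert>

// Transfer rate in GB/s from the number of bytes copied and the elapsed
// time in milliseconds. Assumes 1 GB = 1024 * 1024 * 1024 bytes.
inline double transferRateGBs(double bytes, double elapsedMs) {
    assert(elapsedMs > 0.0);
    const double gigabytes = bytes / (1024.0 * 1024.0 * 1024.0);
    return gigabytes / (elapsedMs / 1000.0);
}

// Relative performance gain of an optimised run over a baseline run.
// A value above 1 is a gain, a value below 1 is a loss.
inline double speedup(double baselineMs, double optimisedMs) {
    assert(optimisedMs > 0.0);
    return baselineMs / optimisedMs;
}
```

Copying 1 GB in 1000 ms corresponds to 1 GB/s, and a kernel that drops from 200 ms to 50 ms has a speedup of 4.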
Pinned or page-locked memory
A program that uses Cuda to harness the power of the GPU normally follows these steps:
1. Initialise
2. Copy data from host -> device
3. Process data on device
4. Copy data from device -> host
5. Release
The kernel has been the focus for optimisations and analysis so far, but there are other ways
of optimising a program using Cuda. By using pinned or page-locked memory, higher data
transfer rates can be achieved between host and device.
On platform #1 the speed of memory transfer could be increased from about 1.5 GB/sec to 5
GB/sec. But caution should be exercised when pinned memory is used; excessive use can
reduce overall system performance as page-locked memory is scarce [9].
Matrix structure
A matrix is a mathematical structure in two dimensions. In computer memory it can
either be represented by a 2-dimensional array or an array of arrays. Even though
2-dimensional structures are available in computer memory, it is better to vectorise the
matrix by aligning the rows after each other. Accessing a specific value in the vector of
matrix A, for example the element in row 3 and column 2 (counting from zero),
is performed like so: v[3 * Width + 2], where v is the vector of matrix A, and Width is the
column count of A. The Cuda architecture is designed to be stream based, so by vectorising
data for processing on the GPU platform, one uses Cuda as it was designed and intended.
For the code I use the following matrix structure to hold the vector and details about the
matrix.
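typedef struct
{
    float *n;             // pointer to the vector of float values
    unsigned int width;   // number of columns
    unsigned int height;  // number of rows
    unsigned int size;    // length of the vector (height * width)
} matrix;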
n is the pointer to the vector of float values, width is the number of columns, height the
number of rows, and size is the length of the vector (height*width).
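The row-major indexing described in this section can be sketched as a small helper. This is an illustration and not the project's code: it uses a `std::vector<float>` instead of the `matrix` struct to stay self-contained, and the names `flatIndex` and `elementAt` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index of element (row, col), both counted from zero, in a vector that
// stores the matrix rows one after another.
inline std::size_t flatIndex(std::size_t row, std::size_t col, std::size_t width) {
    assert(col < width);  // the column must lie inside the row
    return row * width + col;
}

// Reads element (row, col) from the vectorised matrix v with the given width.
inline float elementAt(const std::vector<float>& v, std::size_t row,
                       std::size_t col, std::size_t width) {
    const std::size_t i = flatIndex(row, col, width);
    assert(i < v.size());
    return v[i];
}
```

For a matrix with 5 columns, `flatIndex(3, 2, 5)` evaluates to 3 * 5 + 2 = 17, which matches the expression v[3 * Width + 2] with Width = 5.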
Appendix C – Hardware specification
description and analysis
In the following the specifications of the different machines will be described and evaluated
in terms of Cuda capabilities. The speed of the GPU and memory are measured and from that,
the memory bandwidth and gigaflops are calculated.
The GPU has historically been designed for performance and not precision, hence all gigaflops
calculations are based on single-precision float operations. Double precision was not
supported until Cuda compute capability (CC) 1.3, and then only with a significant
performance hit.
The following specifications are based on information and measurements by CPU-Z, GPU-Z
and Cpu Caps Viewer, as well as information about bus, FSB, PCI-E and more. The details are
meant to give a theoretical upper limit on performance, which can be used for comparison
with the results of the tests.
Platform #1
Apple Macbook 13” with Intel Core 2 Duo P8700 2.53 GHz processor, 4 GB DDR3 RAM at 533
MHz, an Nvidia GeForce 9400m and a front-side bus (FSB) at 1066 MHz. The GPU in the
machine has the following specifications:
Cores 16
Memory interface 128-bit
Memory bandwidth (internal/external) 8GB/sec, 16,6 GB/sec
Graphics bus interface (PCI-E v2.0) 8 GB/sec
Transistors 282 Million
Core clock 450 MHz
Shader Clock 1100 MHz
Memory Clock 1066 MHz (533 MHz double pumped)
Gigaflops 51,56
Cuda Compute Capability 1.1
Table 11 - GPU specifications for Nvidia GeForce 9400m, platform #1
Platform #2
Apple iMac 24” with Intel Core 2 Duo E8435 3.06 GHz, 4 GB DDR2 RAM at 399 MHz, an Nvidia
GeForce 8800 GS and an FSB at 1066 MHz. The GPU in the machine has the following
specifications:
Cores 96
Memory interface 256-bit
Memory bandwidth (internal) 49,94 GB/sec
Memory bandwidth (external) 6,23 GB/sec
Graphics bus interface (PCI-E v1.1) 8 GB/sec
Transistors 754 million
Core clock 500 MHz
Shader Clock 1250 MHz
Memory Clock 800 MHz
Gigaflops 234,38
Cuda Compute Capability 1.1
Table 12 - GPU specifications for Nvidia GeForce 8800 GS, platform #2
Platform #3
Platform #3 is a machine with an Nvidia Tesla C1060 GPU. The exact machine specifications
have not been available; however, the specifications for the C1060 GPU give some hints on
the performance.
Cores 240
Memory interface 512-bit
Memory bandwidth (internal) 102,4 GB/sec
Transistors 754 million
Core clock 602 MHz
Shader Clock 1300 MHz
Memory Clock 1600 MHz
Gigaflops 933,12 for Total(Mul+Add+Special Function)
622,08 for Total(Mul+Add)
Cuda Compute Capability 1.3
Table 13 - GPU specification for Nvidia Tesla C1060, platform #3
Platform evaluation
You may wonder why platform #3 has two different gigaflops counts. The first is based on the
specifications of the G80 and its descendant architectures, which state that a GPU is capable
of performing a Multiply-Add instruction dual-issued with a special function instruction per
operation cycle. The second is based on the newer Fermi architecture specifications, in which
an operation cycle can perform a dual-issued Multiply-Add instruction.
That a newer architecture is supposedly slower than an older one contradicts the
logic of development and improvement, and it is in fact not so. The G80-based architectures
are equipped with streaming processors (SP) and separate special function units (SFU). The SP
combined with the SFU theoretically gives 3 operations per clock cycle; however, basing a
gigaflops calculation on these specifications makes the result very theoretical. Calculating
the gigaflops performance this way may be correct, but does not yield an achievable
result. The reason is most likely Nvidia's competition with other GPU manufacturers to
produce the GPU with the highest gigaflops count.
Most development and testing are performed on platform #1, so this platform will serve as a
base.
Specifications
This section digs a little deeper into the specifications of the hardware and describes
the theoretical performance limits. When dealing with GPUs, the most important figures are
memory transfer rates and GPU gigaflops. The relevant elements are the chipset, the
front-side bus (FSB), memory speeds and the GPU.
Figure 21 – Block diagram of a chipset. Source: Intel
Chipset
The chipset consists of a north- and a south bridge. The north bridge is responsible for
handling the exchange of data between the CPU, memory and the graphics adapter. The
south bridge handles exchange of data with external devices like audio, network, hard discs
and USB devices. The north bridge is the most data intensive and relevant for this project,
whereas the south bridge is not used for GPU-accelerated applications.
The bus speed of platform #1 is 266 MHz with a multiplier of 4, making the rated FSB about
1066 MHz. The width of the bus is 64-bit, making the transfer rate:
$$\text{Transfer rate (GB/s)} = \frac{\text{FSB} \cdot \text{bus width}}{8 \cdot 1024}$$
The transfer rate is then 8.33 GB/s.
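The transfer rate calculation can be checked with a few lines of C++. This is a sketch under the assumptions that the rated bus speed is given in MHz, the bus width in bits, and that the result is converted from MB/s to GB/s by dividing by 1024; the function name `busTransferRateGBs` is hypothetical.

```cpp
#include <cassert>

// Peak transfer rate in GB/s from the rated bus speed in MHz and the bus
// width in bits: MHz * bytes per transfer gives MB/s, divided by 1024 gives GB/s.
inline double busTransferRateGBs(double ratedMHz, unsigned int busWidthBits) {
    assert(busWidthBits % 8 == 0);  // the width must be a whole number of bytes
    const unsigned int bytesPerTransfer = busWidthBits / 8;
    return ratedMHz * bytesPerTransfer / 1024.0;
}
```

For platform #1, `busTransferRateGBs(1066.0, 64)` gives 1066 * 8 / 1024, roughly 8.33 GB/s.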
Figure 22 - CPU and bus details of platform #1
Memory
Memory transfer occurs when data is copied from host to device and again when the result is
copied from device to host. This data is transferred via the chipset's north bridge from the
CPU/system memory to the device memory. The GPU of platform #1 has no dedicated
memory and uses the system memory. The transfer rate is for that reason equal to that of the
system memory.
The system memory consists of two DDR3 modules whose peak transfer rate is double that of
the FSB (double data rate), meaning 16.66 GB/s.
Figure 23 - System memory details of platform #1
Graphics adapter
The software GPU-Z reports that the GPU of platform #1 runs on a PCI port. But this cannot be
true, as the transfer rate would then be about 2 GB/sec. My guess is that it, as the Nvidia
specification says, runs on a PCI Express 2.0 bus interface with a peak transfer rate of
8 GB/s one way, which incidentally is the same as the memory transfer rate.
Figure 24 - GPU details of platform #1
The gigaflops count is calculated by:
$$\text{Gigaflops} = \frac{\text{GPU cores} \cdot \text{Shader MHz} \cdot \text{Operations per cycle}}{1024}$$
Platform #1 has 16 cores with a shader speed of 1100 MHz. The number of operations per cycle
is theoretically 3 (Mul+Add+SF), which in turn results in a gigaflops count of 51.56.
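The theoretical gigaflops count can be reproduced with the following sketch. It assumes the calculation multiplies the number of cores, the shader clock in MHz and the operations per cycle, and divides by 1024; the function name `theoreticalGigaflops` is hypothetical.

```cpp
#include <cassert>

// Theoretical single-precision gigaflops from the core count, the shader
// clock in MHz and the number of operations issued per cycle.
inline double theoreticalGigaflops(unsigned int cores, double shaderMHz,
                                   unsigned int opsPerCycle) {
    assert(cores > 0 && opsPerCycle > 0);
    return cores * shaderMHz * opsPerCycle / 1024.0;
}
```

With 16 cores at 1100 MHz and 3 operations per cycle this gives about 51.56, the value listed for platform #1. With 96 cores at 1250 MHz and 2 operations per cycle (Mul+Add only) it gives about 234.38, which matches the value in Table 12.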
Evaluation
Development and testing will mainly be performed on platform #1 and #2, even though they
lack the extreme computing power platform #3 possesses. The purpose of this project is to test
the applicability of GPGPU for solving different problems, and the focus is furthermore on
testing relative performance gains or losses of different optimisation techniques.
Platform #3 will however give an important insight into the performance of solving these
problems on a massively parallel architecture. Platform #3 furthermore supports a higher
compute capability, which makes even more optimisation techniques available, as well as
double precision operations.
Appendix D – Development environment
problems and solution model
Making VS2008 ready for Cuda development was a challenge. The following is a description of
the problems experienced and the solution used.
Development model
Cuda toolkit version 3.2 supports Microsoft Visual Studio 2005 (VS2005) and Visual Studio 2008
(VS2008). It is possible to enable development in Visual Studio 2010 (VS2010), but this has
proven difficult to set up. This is, among other reasons, due to the fact that the Nvidia Cuda
compiler (NVCC) requires either a Visual C++ version 8 or 9 compiler.
I have tried to set Cuda up for VS2010, but the trouble has led me to the conclusion that
the problems and minor inconveniences far exceed any gains achieved by using VS2010.
The Nvidia GPU computing SDK, a separate package, provides help, tutorials, utility helpers
as well as code examples. With this package, all the hard work of configuring Visual Studio
and setting up paths and environment variables is done for you. However, with it follow
libraries packed with utility and helper functions and references to other libraries.
Performance is important in this project, and it is not possible to say what impact any
referenced libraries or utility functions might have, which is why I have decided to create
a new and clean project model that can serve as a base for the performance tests in Cuda.
By doing so, I get valuable insight into the structure of the toolkit and its applicability.
Cuda C and C++
Device code has to be written in Cuda C, a language based on ANSI C. Host code, on
the other hand, does not.
The language C has evolved since its initial release, and C++ provides new features and
updated libraries. It can therefore be desirable to use a mix of C and C++ when coding for the
host, while device code must be written in Cuda C.
In the project model C++ code must be contained in .cpp files and C for Cuda in .cu files.
With the correct configuration it is possible to make NVCC compile .cu files and Visual C++ 9.0
compile the rest. The linker's responsibility is then to link the compiled objects
and functions into a single executable file.
A description of what steps I had to take, and the project model can be found on my blog:
http://blog.ovesens.net/2011/05/cuda-v3-2-template-project-using-cpp/
Appendix E – CGMA and Cuda profiler
Being able to optimise requires event measuring or profiling. But event measuring or profiling
requires knowledge of what to profile. The Nvidia paper on “Analysis Driven Optimization”
[10] identifies four categories of what can limit a kernel's performance: memory throughput,
instruction throughput, latency or a combination of the above.
To achieve the best performance it is important to strike a good balance in the
instructions:bytes ratio, also called compute to global memory access (CGMA) [4]. Two
methods should be applied when trying to identify optimisation possibilities: source
code analysis and tool profiling.
By looking at the source code, the developer can analyze whether a kernel is memory or
instruction bound, and whether the ratio between these two is limiting the performance.
To measure events, Nvidia provides a tool called “Compute Visual Profiler” that provides
different counters. Different profile counters are available for GPUs of different CC levels;
a complete list and description can be found in the “Compute Visual Profiler User Guide”.
The higher the compute capability, the more detailed and accurate the counters, but that does
not mean that this project's development hardware, with a compute capability of 1.1, does not
provide any counters with valuable insight. These are shown and described in the following
table.
Counter             Description
divergent branch    Number of divergent branches within a warp. This counter is incremented by one if at least one thread in a warp diverges (that is, follows a different execution path) via a data dependent conditional branch. The counter is incremented by one at each point of divergence in a warp.
instructions        Number of instructions executed.
gld uncoalesced     Number of non-coalesced global memory loads.
gld coalesced       Number of coalesced global memory loads.
gst coalesced       Number of coalesced global memory stores.
local load          Number of local memory load transactions. Each local load request will generate one transaction irrespective of the size of the transaction.
local store         Number of local memory store transactions; incremented by 2 for each 32-byte transaction, by 4 for each 64-byte transaction and by 8 for each 128-byte transaction for compute devices having compute capability 1.x. It is incremented by 1 irrespective of the size of the transaction for compute devices having compute capability 2.0.
Table 14 - Selected profile counters from Compute Visual Profiler User Guide
These profile counters can give valuable insight into what a kernel actually does, but they
cannot be used without consideration. Nvidia writes the following:
“Compute Visual Profiler values are best used to identify relative performance
differences between un-optimized and optimized code.”
But holding the profiled numbers together with analysed numbers gives a good estimate of
how much bandwidth is wasted by suboptimal coalescing of memory access [9].
Appendix F – Matrix-multiplication CC levels
Kernel                          CC 1.1             CC 1.3             CC 2.0
Resulting matrix                7.80 gigaflops     7.80 gigaflops     14.76 gigaflops
Tiled v1                        19.12 gigaflops    19.11 gigaflops    19.86 gigaflops
Tiled v2 (latency hiding)       31.97 gigaflops    31.95 gigaflops    33.57 gigaflops
Tiled v3 (prefetching)          32.83 gigaflops    32.84 gigaflops    34.57 gigaflops
Tiling v4 (2 outputs/thread)    33.46 gigaflops    33.43 gigaflops    36.90 gigaflops
Tiling v5 (4 outputs/thread)    39.47 gigaflops    39.43 gigaflops    37.06 gigaflops
Results from the matrix-multiplication compute capability levels test on platform #4.
Appendix G – Report page count
The number of pages in this report is estimated from the following rule:
• 2400 characters constitute a normal page
Characters:
This report holds in total 142,222 characters ≈ 59.26 pages