
UNIVERSITY of

GLASGOW

Investigating the Potential for Hardware Accelerating of Artificial Retina Transform

Argyrios Yfantis

September 2008


University of Glasgow Faculty of Information & Mathematical Sciences

Department of Computer Science

Investigating the Potential for Hardware Accelerating of Artificial Retina Transform

Argyrios Yfantis <[email protected]>

A dissertation presented in part fulfilment of the requirements of the Degree of MSc in Computing Science at The University of Glasgow


Abstract

This dissertation reports the successful design and implementation of a software library that realises a biologically motivated vision system with near real-time performance. This is achieved by mapping a pre-computed artificial retina onto the parallel architecture of a modern graphics card, which primarily enables any image input to be sampled and reconstructed, and pyramids to be created. The computational cost of the described problem is currently one of the principal limitations in Computer Vision preventing biologists and computer scientists from experimenting with biologically motivated vision systems. The proposed approach opens new horizons in various research fields, such as image processing: the proposed model lacks nothing found in a traditional vision system, it is fast (performance increased by up to 92x compared to previous implementations) and it is easy to use. A careful validation procedure is described that ensures the quality of the produced results on both static images and video input.

After reading this dissertation, the reader should understand the scope of the problem, its specific objectives, its needs, its limitations, the author's approach and what has been achieved. Results are presented in various ways, for the reader's own judgment.

This dissertation is submitted in part fulfilment of the requirements of the Degree of MSc in Computing Science at the University of Glasgow.


Acknowledgements

This project would never have ended successfully had some people not helped me along the way. I would like to thank the following:

My supervisor, Paul Siebert, firstly for giving me the chance to explore the new and exciting field of biologically motivated vision, but also for his constant help, advice and encouragement during the development, fuelled by his passion and devotion to the field.

Sumitha Balasuriya, for his prototype work on the artificial retina, which helped me understand every single detail of the field, as well as for taking the time to comment on my results.

Paul Keir, from the Computer Vision and Graphics Research Group, for his valuable advice on CUDA programming, drawing on his previous experience.

Indradeo Ram, also from the Computer Vision and Graphics Research Group, for providing me with the Matlab code to generate the logz retina tessellations.

Finally, Robert Fletcher from the University of York, for his useful contribution to the GLUT library and for providing me with the source code.


Contents

Abstract
Acknowledgements
List of Figures

Chapter 1  Introduction
  1.1 General Introduction
  1.2 Broader Objectives
  1.3 Broader Motivation
  1.4 Graphics Processing Unit
  1.5 An Introduction to Artificial Retina
  1.6 Dissertation Outline

Chapter 2  Research Problem Statement
  2.1 The Problem
  2.2 Objectives Addressed
    2.2.1 Primary Objectives Addressed
    2.2.2 Secondary Objectives Addressed
  2.3 Challenges
  2.4 Motivations
    2.4.1 Biological Motivation
    2.4.2 The Power of Parallelism
    2.4.3 A New Machine Vision Approach
      2.4.3.1 Introduction
      2.4.3.2 Traditional Machine Vision System
      2.4.3.3 Biologically Motivated Machine Vision
  2.5 Discussion & Conclusion

Chapter 3  Previous Work
  3.1 On Human Vision
    3.1.1 The Human Eye
    3.1.2 Processing Visual Information
    3.1.3 Imitating the Eye
  3.2 On Artificial Retina
    3.2.1 Some Observations
    3.2.2 Creating the Tessellation
    3.2.3 Making the Tessellation
    3.2.4 Image Sampling & Reconstruction
      3.2.4.1 Sampling an Image
      3.2.4.2 Back-Projecting the Receptive Fields' Responses
    3.2.5 Previous Attempts to Model the Retina
      3.2.5.1 The log(z) Transform
      3.2.5.2 The log(z+α) Transform
      3.2.5.3 Uniform Fovea Models
    3.2.6 Hardware Retina
    3.2.7 Uses of an Artificial Retina
      3.2.7.1 Image Processing
      3.2.7.2 Pyramids
  3.3 On Software
    3.3.1 The Matlab Designed System
    3.3.2 The Java Designed System
  3.4 On GPU Programming
    3.4.1 A New Programming Language (CUDA-C)
    3.4.2 Architecture
    3.4.3 Programming Model
    3.4.4 Memory Issues
      3.4.4.1 Bank Conflicts
      3.4.4.2 Memory Coalescing
  3.5 Discussion & Conclusion

Chapter 4  Implementation
  4.1 Functional Specification
  4.2 Non-Functional Specification
  4.3 Fields to Focus
    Retina Storage
    Overlapping Areas
    Distributing Workload
    CUDA-Related Issues
  4.4 Design
    4.4.1 Sampling
    4.4.2 Reconstruction
    4.4.3 Pyramids
  4.5 Basic Algorithms
  4.6 Implementation Notes
    4.6.1 On HART Library
    4.6.2 On CUDA Programming
  4.7 Results
    4.7.1 Reconstructed Images
    4.7.2 Reconstructed Pyramid Layers
  4.8 Discussion & Conclusion

Chapter 5  Validation
  5.1 Validation Process
    5.1.1 Measurable Variables
      Speed
      Responses Accuracy
      Image Quality
    5.1.2 Experiments Description
  5.2 Retinal Responses
  5.3 Reconstructed Images
  5.4 Pyramids Evaluation
  5.5 Execution Times
  5.6 Kernel Optimization
  5.7 Retina Optimization
  5.8 Discussion & Conclusion

Chapter 6  Final Conclusion
  6.1 Overview
  6.2 Further Work

References & Bibliography
Appendices
  Appendix A: Occupancy per Kernel Results
  Appendix B: Full HART API
  Appendix C: Examples


List of Figures

1-1 A modern graphics card
1-2 Procedure outline for a biologically motivated vision system
2-1 Structure of a digital picture
3-1 Anatomy of a human eye
3-2 Representation of a human eye's retina
3-3 Detailed representation of a human eye's retina
3-4 Representation of an 8,192-node artificial retina
3-5 Receptive fields structure
3-6 Back projection of grayscale image
3-7 A logz model sampled image representation
3-8 A log(z+α) model sampled image representation
3-9 A uniform fovea model sampled image representation
3-10 Cortical filters functionality
3-11 The original Matlab Vision System architecture
3-12 An example of creating a small retina tessellation
3-13 An example of reconstructing an image with a small retina tessellation
3-14 The Java designed system architecture
3-15 CUDA hardware model
3-16 CUDA thread batching model
3-17 CUDA memory model
4-1 Basic architecture and data flow of a biologically-motivated application
4-2 Sample file sizes for some of the generated retinas
4-3 Sampling design of the HART library
4-4 Reconstruction design of the HART library
4-5 Artifacts in reconstruction generated due to race conditions between threads
4-6 Pyramid creation design of the HART library
4-7 Receptive fields' minimum distance for pyramid layers' creation
4-8 Main data structures of the HART library
4-9 Overview of the HART library
4-10 Overview of the HART functions that are executed on the GPU
4-11 Gaussian reconstruction without normalization
4-12 Gaussian reconstruction with normalization
4-13 Voronoi reconstruction
4-14 Logz retina reconstruction (both Voronoi and Gaussian)
4-15 Pyramid layers Voronoi-based reconstruction
4-16 Pyramid layers Gaussian-based reconstruction
5-1 The standard Lena image that was used during testing
5-2 Vectors difference between HART and Matlab
5-3 Vector attenuation over eccentricity
5-4 Pixel-by-pixel difference of HART's and Matlab's Voronoi reconstruction
5-5 Pixel-by-pixel difference of HART's and Matlab's Gaussian reconstruction
5-6 Root Mean Squared Error for the pyramid layers' responses
5-7 Execution times and speedup compared to Matlab version
5-8 Screenshot of a program while using HART in real-time
5-9 Execution times varying retina size
5-10 Execution times varying pyramid size
5-11 Register and shared memory usage per kernel
5-12 GPU occupancy varying register count
5-13 Occupancy varying shared memory usage
5-14 Fovea's field of view recommended values
5-15 Execution times varying blocks number
5-16 Execution times varying threads number
5-17 Reconstruction times varying tile size
5-18 Execution times varying multiprocessors
5-19 Optimal Multiprocessor variable values


Chapter 1

Introduction

This first chapter introduces the reader to the general nature of the problem and what this work sets out to analyze. Its broader objectives and general motivation are explained, and the term "artificial retina" is defined. The chapter ends with a brief outline of what follows. After reading it, the reader should be ready to understand in detail the problem as well as the proposed solution, which are explained in the following chapters.

1.1 General Introduction

This project involves the development of a hardware-accelerated artificial retina, used to achieve real-time performance on the transformation functions needed by a biologically motivated vision system. The present research builds on previous work by the Computer Vision and Graphics Group in the Department of Computing Science at the University of Glasgow on a biologically motivated machine vision system for object recognition [1]. The timing performance of the previously designed system is prohibitive for real-time conditions, limiting its use. The artificial retina is inspired by the way mammalian vision works, allowing varying sampling densities along the visual field. The basic operations in a biologically motivated vision system are to sample and reconstruct an image, given any retina model. Sampling an image should take as little time as possible if the result is to be processed further by an Image Processing algorithm as part of a full application. This work addresses that limitation by mapping the artificial retina onto a graphics card's parallel architecture, using special programming techniques.

1.2 Broader Objectives

Justifying Dr. Denis Waitley's words that "seeing is believing" scientifically is not too difficult if one bears in mind that 70% of the information the human brain receives and processes derives from the eye. This is evidence of the importance of the eye to humans, as well as to the majority of mammals. But what a human understands conceptually from an image is not actually what his own eyes see, because the human brain intervenes, for better or worse, to construct the final image. So far, all the information processed in photographs or motion video is a representation of that "brain image". However, recent advances in biology and anatomy have demonstrated that a pre-processing stage also takes place in the eye before the signal is sent to the brain. The general idea behind the artificial retina is that if we manage to get a picture of how things are represented at this lower level, we may be able to access parameters of human vision that were previously unknown.

The present project tries to overcome the computational cost of sampling an image with an artificial retina by using state-of-the-art programming techniques and taking advantage of new, cheap and powerful hardware (GPUs). Given a well-defined reference model for an artificial retina, the objective is to sample and reconstruct an image very fast. This implementation exploits the advantages of using GPUs and parallel techniques for general-purpose computation. Code optimizations, as well as tuning different retinas with different GPU parameters, form a large part of the present research and, as exhaustive testing has shown, helped to further improve performance.

The library was expanded as much as time allowed, in order to support as many of the functions found in traditional vision systems as possible. With a space-variant vision engine that bridges the gap of large execution times, all known Image Processing algorithms can subsequently be adapted to work with it. Different representation models are compared as well.

1.3 Broader Motivation

The writer was inspired to undertake the present work by some well-known applications of demanding tasks implemented on the graphics card. Throughout computing history, increasing performance has always been the motive force in Computer Science. Graphics cards' total throughput doubles every year, and nobody seems to dislike this evolution: their development rate no longer follows the well-known Moore's law [41], which expects the transistor density to double approximately every 18 months. Many stand to benefit, from research biologists, who will have the option to perform faster experiments, to computer programmers, who will gain experience with another high-end design, and computer vision researchers, who will have the option of applying already existing methods to a new set of data – that of space-variant vision.

1.4 Graphics Processing Unit

A graphics processing unit (occasionally known as a visual processing unit) is a computer device that renders the graphics and video output of the computer to the screen. Modern GPUs are very efficient because of their parallel architecture and overall processing throughput. This is because they incorporate specially designed microchips dedicated to calculating mathematical operations with floating-point precision, as required by 3D graphics applications.

Throughout their history, graphics cards have evolved in various respects and have changed form many times. In the late 1970s GPUs did not exist at all; display hardware was simply driven by a microprocessor without dedicated drawing capabilities. In the 1980s the Commodore Amiga was the first computer to have all the graphics functions in one single chip. In the 1990s, with the development of OpenGL by Silicon Graphics, high-resolution 2D bitmapped pictures were drawn by GPUs, which by then had their own buffer; this is when a common VGA (Video Graphics Array) controller was useful. Throughout the 1990s a constant evolution of software (Microsoft helped with DirectX) and hardware took place, which gradually moved GPUs into 3D graphics, realistic texturing and geometry, stream processing and, in the late 1990s, programmable graphics. Finally, in the 2000s the idea of the general-purpose GPU (GPGPU) was introduced. Simply put, the processing throughput of GPUs has reached such a point that they are faster than any CPU, and the entire floating-point shader pipeline can be used for general-purpose computation. This is because graphics have always demanded a lot of processing power. Newer graphics cards devote more transistors to data-parallel computation than to flow control or caching, in contrast to CPUs.

Using a special unit that serves only a single purpose is always better than having a central processor that deals with many jobs of a different nature. The beginning was made with graphics, and the approach has extended to other demanding applications as well, such as audio reproduction and physics-based simulation.

For the needs of the present work, Nvidia's CUDA (Compute Unified Device Architecture) GPGPU approach has been chosen to solve the problem. It is currently one of the few frameworks that support GPGPU, and it is well supported by a wide community that offers continuous updates. Its architecture and functionality are described in Chapter 3.
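To give a flavour of the programming model before that detailed treatment, the fragment below is a minimal, self-contained CUDA sketch (not HART library code) that scales an image buffer on the GPU; the buffer name, block size and kernel are all illustrative:

    #include <cuda_runtime.h>

    // Minimal illustration of the CUDA model (not HART library code):
    // each GPU thread scales one pixel of a grayscale image buffer.
    __global__ void scalePixels(float *img, int n, float gain)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)              // guard: the grid may overshoot n
            img[idx] *= gain;
    }

    void scaleOnGpu(float *hostImg, int n, float gain)
    {
        float *devImg;
        cudaMalloc((void **)&devImg, n * sizeof(float));
        cudaMemcpy(devImg, hostImg, n * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256;                         // threads per block (illustrative)
        int blocks  = (n + threads - 1) / threads; // enough blocks to cover n pixels
        scalePixels<<<blocks, threads>>>(devImg, n, gain);

        cudaMemcpy(hostImg, devImg, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(devImg);
    }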

Figure 1-1: A modern graphics card (Nvidia 8800 GTS), produced in 2006. Image taken from [32].

1.5 An Introduction to Artificial Retina

Abstractly, one can think of an artificial retina as a map of sampling points mimicking the sampling strategy of the mammalian eye's retina. As illustrated in Figure 1-2, each point resembles the center of a circle-shaped receptive field (its support region), whose size grows as we move away from the center of the retina. The density of the points is expected to be high in the center (resembling the fovea region of the human eye) and sparse in the surrounding region.

However, the problem is how these points are calculated and how the projection is made between the input image and the retina tessellation map. The described transformation is known as the Retino-Cortical (RC) transformation and is well researched, because it is said to remove the effects of scale, 2D rotation and other projective distortions [2, 3]. Many methods have been developed, and some of them are presented in this dissertation.

After constructing the retina, its main use is that one can apply it to an image and obtain the responses of the receptive fields. The result is a somewhat unusual description of the same image, which has all the advantages mentioned before and can still be processed and edited by any conventional Image Processing algorithm. Afterwards, an image reconstruction can be computed in order to visualize the information. This final image will be viewable and conceptually as understandable as the original one, at least in theory.

Figure 1-2: Process of sampling an image using retinal receptive fields, storing it to an image vector and reconstructing the original image from it. Image adapted from [1] with changes.

1.6 Dissertation Outline

This dissertation is split into chapters, allowing better organization of the material. After this basic introduction, the second chapter states the problem in detail, explains why it is worth solving and lists the objectives that have been achieved; an introduction to the validation process is given as well. The third chapter analyzes the relevant previous work in the field; this project builds on that work and tries to address some of its weaknesses, and existing software tools that the writer studied to understand those weaknesses are introduced. The fourth chapter describes the basic design, gives some implementation details and presents results. In the fifth chapter the results are validated according to the previously described test procedure. Finally, in the last chapter, after reviewing the presented information, the writer critically analyzes the results, outlines the overall advantages of the current implementation and gives some ideas for further work.


Chapter 2

Research Problem Statement

In this chapter the problem is defined, as well as the objectives that have been achieved in terms of it. A detailed validation process is also described, which will be followed later on to validate and assess the results. The end of the chapter explains why it is important to solve this specific problem. In the following chapters a solution to the stated problem is given, after presenting the scientific background on which the solution is based.

2.1 The Problem

As mentioned in the previous chapter, one of the principal limitations that discouraged researchers from using biologically motivated machine vision theory was that it was time- and resource-consuming, which made its use prohibitive, especially for real-time systems. The presented work implements a set of algorithms, using modern graphics cards' capabilities, that enable users to obtain near real-time performance of the retinal function (i.e. the retino-cortical transformation). The present work also tries to address the advantages of using such techniques and explain why they are important for the future.

2.2 Objectives Addressed

2.2.1 Primary Objectives Addressed

In the presented work we are not primarily interested in generating the retina tessellation. It is assumed that an algorithm generating the tessellation is provided (see [1]), because the tessellation usually does not have to change frequently in a space-variant application. In case it does (for example, when using pyramids), many different retinas can be kept in memory, choosing each time which one to use. Comparing different models is also of scientific interest, and that comparison is done in this project. This is why the implementation is dynamic and usable with any kind of tessellation, and must interface to either image sequences or a video stream. Any parameter can change at execution time by passing arguments; this adds further flexibility, as it allows, for example, the retina to "rove" during execution. Some reference artificial retina models were created using Balasuriya's original Matlab code [1, 4] by varying the number of nodes. Other retina models were tested as well for comparison. The same code is used for creating the layers used in pyramid construction.

The main objective fulfilled with the present work is to create all the necessary functions that:


• Manage an already existing retina (save and load from a file or memory)
• Get the responses from the receptive fields (sample an image)
• Project the responses back to an image (two methods are available)

The objectives above are not trivial to implement and require careful thought and design. The main reason is that all of the above must be done fast – as fast as the eye sees, which, in computer terms, translates to a workload of 25 frames per second. The result of the present project is a library that can be used as an intermediate filter before processing any kind of image data. It provides a biologically motivated view of the world.

The speed constraint nevertheless remains, because of the dimensionality of the data. Generally, it was concluded that the initial claims are feasible only up to a certain extent: the library is capable of achieving near real-time, rather than fully real-time, performance. Either way, overall performance was increased, as can be seen in Chapter 5.

2.2.2 Secondary Objectives Addressed

Upon the successful completion of the primary objectives, the author had sufficient time to work on some other aspects as well. Without reducing the importance of the primary objectives, the author extended the research work to expand the library to support the following:

• Fast creation of pyramids on the GPU.
• A normalization process when back-projecting to the image plane with Gaussian reconstruction, creating smoother results.
• Full color compatibility (RGB).
• Basic displaying functionality.

As alternatives to the above, it had been planned either to perform basic Image Processing using cortical filters or to automatically create the retina tessellation on the GPU.

Finally, as the library has been thoroughly tested, it is fully interoperable with any other imaging software (including the original Matlab vision system) or development tools, by using new, efficient data types, making it more useful. All of the above are provided through an understandable and easy-to-use Application Programming Interface (API).

2.3 Challenges

The described work represents a challenge significantly greater than implementing an algorithm on a common CPU architecture. Various aspects of the problem had to be taken into account before development began. In general overview, the main processing task involves a set of data structures resembling the retina's receptive fields (a set of matrices), which are convolved with an image area (another matrix), performing calculations and storing the results in an appropriately indexed manner.

The approach hides some critical points: the number of matrices is very large, not known in advance, and the matrices differ in size. Moreover, the values are floating point, which increases computational complexity. Each matrix needs access to parts of the image that may not be in memory, i.e. processing is not done sequentially, so dealing with memory issues and accumulating results is a challenge. The approach must not be brute-force, as overlapping parts exist; there is a risk of transferring the same image data to memory more than once, which is inefficient. Storage is critical, because the generated kernels can get very big, and so can the data structures that will be used, as GPU memory is not endless. Mapping the retina architecture and computational structure onto the CUDA architecture, such that we can harness the potential parallelism of the Nvidia graphics card within reasonable memory limits, is another challenge. Transferring data from the CPU memory to the GPU memory is certainly the bottleneck of the application. An investigation was also required to determine what really needs to be transferred and how the data structures should be defined. The problems become more significant if we examine Nvidia's GPU architecture, which sets some serious limitations and is explained in Chapter 4.

2.4 Motivations

2.4.1 Biological Motivation

Scientists working in the Biologically Motivated Computer Vision research field are interested in having a true and reliable way of representing the retino-cortical transformation in real time. Since experiments in biology are, by their nature, difficult, infrequent and sometimes risky, a simulation of the eye would certainly help researchers draw conclusions about how the eye reacts under different conditions. Moreover, answers may be given to questions regarding various vision disruptions due to retina damage: it has been reported that blunt trauma can damage parts of the retina, resulting in partial or full visual loss [29].

2.4.2 The Power of Parallelism

As has already been mentioned, one of the major drawbacks of the artificial retina is its complexity, which requires huge amounts of both processing power and time. But during the last decade an exponential growth of processing power has been achieved in graphics cards, which are currently faster than the central processors of a computer system. To mention an example, one of the latest cards from the dominant graphics card manufacturer, Nvidia (model GeForce 9800 GX2), counts about 128 stream processor units, each running at 1500 MHz, sharing a total of 1,024 MB of memory at 1000 MHz clock speed over a 256-bit memory bus. According to benchmarks [33], the total throughput of the card is approximately 768 GFlops, which translates to 768 billion floating-point operations per second. With the cost of obtaining such a card being relatively reasonable, more and more researchers are adapting their implementations to parallel systems, which appear to be the near future of computer systems. Additional performance gains can be achieved by combining two graphics cards using the Scalable Link Interface (SLI), a technology provided by the manufacturer; in that case the total throughput is expected to reach 1 TFlop. Other manufacturers, such as ATI-AMD, have presented similarly powerful models with similar techniques for combining the power of two graphics cards. ATI-AMD also plans to release a stream processor devoted to high-performance computing, under the name "Firestream", with 500 GFlops on a single card, but these are not yet widely available.

For the specific problem, implementing the algorithms with parallel programming techniques is, first of all, a necessity if real-time performance is required and, secondly, a challenge that requires a good design for splitting the total workload.

2.4.3 A New Machine Vision Approach

2.4.3.1 Introduction

"Computer Vision" and "Machine Vision" are two terms often used to describe the same thing. One might say that machine vision is the knowledge of computer vision transferred into the manufacturing industry. In brief, machine vision is the set of disciplines that enable a computer to capture and understand an image. According to [14], the purpose of such a system is "to analyze images and produce descriptions of what is imaged". Usually, people want to extract certain information from an image and process it afterwards in a different way; that is why machine vision systems are often considered, and used, as subsystems of bigger systems. Machine vision is closely related to other research fields such as Image Processing and Pattern Recognition, but implementations can be found almost everywhere, from Medicine to Astronomy. That is why it is an important and constantly developing research area.

2.4.3.2 Traditional Machine Vision System

The traditional way of representing an image in a computer system is also the simplest one. The basic requirements are that at any given time we are able to:

• point out where an object is in the picture
• tell how bright or dark a specific area is

The result is a two-dimensional array in which every value represents the intensity of each pixel. There are three common image types. Black and white, where the values are either 0 (for black) or 1 (for white). Grayscale, where the possible intensity values vary from 0 to 255, enabling more detail to be represented. Finally, there are color images, where each picture is a set of three two-dimensional arrays, one for each color (red, green and blue – RGB); depending on the chromatic model this may change – CMYK, for instance, uses four arrays (cyan, magenta, yellow and black). The intensity scale varies depending on how many bits are used to store the values: 8, 16, 24, 32 or 64 bits may be used, with more bits resulting in better quality. The projection of the real world onto a digital picture is done using the pinhole model. Apart from the images described above, more representations exist that exploit intermediate results, such as gradient images.
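For illustration, a minimal sketch (generic declarations, not tied to any particular library) of how such arrays are laid out and indexed:

    #define WIDTH  640
    #define HEIGHT 480

    /* A WIDTH x HEIGHT grayscale image stored row-major:
       one 8-bit intensity (0-255) per pixel. */
    unsigned char gray[HEIGHT * WIDTH];

    /* Intensity at column x, row y:  gray[y * WIDTH + x] */

    /* A colour image is three such planes, one per channel (R, G, B). */
    unsigned char red[HEIGHT * WIDTH];
    unsigned char green[HEIGHT * WIDTH];
    unsigned char blue[HEIGHT * WIDTH];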


Figure 2-1: Representation of a simple grayscale digital image. Adapted from [30] with changes.

Because of the simplicity of traditional machine vision systems, it is very easy to perform operations between pixels, such as convolution. However, there is a significant drawback regarding the size of an image: if someone wants to add more detail to an image or increase its dimensions, the size will grow, as more bits are needed for the color and more array values for the extra size. In real-time systems where quality and speed are both necessary, using conventional images at high resolutions requires prohibitive processing times. Traditional machine vision systems also face difficulties when representing and manipulating arbitrarily sampled images.

2.4.3.3 Biologically Motivated Machine Vision

In order to overcome the problems of the traditional vision systems described above, a new kind of system was introduced, modelled on the vision system of mammals. A biologically motivated machine vision system has the same characteristics as a traditional one, plus one more: it uses space-variant sampling and processing, borrowed from nature.

Space-variant sampling applies non-uniform sampling to the image, which reduces the dimensionality of the data and therefore requires less processing power and time. That is how genetic evolution managed to reduce the information so that it could be processed efficiently by the visual cortex; the human brain would weigh around 60 kilograms if the sampling density at the fovea were duplicated across the entire field of view. Large, high-resolution images can thus be processed easily, which in traditional systems would have been almost impossible [1].

However, so far the sampling process has been so complicated and time-consuming that it was not viable at all for real-time experiments, which is why the traditional vision system has been used so widely. With the present work, if the implementation is successful, new horizons open in machine vision, since people will be able to get a low-level view of the world at virtually no cost. This means that, in theory, all the existing algorithms in Image Processing can be adapted to be applied to a new set of data – that of the retina. Moreover, the performance of those algorithms will not be affected by the size of the input data, as the retina takes care of that by decreasing the quality appropriately. This is a unique advantage. The writer believes that the presented work will boost the interest of researchers and make them reassess which representation of the image they prefer. The main motivation to do so is the data reduction that can be achieved. This can help compression algorithms, as well as the overall efficiency of existing algorithms, since they are applied to smaller amounts of data; when dealing with even larger images, the need is stronger. To all the above it may be added that, even though biologically motivated vision is a fairly old idea, few publications exist in the field, which indicates that there is a lot more to discover.

2.5 Discussion & Conclusion

In this chapter the problem, its objectives and the writer's motivations were stated in detail, and some information about existing machine vision approaches was presented. The main objective of this work is to examine how the retina algorithms can be mapped onto an Nvidia GPU architecture in order to increase performance. Some additional work is described in section 2.2.2.

The problems of traditional vision systems will only be exposed when we move to another scale. So far, real-time systems (CCTV, for example) work at fairly poor resolutions. As high-definition video has already entered our homes, we simply cannot continue using the same systems; however, computer systems have not advanced to the point where they would enable us to edit high-definition input in real time. Here the retina helps, by applying space-variant sampling and reducing the data.

The reader will get a clearer idea of the problem's domain from the following chapter, where detailed explanations are given of what has been done in the research field.


Chapter 3

Previous Work

This chapter presents the background theory on which the current solution is based. The ways of creating an artificial retina, creating tessellations, sampling an image and visualizing a retina are explained. Existing software and programming tools, including those the author used to help solve the problem, are described and assessed as well. By the end of this chapter, the reader should understand the problem and its requirements in detail.

3.1 On Human Vision

3.1.1 The Human Eye

The human eye is an organ that senses light and is one of the most complicated organs in the human body. It consists mainly of four parts:

• A tough outer layer, known as the sclera, which surrounds the entire eyeball. At the front of the eye the sclera becomes transparent (the cornea) in order to let light enter the eye.

• An inner layer, called the choroid, which leaves a small hole at the front (the pupil) to let the light through.

• The lens, a transparent protein disk located just behind the cornea. The lens helps humans focus by changing its shape: when focusing on a close object the lens becomes almost spherical, and when focusing on distant objects it flattens. Surrounding suspensory ligaments help it do so.

• Attached to the choroid, at the back of the eye, is the retina. The retina is responsible for translating light into signals that are sent to the brain through the optic nerve, which is why it contains photoreceptors. There are two types of photoreceptors: rods (about 125 million) and cones (about 6 million). Rods are more sensitive to light than cones but cannot distinguish color, which is why they contribute more to night vision; cones, on the other hand, are more responsive. Rods are found in greater density around the retina than cones, which are concentrated in the fovea – the center of the visual field; rods are completely absent from the fovea.

Figure 3-1: Anatomy of the human eye. Adapted from [31] with changes.

The rest of the eye is filled with a jelly-like vitreous humor, which helps focus the light on the retina.

In terms of color, it has been reported that the human eye shows high sensitivity to high light frequencies (blue), medium sensitivity to medium frequencies (green) and low sensitivity to low frequencies (red).

3.1.2 Processing Visual Information

The processing of visual information begins in the retina itself. Both rods and cones make synapses with neurons (horizontal cells, bipolar cells and amacrine cells), which in turn transfer the signal towards the brain. The connection is not direct, though: other neurons (ganglion cells) intervene in the communication, creating a quite complex neural system with numerous possible pathways for the information to reach the brain. A photoreceptor can transfer information directly to a bipolar cell and then to a ganglion cell, in which case the image will appear sharper; or it can first transfer information to a horizontal cell, which accumulates information from other photoreceptors too and then transfers a single signal to a ganglion cell, in which case the image will appear less sharp. All the rods and cones that feed information to one ganglion cell form the receptive field of that cell.

The neurons then transfer the signal to the primary visual cortex of the brain to create the visual perception.

Figure 3-2: A section of the retina. Taken from [34].

Figure 3-3: Zoom into the retina. Image taken from [34].


3.1.3 Imitating the Eye

After years of studying the functionality of the eye, this model is nowadays widely used in consumer manufacturing to build machines that take pictures. Photo and video cameras use lenses to focus the light and direct it through an aperture onto a surface that records what the lenses are pointing at. The aperture can shrink and enlarge, controlling the amount of light that enters and thereby increasing or decreasing the depth of field.

The special surface that translates light into a picture is traditionally a film stock or, as electronics have advanced lately, a Charge-Coupled Device (CCD). A CCD translates the light into electricity, which is processed by special electronic circuits to produce the final picture in a digital format.

Of course, more parameters affect the final version of the image (the lens, for example), but their investigation is beyond the scope of the present work.

3.2 On Artificial Retina

3.2.1 Some Observations

The function of the artificial retina is to reduce the bandwidth needed to process full-resolution images. When each area of an image is processed equally with all the others, a huge amount of processing power is needed. The artificial retina instead samples the image non-uniformly in order to reduce information. The sampling pattern consists of a small central region (like the fovea in the real retina) where the sampling density is high; as we move away from that region the sampling density gradually decreases.

In the sampled image, the outer part is expected to contain very coarse information. However, it must be ensured that the overall information is sufficient to offer a summary representation of the original image.

Furthermore, it is also vital that the visual information be uniformly distributed along the image plane, which is why overlapping fields are used when sampling; otherwise, visual artifacts or aliasing will be present.

Finally, processing times for the fovea region are expected to be relatively larger than for any other region, because of the data size. This also happens in the human eye: far more nerves process the fovea region than process all the rest of it.

3.2.2 Creating the Tessellation

A method to create a pseudo-random artificial retina is described by Clippingdale and Wilson [10]; it is based on self-similar neural networks and the Kohonen learning rule. Balasuriya's work [1], on which this project is based, modified this method by applying a composite transformation directly to the weights.

The network consists of a number of points, indicating the sampling points on the image, represented as an array. The network of points is transformed by carrying out a transformation on the network itself to create the input stimulus for the next iteration. It is an unsupervised procedure, and it has been used in retina creation because it is able to create a uniform fovea with a space-variant periphery, closely resembling the human retina.

At the beginning of the algorithm the array (of size N) is initialized randomly, and then a recursive procedure follows, where in every iteration the input is a randomly transformed version of the previous output. This can be described as:

\[ y_i(n) = T(n)\, x_i(n-1) \]

where \(y_i(n)\) is the i-th input stimulus point, \(x_i(n-1)\) the i-th point of the tessellation at the previous iteration \((1 \le i \le N)\), and T the transformation applied to the input points, defined as the following steps:

1. A random rotation about the centre of the coordinate space, between 0 and 2π.

2. A dilation (increase in eccentricity from the centre of the coordinate space) by the exponent of a dilation factor drawn randomly between 0 and log(8). This causes the network units in the periphery to be transformed more than those inside the fovea.

3. A random translation between 0 and f, where f is associated with the size of the fovea relative to the full size of the retina.

At each iteration, the following learning rule is applied in order to calculate the weight vector \(x_j(n)\):

\[ x_j(n) = x_j(n-1) + \alpha(n) \sum_{i \in \Lambda_j(n)} \big( y_i(n) - x_j(n-1) \big) \]

with

\[ \Lambda_j(n) = \big\{\, i : \lVert y_i(n) - x_j(n-1) \rVert < \lVert y_i(n) - x_\kappa(n-1) \rVert,\ \kappa \ne j \,\big\} \]

where \(\Lambda_j(n)\) holds the indices of the input stimuli \(y_i(n)\) for which \(x_j(n-1)\) is the closest network vector. The learning parameter α is linearly reduced during the self-organisation to help the network converge faster.

Depending on the transformation method, results vary; Balasuriya [1] carried out an interesting benchmark on this. Whether translations happen horizontally and vertically; horizontally, vertically and radially away from the centre; only radially away from the centre; or randomly, the shape of the fovea changed. Ideally the fovea would be truly round, with no discontinuities in density, but the best achievable result is hexagonal. Interestingly enough, this is the same shape that nature uses for mammalian vision.
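The overall procedure can be summarised in the following sketch (illustrative C++ only; for brevity it applies the update per stimulus rather than accumulating the sum in the learning rule, and the transform parameters follow the three steps above):

    #include <cmath>
    #include <cstdlib>
    #include <vector>

    struct Pt { float x, y; };

    static float frand(float lo, float hi)   // uniform random value in [lo, hi)
    {
        return lo + (hi - lo) * (float)rand() / (float)RAND_MAX;
    }

    // Self-organise the tessellation points x (sketch of the procedure above;
    // f is the fovea-related translation bound from step 3).
    void selfOrganise(std::vector<Pt> &x, int iterations, float f, float alpha0)
    {
        const int N = (int)x.size();
        std::vector<Pt> y(N);
        for (int n = 1; n <= iterations; ++n) {
            // One random transform T(n) per iteration (steps 1-3 above).
            float th = frand(0.0f, 6.2831853f);       // rotation, 0..2*pi
            float s  = expf(frand(0.0f, logf(8.0f))); // dilation factor
            float tx = frand(0.0f, f), ty = frand(0.0f, f);
            for (int i = 0; i < N; ++i) {             // stimuli y_i(n) = T(n) x_i(n-1)
                y[i].x = s * (x[i].x * cosf(th) - x[i].y * sinf(th)) + tx;
                y[i].y = s * (x[i].x * sinf(th) + x[i].y * cosf(th)) + ty;
            }
            // Learning rule: each stimulus pulls its closest network vector.
            float alpha = alpha0 * (1.0f - (float)n / iterations); // linearly reduced
            for (int i = 0; i < N; ++i) {
                int j = 0; float best = 1e30f;
                for (int k = 0; k < N; ++k) {
                    float dx = y[i].x - x[k].x, dy = y[i].y - x[k].y;
                    float d  = dx * dx + dy * dy;
                    if (d < best) { best = d; j = k; }
                }
                x[j].x += alpha * (y[i].x - x[j].x);
                x[j].y += alpha * (y[i].y - x[j].y);
            }
        }
    }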


Figure 3-4: An 8,192-node retina tessellation with a 20% foveal region, created after 20,000 iterations. This is the retina that was used in Balasuriya's work.

3.2.3 Making the Tessellation

Even once the retina has been created, nothing can be done with it yet: simply sampling the image point-to-point would produce an unsmooth, aliased result. In fact, an average over a few neighbouring points around each point is used. These sampling areas are known as receptive fields, exactly as they exist in the biological retina. Their size changes according to the local density: they are small in the fovea and get bigger as we move away from it. The receptive fields are calculated by finding the Voronoi region of each point \(p_i\); O'Rourke [19] defines the Voronoi region as the set of points that are at least as close to \(p_i\) as to any other site. Balasuriya [1] also defined a minimum receptive field size of 1.5 pixels. As each receptive field i is a Gaussian region, the following formula is used to calculate its standard deviation \(\sigma_i\):

\[ \sigma_i = \frac{\lambda}{2 k_i} \sum_{j=1}^{k_i} A_{i,j} \ \text{pixels} \]

where \(A_{i,j}\) is the distance between receptive field i and its j-th immediate neighbour, \(k_i\) is the number of receptive fields in the Voronoi graph at a graph distance equal to one, and λ is a fixed scaling constant which expands the receptive field's standard deviation to prevent aliasing in the retina responses by blurring the input image.

Figure 3-5: A receptive field sampling structure, with non-overlapping areas. e is the eccentricity (distance from the fixation point) and θ is the retina angle with the horizontal axis. Image taken from [25].

3.2.4 Image Sampling and Reconstruction

3.2.4.1 Sampling an Image

The responses obtained from the receptive fields of a retina applied to image data are stored in a one-dimensional array called the imagevector. However, in order to find which pixels must be taken into account to calculate one response, the floating-point coordinates of the receptive fields have to be translated to integer values that refer to the 2D image plane. Given the floating-point co-ordinates (X_i, Y_i) of a receptive field center, which refer to a pixel (x, y) in the image, the horizontal and vertical sub-pixel offsets (P_i, Q_i) from the actual integer center location need to be calculated. To place a Gaussian receptive field support region on an image with sub-pixel accuracy the following equation is used:

\[ G(m, n;\, X_i, Y_i, P_i, Q_i, \sigma_i) = e^{-\frac{(m - P_i)^2 + (n - Q_i)^2}{2\sigma_i^2}} \]

Depending on whether the rounded integer size of the support region is odd or even (possible diameters of 4σ or 6σ), the sub-pixel offset is calculated using the following formula:



\[
(P_i, Q_i) =
\begin{cases}
\text{odd:} & P_i = X_i - \mathrm{round}(X_i), \quad Q_i = Y_i - \mathrm{round}(Y_i) \\[4pt]
\text{even:} & P_i = X_i - \mathrm{round}(X_i) - \mathrm{sign}\big(X_i - \mathrm{round}(X_i)\big) \times 0.5, \\
& Q_i = Y_i - \mathrm{round}(Y_i) - \mathrm{sign}\big(Y_i - \mathrm{round}(Y_i)\big) \times 0.5
\end{cases}
\]

The final receptive field response is generated by multiplying the underlying image pixels with the Gaussian filter coefficients G, as the following equation implies:

\[ R(i) = \sum_{\forall m}\sum_{\forall n} I\big(\mathrm{round}(X_i) + m,\ \mathrm{round}(Y_i) + n\big) \times G(m, n;\, X_i, Y_i, P_i, Q_i, \sigma_i) \]

with \(m, n \to -\alpha\sigma_i \ldots +\alpha\sigma_i\) and \(m, n \in \mathbb{Z}\).

This is the value stored in the image vector; its index in the array depends on the eccentricity. Another great advantage of this method is that, due to its use of Gaussian kernels, sampling is robust to noise.
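A serial sketch of computing one receptive field's response, following the equations above (illustrative only; in practice the Gaussian coefficients are pre-computed, and the HART implementation distributes this work across the GPU):

    #include <cmath>

    // Response of receptive field i per the equations above (serial sketch).
    // (Xi, Yi): field centre; (Pi, Qi): sub-pixel offsets; sigma: its standard
    // deviation; alpha: support extent multiplier (e.g. 2 or 3).
    float fieldResponse(const float *img, int w, int h,
                        float Xi, float Yi, float Pi, float Qi,
                        float sigma, float alpha)
    {
        int   ext = (int)ceilf(alpha * sigma); // m, n span -alpha*sigma..+alpha*sigma
        int   cx  = (int)roundf(Xi), cy = (int)roundf(Yi);
        float sum = 0.0f;
        for (int n = -ext; n <= ext; ++n)
            for (int m = -ext; m <= ext; ++m) {
                int px = cx + m, py = cy + n;
                if (px < 0 || px >= w || py < 0 || py >= h)
                    continue;                  // skip pixels outside the image
                // Gaussian coefficient placed with sub-pixel accuracy
                float g = expf(-((m - Pi) * (m - Pi) + (n - Qi) * (n - Qi))
                               / (2.0f * sigma * sigma));
                sum += img[py * w + px] * g;
            }
        return sum;                  // stored at index i of the image vector
    }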

3.2.4.2 Back-projecting the Receptive Field Responses

Reconstructing the original image (or at least an approximation of it) from the sampled one, given the retina and its receptive field reference model, is possible and can be done with the following equation:

\[ I\big(\mathrm{round}(X_i) + m,\ \mathrm{round}(Y_i) + n\big) \leftarrow I\big(\mathrm{round}(X_i) + m,\ \mathrm{round}(Y_i) + n\big) + R(i) \times G(m, n;\, X_i, Y_i, P_i, Q_i, \sigma_i) \times (2\alpha)^2 \]

with \(m, n \to -\alpha\sigma_i \ldots +\alpha\sigma_i\), \(m, n \in \mathbb{Z}\), \(\forall i\),

where \((2\alpha)^2\) is a scaling factor used to prevent decay in the intensity of the reconstructed image with eccentricity.

Figure 3-6: Gaussian back projection of an image vector, created by sampling with a retina of 16,384 nodes.
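For illustration, a CUDA sketch of this accumulation with one thread per receptive field. Because neighbouring support regions overlap, concurrent threads write to the same pixels; the sketch uses atomicAdd on floats (which requires compute capability 2.0+, newer hardware than that used in this project) to avoid the race conditions that otherwise cause the artifacts shown later in Figure 4-5. All names are illustrative:

    // One thread per receptive field; overlapping fields accumulate into the
    // same pixels, so atomicAdd is used to avoid races (illustrative sketch).
    __global__ void backProject(float *img, int w, int h,
                                const float *R,                   // image vector
                                const float *cx, const float *cy, // centres X_i, Y_i
                                const float *P, const float *Q,   // sub-pixel offsets
                                const float *sig, int nFields, float alpha)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nFields) return;

        int   ext   = (int)ceilf(alpha * sig[i]);
        int   x0    = (int)roundf(cx[i]), y0 = (int)roundf(cy[i]);
        float scale = (2.0f * alpha) * (2.0f * alpha);  // the (2*alpha)^2 factor
        for (int n = -ext; n <= ext; ++n)
            for (int m = -ext; m <= ext; ++m) {
                int X = x0 + m, Y = y0 + n;
                if (X < 0 || X >= w || Y < 0 || Y >= h) continue;
                float g = expf(-((m - P[i]) * (m - P[i]) + (n - Q[i]) * (n - Q[i]))
                               / (2.0f * sig[i] * sig[i]));
                atomicAdd(&img[Y * w + X], R[i] * g * scale);
            }
    }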


3.2.5 Previous Attempts to Model the Retina

Apart from back-projecting the sampled image (also known as the image vector) onto the original image plane, it is possible to visualize the image vector itself, often called the cortical image, because it is the actual information sent to the visual cortex of the brain. Many methods have been devised for displaying a "cortical" image. The mathematical procedure of projecting the real co-ordinates of an input image to the sampled image (the result after applying the retina transformation) is known as the retino-cortical transformation, and some variants are presented here.

3.2.5.1 The log(z) Transform

The log(z), or log-polar, transform, created by Schwartz [24], is one of the most common. A point with co-ordinates (x, y) in the two-dimensional Euclidean plane can be represented by z = x + iy. Substituting the polar equivalents we get:

\[ z = x + iy \;\Leftrightarrow\; z = |z|\,[\cos\vartheta + i\sin\vartheta] \;\Leftrightarrow\; z = |z|\, e^{i(\vartheta + n\pi)} \]

where \(\vartheta = \arctan(y/x)\) and n is a real number. Therefore, the transformation of these co-ordinates to the space-variant structure is given by:

\[ \log(z) = \log\big(|z|\, e^{i(\vartheta + n\pi)}\big) \;\Leftrightarrow\; \log(z) = \log|z| + i\vartheta \]

where |z| is the eccentricity, or distance from the centre, of a given point, and ϑ is the angle the point makes with the positive x-axis, measured counter-clockwise.

One of the drawbacks of this transform is the presence of a singularity at the centre of the tessellation, resulting in super-Nyquist sampling in the fovea [1]. This means that large areas of the image are oversampled, producing redundant information.

Figure 3-7: A Log-polar tessellation (left) and a sampled image displayed in this space-variant image structure (right). Image taken from [1].
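As a minimal illustration, the forward mapping of a Cartesian point into log-polar co-ordinates might look as follows (the singularity at the origin shows up as log 0 diverging):

    #include <cmath>

    // Map a Cartesian point (x, y) to log-polar co-ordinates per the log(z)
    // transform above: real part log|z|, imaginary part theta.
    void logPolar(float x, float y, float *logRho, float *theta)
    {
        *logRho = 0.5f * logf(x * x + y * y);  // log|z| = log sqrt(x^2 + y^2)
        *theta  = atan2f(y, x);                // angle with the positive x-axis
        // note: logRho diverges at the origin -- the singularity discussed above
    }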


3.2.5.2 The log(z + α) Transform

In order to overcome the problem addressed above, the log(z + α) transform was created, also by Schwartz [23]. In addition, it offers a better visual representation by splitting the image into two visual hemispheres, which are processed separately. To achieve this, the transform simply adds a real parameter α to the polar representation of the Euclidean co-ordinates. So the transform can be described by:

\[ \log(z + \alpha) = \log\Big( \sqrt{(x + \alpha)^2 + y^2}\; e^{i\left(\arctan\frac{y}{x + \alpha} + n\pi\right)} \Big) \;\Leftrightarrow\; \log(z + \alpha) = \log\sqrt{(x + \alpha)^2 + y^2} + i\arctan\frac{y}{x + \alpha} \]

However, the cost of this transform is that, depending on the value of α, information in areas that are oversampled may be reduced or even disappear.

Figure 3-8: A log(z+α) tessellation (left) and a sampled image displayed in this space-variant image structure (right). Image taken from [1].

3.2.5.3 Uniform Fovea Models

Another way to overcome the super-Nyquist problem is to use a different sampling topology in the fovea. Bolduc and Levine [6] sampled the fovea region uniformly and used two images to display the cortical image: one for the fovea and one for the rest, according to the log(z) model. Gomes [14] used a hexagonal tessellation for the fovea and projected everything into one cortical image.

These techniques, unfortunately, suffer from sampling discontinuities, and the result cannot be managed and processed as a single unit, which imposes extra overhead.


Figure 3-9: A uniform fovea and a space-variant periphery (left) and a sampled image displayed in this space-variant image structure (right). Image taken from [14].

3.2.6 Hardware Retina

Instead of performing the image sampling with software tools, some related work has been carried out to do it in hardware. Sandini [22] created a CCD version and Ferrari [12] a CMOS (Complementary Metal-Oxide-Semiconductor) one. Both follow the same logic: varying the placement of photodetectors on the chip. Bolduc [6] concluded that, in order to achieve a processing performance of 10 frames per second, a parallel system is needed with at least six digital signal processors to calculate the retina responses. This idea certainly comes closer to real-time performance than software techniques and saves valuable processing power and memory space when used alongside software-based vision systems. However, such retinas are very expensive in terms of hardware costs, they need special hardware that is not widely available, and they are not suitable for research purposes, where researchers need to adjust various parameters to achieve the result they want. So their use is rather limited.

3.2.7 Uses of an Artificial Retina

The following paragraphs provide a basic introduction to what can be done with a retina and how. The scope of this section is to show the reader how some major concepts of Computer Vision remain usable and practicable in a biologically motivated vision system, albeit in a different way.

3.2.7.1 Image Processing

First of all, the reader should by now have understood the importance of working with space-variant vision. Sampling an image and then projecting it back to a conventional 2D plane in order to process it is not the point of this work; the question is how to work directly with the sampled image, i.e. the image vector. Conventional fixed convolution kernels cannot be applied to a cortical image for the following reasons:


• The information is not equally distributed. The fovea contains more information than other regions.

• Connectivity is not uniform. A pixel in a digital image is known to have 8 neighbours, but a receptive field in a cortical image can have 4, 5, 6 or 7 neighbours.

• Information content is not uniform either. The data stored in an image vector depend on their eccentricity.

However, related work [1] introduced the idea of "cortical filters" that can substitute for convolution kernels. The logic remains the same: a cortical filter is a larger support region spanning a number of receptive fields, and it modifies the responses of those specific receptive fields. The disadvantage of these filters is that the relationships between receptive fields change depending on the retina, so there is no way for them to be standard and fixed. Instead, the kernel coefficients are unique to the location of each image vector element, as sketched below.
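The sketch assumes the per-node neighbour indices and coefficients have been pre-computed and flattened into arrays; the names and layout are illustrative, not those of [1]:

    /* Sketch of applying a cortical filter to an image vector: each output
       response is a weighted sum over a small, location-specific set of
       neighbouring responses. Neighbours of node n occupy the index range
       [nbr_start[n], nbr_start[n + 1]). */
    void apply_cortical_filter(const float *in, float *out, int nodes,
                               const int *nbr_start, const int *nbr_index,
                               const float *nbr_weight)
    {
        for (int n = 0; n < nodes; n++) {
            float acc = 0.0f;
            for (int k = nbr_start[n]; k < nbr_start[n + 1]; k++)
                acc += nbr_weight[k] * in[nbr_index[k]];
            out[n] = acc;   /* the filtered response of node n */
        }
    }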

Figure 3-10: An illustration of what cortical filters look like. Image taken from [1].

Applying a cortical filter to an image vector creates a new image vector that contains its response and thus all the resulting changes.

Furthermore, many interesting approaches using foveal sampling have appeared in recent years, showing that this research field is quite active. To mention a few, Brugnot [7] created an object recognition system that compares the input image to a database of images, a scale-space recognition algorithm was developed by Siebert [26], and Boyling [5] created a binocular robot head for building 3D models.

3.2.7.2 Pyramids

Pyramids are widely used in Image Processing for numerous reasons. A pyramid stores an image at multiple resolutions, and there are two methods to create one: Gaussian or Laplacian of Gaussian. Because they are so vital, they could not be absent from biologically motivated vision.

The way they are created is straightforward. Cortical filters are applied sequentially, each one computing a new response from the previous one. Because of the way the retina works,


the dimensionality of the data decreases. It has been reported that the data are reduced between two layers to N / 4, where N is the number of points in the current retina tessellation. Again, pre-computation of the kernel coefficients is required.

Burt [8] first introduced the foveated pyramid, in which the visual information is reduced from fine to coarse layers while the fovea remains the same in all layers. The foveated pyramid achieves space-variant extraction of visual information.

3.3 On Software

3.3.1 The Matlab Designed System

Balasuriya [1] (see also [4]) first created a full space-variant vision system that was used to extract features from images and to perform saccadic targeting. The original code was written in Matlab, comprising 27 functions. Below is an overall view of his architecture:

Figure 3-11: The Matlab system architecture. Image taken from [11].

Matlab is a powerful and quite popular scientific tool that is used by a large number of scientists around the world and is widely accepted. Programming in Matlab is done at a very high level, which is both good and bad, depending on the needs. It is good because it is very easy: there are thousands of built-in functions, which allow the programmer to write less code, not worry about low-level details and focus on the main problem. On the other hand, it is bad for the exact same reason. Because the programmer does not know what is happening in the background, it is impossible to optimize the code, very often resulting in slow and clumsy programs. Moreover, Matlab does not create object code. Matlab can communicate with other programs written in C, FORTRAN or even Java, but compatibility issues sometimes arise.

Creating an artificial retina is as easy as calling the function:

W = ssnn(1024, 0.07, 1020);

where the arguments are the number of nodes (1,024), the size of the foveal region as a fraction (7%) and the number of iterations (1,020).


Figure 3-12: The retina generated by the previous command. It is noticeable that 1,020 iterations are fairly few, compared to figure 3-4: the density is not well distributed.

After sorting the nodes according to eccentricity, he would use the function:

so_pyramid(W, inter_levels, min_rf, s, s_log);

in order to create a pyramid, based on the retina tessellation W, the number of layers per octave inter_levels, the minimum distance between two receptive fields min_rf, the σ value for the Gaussian (s) and the σ value for the Laplacian of Gaussian (s_log).

Finally, after finding the receptive fields (with neighbour_locations and neighbour_indices functions) he would be able to sample an image and project the sampled image back to its original form as follows:

R = rc_sample_wilsonretina_xy(I, ParaG{1, 1}, ParaG{1, 2}, x, y);
I = voronoi_display_responces(I, VG, ParaG, 2, x, y);

where ParaG holds the locations of the receptive fields and x, y are the co-ordinates of the retina's point of fixation. Projection is done using Voronoi regions. The function rc_reverse_wilsonretina_xy provides the Gaussian reconstruction for a retina and rc_sample_wilsonlayer_xy samples a pyramid layer. Displaying intermediate pyramid layers with Voronoi regions is not supported.


Figure 3-13: The result of reconstructing a sampled image with the above commands. The low quality, due to the small retina used, is evident; ideally the image should look like the one in figure 3-6.

The rest of the code is beyond the scope of this project. To give an idea of how demanding the current code is, the number of nodes is limited to 8,192 due to memory problems, and creating a tessellation of that size would take around a week of processing time. Also, given the retina above, the program needs an average of 8 seconds to reconstruct an image. Figure 3-13 also shows the importance of having many nodes in a retina. Baraniuk and Kelly in [36] reported that 200,000 measurements are needed to construct a 5-megapixel picture using a single-pixel camera. Their research differs from the current one, but it gives an idea of the number of sampling points required.

Moreover, in order to reduce computation time when calculating the pyramids, Balasuriya [1] applies the cortical filters to the responses of the previously created layer, which results in quite aliased zoomed images.

3.3.2 The Java Designed System

Fegan [11] re-implemented some parts of Balasuriya's vision system in Java. His initial objective was to improve the overall system's performance and scalability without changing the design. His approach gave an object-oriented view of the problem. The proposed architecture can be seen below:


Figure 3-14: The Java system architecture. Image taken from [11].

Java, as a fully fledged programming language, may be suitable for scalability, but in terms of performance it is still not enough. To increase portability, the Java compiler translates the code into an object file that is only executable by a Java Virtual Machine (JVM). The JVM then translates, at run time, the object file into something understandable by the specific computer and platform, regardless of its architecture. Fegan achieved an average 40% performance gain over Balasuriya when creating an artificial retina tessellation, and his system needs about 2 seconds to sample an image with an 8,192-node retina, which is still not enough for real-time processing. Smaller tessellations can be used (a 512-node one needs one second), but quality decreases heavily. His system claims to support bigger tessellations, though this has not been tested.

Furthermore, his results when sampling and reconstructing an image are incorrect compared to those of Balasuriya, for unknown reasons. Probably he obtained wrong responses because of misplaced sampling points. As he describes in his dissertation, he copied the code exactly from Balasuriya. This seems too naïve to work: Matlab code does not port to Java or C without implications. In the current implementation, the author performed numerous tests in order to obtain correct results and concluded that changes to the original code were needed for it to be functional. These changes are documented in the code.

3.4 On GPU Programming

Programming a GPU is an emerging field in Computer Science that attracts more and more programmers every day, mainly because of its advantages. However, to take advantage of it, a suitable programming language is required.

The beginning was made in [21], where a high-level programming language was introduced for the first time that allowed programmers to edit pixel values or perform operations on them; processing each pixel in turn would transform a whole image area. In the late 90s, as the games industry advanced, new techniques were sought to offload work from the CPU, since the latest 3D graphics required more processing power than a single CPU could offer. This was the crucial point at which developers got the chance to program the GPU for


controlling processes such as polygon transformation, shading and rasterization, much faster than before. Image processing algorithms, because of their nature, began to be run on GPUs. The first generations offered limited forms of programmability and were assembly-like and vendor dependent, but as the market matured, higher-level languages were introduced, such as Cg from Nvidia, HLSL by Microsoft and GLSL by the OpenGL Architecture Review Board. As the demand for better graphics kept increasing, we have reached the point today where GPUs have hundreds of shader processors running at enormous speeds, very long pipelines and memory comparable to the whole computer system's memory. It was then realised that such extreme processing power could be used to accelerate common computational tasks. This is how the idea of the General Purpose GPU (GPGPU) was introduced in late 2001, which made GPUs an attractive target platform. GPUs can help speed up general-purpose computations by executing user-defined code. Each shader processor can be seen as a stream processor which takes an input from a stream, applies a computational kernel to it and outputs the result to another stream; many stream processors mean parallel computing. The only GPGPU language that is fully supported and kept up to date so far is Nvidia's CUDA-C. ATI-AMD plans to release its own language, too (named "Close To Metal"); so far, the tools that ATI-AMD offers are fairly low level and of limited use. Other efforts include Sh by the University of Waterloo (Ontario, Canada) and Brook by Stanford University (California, USA), which are somewhat more experimental. However, since GPU architecture differs from that of a CPU, these programming languages are expected to have some limitations compared to traditional programming (no support for memory writes or branch control, a limited instruction set, or limits on the size of the code, to mention a few).

Several other systems implemented on GPUs inspired the writer to do something similar. Labatut [16] achieved an overall 200% performance increase on an algorithm for 3D reconstruction from a stereo pair of images, enabling the authors to perform the task within a few minutes. Sinha [28] achieved an implementation of a feature tracking system (based on the KLT tracking algorithm and the SIFT algorithm for feature extraction) that is 10 to 15 times faster than a normal CPU one, which enabled its use in a real-time environment with video input at 1,024 x 768 resolution. Fung [13] created a parallel GPU Computer Vision system API for image processing using Cg and OpenGL. The library includes algorithms for corner, feature, skin tone and colour object tracking. His work has been cited many times and is considered a reference tool for this kind of work. However, his approach is not a GPGPU one, as the data are transformed into textures to be processed by the GPU, and it has limited efficiency.

3.4.1 A New Programming Language (CUDA-C)

CUDA (Compute Unified Device Architecture) is essentially a Software Development Kit (SDK) based on the C programming language that contains all the necessary libraries to manipulate an Nvidia GPU.

CUDA is composed of several layers: a hardware driver, an Application Programming Interface (API) and a set of higher-level mathematical libraries (CUFFT and CUBLAS). The CUBLAS (CUDA Basic Linear Algebra Subprograms) and CUFFT (CUDA Fast Fourier Transform) libraries can be used without interacting with the rest of the CUDA environment; they are integrated for efficiency. CUDA has several advantages over the previous attempts at programming a GPU, which were described in the previous section:


• Because it is based on C, it inherits all of C's advantages. It can be seen as an extension to the C programming language, so no prior knowledge of graphics is needed. Furthermore, it is possible to use C++ and FORTRAN code under certain circumstances.

• It provides general, easy-to-use, dynamically allocated DRAM memory addressing that enables reads and writes to memory through the 'gather' and 'scatter' operations.

• It achieves high parallelism thanks to the graphics card architecture.

• It contains a fast on-chip shared memory (16 KB in size) that can be shared between different threads and achieves high speeds for small-data memory operations, so processing power is not wasted waiting on memory operations.

• It supports common data types such as 32-bit integers as well as bitwise operations. Newer graphics cards support doubles as well.

• It provides a library of useful algebra and physics functions, such as calculating an FFT or transposing a matrix.

• CUDA comes with excellent documentation and many tools and examples.

CUDA, however, has some limitations as well:

• It is compatible only with Nvidia graphics cards. Moreover, differences (data types or functions supported) exist even between graphics cards of the same company.

• There is no support for recursive functions or static variables.

• Threads must be run in groups of (at least) 32.

• Texture rendering is not supported, and interoperability with DirectX and OpenGL is limited.

3.4.2 Architecture

Each CUDA-enabled graphics card is a set of Single Instruction, Multiple Data (SIMD) multiprocessors. This means that at any given time the same instruction is executed on all multiprocessors, but on different data. Parallelization is increased further because each multiprocessor has a number of stream processors which can execute instructions in parallel. Each multiprocessor also contains a set of 32-bit registers, a shared memory (small in size – only 16 KB – but almost as fast as a register, with 2 cycles of latency), as well as a read-only constant cache and a texture cache; the constant and texture memories actually reside in global memory and are shared by all the processors. Global memory is the main DDR memory of the card: large in size but slow (300-600 cycles per access). Texture and constant memory generally perform better than global memory because they are cached. Texture memory is good when 2D spatial locality exists, and constant memory when many multiprocessors access the same values.

CUDA defines two types of physical device. The Host is the CPU, and the Compute Device, or simply Device, is the CUDA-enabled graphics card. Both host and device have their own DRAM, referred to as host memory and device memory respectively. Code to be executed on a GPU must be organised as a set of threads that run in parallel; the definition of a thread does not change in the CUDA specification. So, given a problem that requires the same computation on lots of different data, instead


of sequentially feeding different sets of data to one function on the host, all the data are processed at the same time by running that function many times (in the form of several threads) on the device. Each thread is independent and has its own thread ID, but threads can share data and communicate with each other.

Synchronisation barrier points are also supported. The function is called a kernel and is compiled just in time and downloaded from the host to the device. This operation, along with any data transfers to and from the device, is, according to Nvidia, potentially the bottleneck of every CUDA application. A set of threads that executes the kernel is called a thread block and is identified by a block ID. A thread grid contains several thread blocks, which allows the developer to achieve further parallelism by executing different kernels at the same time. Two threads in the same thread grid but in different thread blocks cannot communicate.

Figure 3-15: Hardware model. Image taken from [18].


Figure 3-16: Thread batching model. Image taken from [18].

A thread executing on the device can read from and write to its local registers and local memory, as well as to its block's shared memory and its grid's global memory. A thread can also read – but only read – from two other special memory spaces, the grid's constant memory and the grid's texture memory, which are cached as stated before.

Figure 3-17: Memory model. Image taken from [18].


A grid of thread blocks is scheduled for execution in such a way that each block is processed by only one multiprocessor, in batches – one after the other. The blocks executed in one batch at a given time are known as active. Each active block is split into groups of threads called warps; the number of threads a warp contains is the warp size.

Controlling the host and the device is done programmatically via CUDA. It also offers a common runtime component, which includes all the necessary functions, constants and variable types, too numerous to be listed here.

3.4.3 Programming Model

Generally, CUDA introduces into C some function type qualifiers that state where each function will be executed. __device__ denotes that the function executes on the graphics card, a __host__ function executes on the CPU, and __global__ means that the function is a kernel (executed on the graphics card but invoked from the CPU). Also, apart from the initialisation, error, timing, memory allocation, message transfer, type conversion, synchronisation and shutdown functions that exist, the most innovative feature that CUDA introduces into C is the "<<< >>>" operator, which invokes the execution of a device function on the device. Its syntax is as follows:

function <<< grid_size, block_size, block_memory >>> (parameters)

where function is the name of the function to be executed, 'grid_size' is the dimension and size of the grid (i.e. how many blocks it contains), 'block_size' is the dimension and size of each block (i.e. how many threads are in each block) and 'block_memory' (optional) is the number of bytes of shared memory that each block needs for its execution. 'parameters' are the arguments (if any) that the function needs. All the mentioned functions can be accessed even more easily through a third library, again developed by Nvidia, called "cutil". cutil is not part of CUDA, but it helps keep the code tidier by hiding some low-level control. A minimal example of the whole pattern is sketched below.
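The kernel and all names in the sketch are hypothetical, and the program is not part of the HART library; it allocates device memory, copies data to it, launches a trivial kernel with the "<<< >>>" operator and copies the result back:

    #include <cuda_runtime.h>

    /* A trivial kernel: each thread scales one element of an array. */
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)               /* guard against a partial last block */
            data[i] *= factor;
    }

    int main(void)
    {
        const int n = 1024;
        float h_data[1024];
        for (int i = 0; i < n; i++) h_data[i] = (float)i;

        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

        scale<<<4, 256>>>(d_data, 0.5f, n);   /* 4 blocks of 256 threads */

        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_data);
        return 0;
    }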

Similar qualifiers exist for variables, depending on where they reside: __shared__ for variables that live in shared memory, __constant__ for those in constant memory and __device__ (the default) for those in global memory. CUDA does not allow direct manipulation of the registers, but it is possible to control how many registers a kernel's execution uses, with practices covered in chapter 4 and the code optimization procedure. Every other aspect of the programming model remains unchanged, supporting the usual data types, data structures, mathematical functions (with different names, though) and control flow instructions. Along with some built-in data types (mostly for vectors), built-in variables (identifying threads and supporting control flow) and various casting and timing functions, it constitutes a fully functional SDK.

Generally speaking, the program structure is not so different from conventional programming: it follows the same general pattern of allocating memory, performing operations, copying results back and de-allocating memory.

CUDA's compiler, nvcc, works fine with both Linux and Windows. CUDA source files (.cu), after pre-processing, are compiled into C files and then processed by the default C compiler. Files that contain device code are handled differently (translated into objects with a '.gpu' extension) from those that contain host code (translated into objects with a '.o' extension); both objects are combined into the same file in the end. Compiling a CUDA program is possible without having an


appropriate graphics card: Nvidia provides an emulator (Device Emulation Mode) in which each thread is executed sequentially on the CPU.

3.4.4 Memory Issues

As can easily be understood, Nvidia's memory model is quite complex, offering many types of memory with different performance characteristics. Excluding the cached types (constant and texture memory), there are some known issues with shared memory and global memory that a programmer must know about in order to achieve good performance with the least effort.

3.4.4.1 Bank Conflicts

As has already been stated, shared memory achieves very high performance because it is located on the chip. Shared memory is organized into equally sized banks. When two or more threads try to access the same bank of shared memory (they may be accessing different addresses that fall in the same bank), the GPU is forced to serialize the requests, reducing performance. So, it is vital to understand how shared memory is organized, and the best way to do so is with the help of an example.

For currently existing graphics cards the maximum number of banks supported is fixed at 16, and the warp size is 32. Suppose we have declared a char array of size 32 in shared memory; since the size of a word is 32 bits, the array is split across 8 sequential banks, each holding 4 characters (bytes). If the program logic accesses the data sequentially, like:

data = shared[start_address + thread_id];

it will result in bank conflicts, as shared[0], shared[1], shared[2] and shared[3] lie in the same bank. A bank conflict will not happen if the same data are accessed in the following way:

data = shared[start_address + 4 * thread_id];

The easiest way to avoid bank conflicts is to always declare an odd size (stride) for shared memory arrays, so that successive rows fall into different banks. The classic padding trick below illustrates this.
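The sketch assumes 16 banks (as on the hardware of this period) and square images whose side is a multiple of 16; the kernel is illustrative and not part of HART:

    /* A 16x16 tile padded to 17 columns: after the transpose, column-wise
       reads fall into distinct banks, so no bank conflicts occur. */
    __global__ void transpose16(float *out, const float *in, int width)
    {
        __shared__ float tile[16][17];   /* 17, not 16 -- the padding */

        int x = blockIdx.x * 16 + threadIdx.x;
        int y = blockIdx.y * 16 + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

        __syncthreads();                 /* whole tile must be loaded */

        x = blockIdx.y * 16 + threadIdx.x;
        y = blockIdx.x * 16 + threadIdx.y;
        out[y * width + x] = tile[threadIdx.x][threadIdx.y];
    }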

3.4.4.2 Memory Coalescing

The advantage of global memory is its size: it starts at 256 MB and reaches up to 1 GB, depending on the graphics card model. Its performance, however, is poor. There are certain rules proposed by Nvidia that can improve performance; they concern correct data organization during load and write operations from/to global memory.

Firstly, the data must be loaded and written in a single load/write instruction, which is why they must be properly aligned. Any given data_type must be such that sizeof(data_type) is equal to 4, 8 or 16, and any variable of type data_type must have an address that is a multiple of sizeof(data_type). Structures have to be aligned manually. CUDA devices are able to handle 32-bit, 64-bit and 128-bit words.


Secondly, the data residing in global memory should be arranged so that threads requiring data from different addresses can acquire them from a single contiguous and aligned memory region.
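For example, a structure can be forced to one of these sizes with the __align__ specifier that CUDA-C provides, so that one thread's load of a whole structure compiles to a single 128-bit instruction (the type below is illustrative):

    /* 16-byte alignment: sizeof(rgba) == 16 and every instance sits at
       an address that is a multiple of 16, as the coalescing rules require. */
    struct __align__(16) rgba
    {
        float r, g, b, a;
    };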

3.5 Discussion & Conclusion

In this chapter all the necessary background, both theoretical (research) and practical (hardware and software tools), was presented. The reader should have a fairly clear idea of the problem in its entirety.

The main point of this chapter was to highlight the importance of Balasuriya's work in the field. By solving the problems of previous attempts, his work is worth improving upon. His system is a full application that implements many Computer Vision tasks, resulting in a complete recognition system – which is actually the future of the Computer Vision research field – and it sets a basis for further development. The principal drawback of his implementation is its performance, so accelerating it will make it even better.

Secondly, the reader should appreciate the processing power that the latest graphics cards offer and how it is possible to benefit from it. Cheap and powerful hardware is very important and opens new horizons in Computing Science. The parallelism they offer is another advantage, as they provide a brain-like operational model, and developers are already familiar with such programming techniques. The details provided about CUDA programming are very important for a programmer who wants to develop an application: it is useful to know the small details that increase performance without explicit optimization, just through careful design. Furthermore, one must not forget that an explosion in processing power is about to happen as these parallel technologies are unleashed. Researchers are coming closer to a long-awaited dream: to see all the algorithms in Computer Vision perform in real time. This would certainly broaden horizons and open new research areas.

Having understood the problem and researched all the previous work and approaches, the author will present his way of tackling the problem in the following chapter.


Chapter 4

Implementation

In this chapter the design of the system is presented, along with an explanation of why it is better than other possible approaches. Implementation details are given as well, describing how the author solved the problem step by step, what tools and algorithms he used and what complications he met during development. Finally, the outcome of the overall work and the results are presented.

4.1 Functional Specification

The programming outcome of the present work is a library which performs the basic operations of sampling an image and reconstructing an image vector, as described in previous chapters. Two methods must be available for image reconstruction, and a comparison between them will be made: one involves Voronoi regions and the other Gaussian kernels. The design must be simple, careful and generic, so that it works with any kind of input data without problems. Part of this constraint, from the retina point of view, is the need to construct pyramids on the GPU, which usually have to deal with more demanding data as the kernels grow; from the image point of view, it is the need to process real-time video input. A general overview of the structure can be seen in the following diagram (figure 4-1):

Figure 4-1: The main architecture and data flow of a biologically-motivated application.

The 'Edit' step comprises all the image processing that could be applied, though none has been implemented yet within the present project. The 'Sample' and 'Reconstruct' steps are applicable to the pyramids as well.

Last but not least, basic I/O functions and good interaction with the user (in terms of visual display capabilities) must be provided. One must bear in mind that the library is designed for programmers, so the design should be appropriate in terms of ease of use.


4.2 Non-Functional Specification

The library was developed under a 32-bit Windows environment, but it was compiled and tested on the 64-bit version as well. Since CUDA is available for Linux, it will also be portable to Linux without many changes. The primary objective is to give the user an idea of biologically motivated vision. Other functions may be added later to broaden the features of the library, so it has to be extendable in terms of the data structures used. Ideally, the library will be applicable to any kind of image data, both color and grayscale. Finally, the outcome of the library's functions (either the sampled or the reconstructed version of an image) must be editable by any other software or library that supports similar functionality (including the Matlab vision system); in our case, images and image vectors must be in a known standard format, and otherwise converters must be provided. The time within which all the functions must complete is critical as well.

4.3 Fields to Focus On

• Retina Storage.

Managing a retina created in Matlab code and used in C is an issue for the present project. The various retina models that were created are quite large (varying from 913 KB up to 175 MB in some cases), especially when large kernels are present; for a more detailed view of the sizes, see figure 4-2. Judging from the results presented in [1], the best retina model seems to be the one with a 20% fovea region after 20,000 iterations, but this depends on the trade-off between image quality, retina size and execution speed. The problem is choosing how a retina will be saved and loaded. After sorting the retina nodes according to eccentricity, the retina is saved sequentially in a binary file and loaded into a floating-point array; so, I/O functions must be provided (a sketch of such a loader follows below). Modern systems and high-level programming languages have no problems when such a big address space is needed for an object, and after all, speed is the number one priority in this project. We must also bear in mind that a retina contains not only the sampling points but also information about the receptive fields. Using binary files instead of ASCII may add complexity, but it dramatically increases performance compared to other implementations, as has been noted (for example, see [11]).
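The field layout in the sketch is illustrative only; the real .RET layout, defined by the HART I/O module, also carries the kernel coefficients described later:

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: read a node count followed by (x, y, kernel size) triples
       from a sequentially written binary file into a float array. */
    float *load_retina(const char *path, unsigned int *count)
    {
        FILE *f = fopen(path, "rb");
        if (f == NULL)
            return NULL;

        fread(count, sizeof(unsigned int), 1, f);    /* number of nodes */
        float *nodes = (float *)malloc(*count * 3 * sizeof(float));
        fread(nodes, sizeof(float), *count * 3, f);  /* node records    */
        fclose(f);
        return nodes;
    }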


File Sizes for Some Generated Retinas

Retina Resolution (no. of nodes)    Size (KB)
512                                  2,029
2,048                                7,816
4,096                               10,990
8,192                               22,294
16,384                              54,009
layer 512                           25,204
layer 2,048                         30,096
logz 2,300                          14,971
logz 4,096                          44,566

Figure 4-2: Sample file sizes for some of the generated retinas (including pyramid layers).

• Overlapping Areas.

The simplest model for a retina would be one whose receptive fields do not overlap with each other: the response calculations are then data-independent and easily parallelized, since there are no shared data. However, quality rises dramatically when overlapping receptive fields exist, because the response is more accurate; the overlap satisfies the Nyquist criterion and achieves continuity of the information. This produces a more blurred image, but has the advantage of immunity to noise. So, it is vital to decide how the data needed for the calculations are transferred to the GPU. There will certainly be cases where the image is too large to store in the shared memory of the GPU. This might be a bottleneck for the designed system, since the processing of some threads might be halted until the data arrive. In order to reduce the data transfers between the host and the device, as a first design method it is assumed that all data reside in global memory. This may not be optimal, but overall it is still faster than any other implementation (compared to [1] and [11]). It is useful to note here that asynchronous data transfers are not currently supported by CUDA.

• Distributing Workload.

With a large number of processors that can execute an even larger number of threads at the same time, careful program design is required to take advantage of them. This is also part of the optimization procedure. Knowing that a device has a certain amount of memory and certain stream processors devoted to each multiprocessor, the job is to find the combination of threads per block and number of blocks that minimizes execution time. This can only be done through testing and empirical evaluation.

• CUDA-related issues.

As described in previous paragraphs, Nvidia's graphics card architecture is complex, so many factors play an important role in the performance of a program. As with any CUDA program, here as well we have to consider how to avoid bank conflicts and non-coalesced memory accesses. In addition, as Nvidia notes [18], any CUDA programmer must strive to reduce the data that are


transferred between the host and the device, even if more computations are needed to reproduce them.

4.4 Design

4.4.1 Sampling

In order to sample an image, a straightforward solution was chosen that makes the process kernel-independent. As can be seen in figure 4-3, a single thread is used to convolve a kernel with an area of the image, and the result is stored directly into the image vector at the appropriate index. Devoting a thread to each node of the retina is probably the fastest way to compute the result. Moreover, it can easily support any kind of kernel, irrespective of its size.

Figure 4-3: Sampling design of the HART library.

Initially one may think that there are limitations on how many threads can be launched and executed at the same time, but the truth is that there is no problem at all, at least within the terms of the current project. A CUDA device can easily support hundreds of thousands of threads and, as explained in a later section, our library does not use more than this. A sketch of the one-thread-per-node idea follows.
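The sketch makes simplifying assumptions (grayscale input, kernels flattened into a single array with per-node offsets, every receptive field fully inside the image) and is not the actual doSample() code of the library:

    /* One thread per retina node: convolve the node's kernel with its
       support region and write one response into the image vector. */
    __global__ void sample_sketch(const unsigned char *image, int width,
                                  const float *xs, const float *ys,
                                  const int *strides, const int *koffset,
                                  const float *kernels, float *responses,
                                  int nodes)
    {
        int n = blockIdx.x * blockDim.x + threadIdx.x;
        if (n >= nodes)
            return;

        int stride = strides[n];            /* kernel is stride x stride */
        int x0 = (int)xs[n] - stride / 2;
        int y0 = (int)ys[n] - stride / 2;
        const float *k = kernels + koffset[n];

        float acc = 0.0f;
        for (int j = 0; j < stride; j++)      /* convolve the support    */
            for (int i = 0; i < stride; i++)  /* region with the kernel  */
                acc += k[j * stride + i]
                     * image[(y0 + j) * width + (x0 + i)];

        responses[n] = acc;                   /* one node, one response  */
    }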

Siebert [25] described a method where, apart from the input image, all the kernels exist in a second image called the "kernel coefficient image". For simplicity, it is assumed that there are no overlapping kernel areas; if there are, several kernel coefficient images are used, each containing only non-overlapping areas, and each simply adds its information to the previous result. Multiplying the two images produces a product image vector, which contains the receptive fields' responses for the specific retina. Balasuriya [1] did the same, using built-in Matlab functions. For the current problem, there is no need to add the complexity of calculating the coefficient images; moreover, this method is not applicable to our retina data, which are stored in a linked-list format.


4.4.2 Reconstruction

First of all, it has to be mentioned that two algorithms were implemented for the back projection of an image vector. Only the algorithm differs, though – the design (as seen in figure 4-4) remains the same. To achieve this, a well-known and long-tested practice in Image Processing has been adapted: the output image is split into many tiles and each thread is assigned to a tile. Each thread computes its tile pixel by pixel, taking into consideration the previously generated responses (the image vector), the retina's kernel coefficients and any overlapping fields. Calculations for some pixels may not be needed, because those pixels may lie outside the fovea's field of view or the retina may be too small to account for them. A diagram describing the notion can be seen in figure 4-4.

Figure 4-4: Reconstruction design of the HART library.

The advantage of this practice is that it can deal efficiently with large images, which is a need in our case. The user can change the size of a tile and decide which is optimal for his needs. A direct solution, where each thread maps each vector value to all the pixels residing under a kernel (described by Siebert [25], implemented in [1]), is not applicable to our problem. Even though it is more efficient, it results in race conditions between threads that try to access the same areas (due to overlapping receptive fields); this would decrease performance and would generate artifacts, as some values may be written twice and others not at all. Such artifacts were generated by the author while trying to find an optimal solution for the problem and can be seen in figure 4-5. Currently, CUDA does not support controlled access to global memory (it does support it on shared memory, via the __syncthreads() function). It is important to understand that it is not possible to update the value of a pixel more than once at any given time. Moreover, in this way it is not possible to apply any normalization techniques, since information from multiple receptive fields is needed at the same time.


Figure 4-5: Generated artifacts due to race conditions among different threads.

Another possible solution for the problem would be CUDA's built-in atomic functions. These functions perform simple memory load and store operations (as well as some more complicated ones) that complete in one clock cycle, allowing exclusive access to certain memory addresses. However, they currently support only a limited set of data types (int32 and float32), which is why they cannot be used in our case (HART uses single bytes, to save memory space). Nvidia has announced plans for CUDA to support atomic functions on any data type in the future.

Results from the reconstruction methods can be seen in section 4.7.1.

4.4.3 Pyramids

It is really important for any vision system to support image pyramids, as they are needed by many algorithms in Image Processing, so the present library could not omit them. Balasuriya [1] used cortical filters to create a pyramid, taking as input the responses of the immediately previous, finer level. Without underestimating the importance of cortical filters, this method is potentially inaccurate, since each layer depends heavily on previous computations (which, in a way, are simply "estimations"); it is also inefficient, because new functions and new data structures are required to manage the layers. This leads to a complete division between the sampled image and all its pyramids. Balasuriya is believed to have followed this method in order to reduce processing time. In the present project, since execution times are meant to be small, the author has tried to remove the distinction between the original retina and the pyramid layers. This is done by creating retinas in which the minimum distance between receptive fields (minrf) grows as we move to coarser layers. The main idea (as also seen in figure 4-6) is to handle a pyramid layer like any other retina, using the same data types and the same functions for sampling/reconstruction.


Figure 4-6: Pyramid creation design of the HART library.

Without affecting execution times, the pyramid layers sample the image directly, without depending on previous results. This method produces more trustworthy results in a more efficient manner, because each pyramid sampling takes place directly on the original image data. Moreover, it is now possible to reconstruct any pyramid layer using only its own responses, without information from coarser layers; in the original Matlab vision system it is not possible to reconstruct pyramid layers using Voronoi regions. The only disadvantage is that the newly generated retinas use larger kernels (having fewer nodes to sample the same area), which results in larger files and memory consumption. However, as graphics card technology advances quickly, this will not remain a limitation for long. In figure 4-7 below, the reader can see the minimum receptive field distances used in this project to generate the pyramid layers from an 8,192-node finest-level retina.

Receptive Fields' Minimum Distance for Pyramid Layers' Creation

Layer Size (no. of nodes)    min rf (pixels)
4,096                        2.0
2,048                        3.0
1,024                        4.0
512                          6.5
256                          9.0

Figure 4-7: Receptive fields' minimum distance for pyramid layers' creation, for a finest-level retina of 8,192-node resolution.


Results from the pyramid layer reconstruction can be seen in figures 4-15 (normalized Gaussian) and 4-16 (Voronoi).

4.5 Basic Algorithms

As has already been described, the bottleneck of a biologically motivated system is the computational cost of sampling and reconstructing the image. Taking advantage of the parallelism of recent GPUs is not enough if the algorithm itself is not efficient.

The algorithm used in this project to sample an image and generate an image vector is the one described in section 3.2.4.1.

The algorithm used in this project to reconstruct an image from an image vector with the Gaussian method is the one described in section 3.2.4.2, but with a small change in order to normalize the values. Instead of a single buffer accumulating each pixel's value by multiplying each response with the respective kernel coefficient, two buffers are used, with the second one accumulating only the kernel coefficients. In the end, the first buffer is divided by the second and the result is written directly to the pixel. Results showing the importance of normalization can be seen in figure 4-12. Given an image I, its retinal responses R and the kernels G of a retina with m nodes, the formula describing the above is:

$$ I(\mathrm{round}(X_i), \mathrm{round}(Y_i)) \leftarrow \frac{\sum_{j=1}^{m} R(j) \times G_j(X_i, Y_i, \sigma_j)}{\sum_{j=1}^{m} G_j(X_i, Y_i, \sigma_j)}, \quad \forall i \in \text{the support region of any receptive field} $$
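The following C sketch computes one pixel with the two-buffer normalization just described; gauss() and the argument arrays are hypothetical stand-ins for the retina's pre-computed kernel coefficients:

    #include <math.h>

    static float gauss(float dx, float dy, float sigma)
    {
        return expf(-(dx * dx + dy * dy) / (2.0f * sigma * sigma));
    }

    /* Normalised reconstruction of one pixel at (px, py): buffer 1
       accumulates weighted responses, buffer 2 the coefficients. */
    unsigned char reconstruct_pixel(float px, float py,
                                    const float *rx, const float *ry,
                                    const float *sigma,
                                    const float *response, int m)
    {
        float num = 0.0f, den = 0.0f;
        for (int j = 0; j < m; j++) {             /* fields covering p  */
            float g = gauss(px - rx[j], py - ry[j], sigma[j]);
            num += response[j] * g;               /* weighted responses */
            den += g;                             /* coefficient sum    */
        }
        if (den == 0.0f)          /* pixel outside every support region */
            return 0;
        return (unsigned char)(num / den + 0.5f);
    }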

For the Voronoi region based reconstruction, the same method as in [1] was used. Voronoi diagrams are among the most important constructions in Geometry and are widely used in many sciences. According to [37], a Voronoi region is "a polygon whose interior consists of all points in the plane which are closer to a particular lattice point than to any other". So, taking each node of the retina as the centre of a polygon, a region is created around that point containing the area closest to it. In our case, for each pixel (given its coordinates) the response of the retina node at minimum distance is taken as the result, and the same operation is applied to every pixel; a sketch is given below. An example of the Voronoi reconstruction, using the same data as in figure 4-12, can be seen in figure 4-13.
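The sketch uses the same hypothetical naming as the Gaussian one above and is illustrative only, not HART's doReconstructV():

    #include <float.h>

    /* Voronoi-style back projection of one pixel: the response of the
       nearest retina node wins. */
    float voronoi_pixel(float px, float py,
                        const float *rx, const float *ry,
                        const float *response, int m)
    {
        float best = FLT_MAX, out = 0.0f;
        for (int j = 0; j < m; j++) {
            float dx = px - rx[j], dy = py - ry[j];
            float d = dx * dx + dy * dy;   /* squared distance suffices */
            if (d < best) {
                best = d;
                out  = response[j];
            }
        }
        return out;
    }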

Of course, when RGB images exist, all the above pixel-wise operations must take place three times – one for each color channel.

4.6 Implementation Notes

In this section all the necessary details are given for the reader to understand how the problem was tackled, as well as some specialized CUDA programming techniques that may help future implementations become more efficient.


4.6.1 On the HART Library

The output software product of this project is a Dynamic-Link Library (.dll) called HART (Hardware Artificial Retina Transformation). DLL files allow a programmer to call them directly from any programming platform (including Matlab); they are well researched and work without further complications, as long as the same data types are used. As for the data types, everything is programmed using pointers. Some new data structures were developed to improve functionality without affecting compatibility. Nowadays, almost every known imaging library uses short integers (2 bytes) to store pixels. In this project single bytes were used instead, in order to reduce memory space and execution time, given the limited bandwidth between the CPU and the GPU; the only data type in C that has the size of a byte is char (value range 0-255). The main prerequisite for the new data structures was to save as much memory space as possible. An image vector is actually an array of floating-point values with the same size as the number of receptive fields provided (named imvector); the rest of its properties are copied directly from the respective image, along with its size. The retina is organized in a linked-list format by storing the (x, y) sampling point on the image, the size of the kernel and a pointer to the kernel itself. The retina itself is an array of that structure (named sampoint), describing the various receptive fields. Retinas are output sequentially from the Matlab-based vision system, using binary files. Below the reader can see how the new data types are defined (figure 4-8):

    struct image
    {
        char* filename;
        short int channels;
        unsigned int width;
        unsigned int height;
        pixel* pixels;
    };

    struct sampoint
    {
        float x;
        float y;
        int stride;
        float* kernel;
    };

    struct imvector
    {
        char* filename;
        unsigned int size;
        unsigned int width;
        unsigned int height;
        short int channels;
        float* vector;
    };

Figure 4-8: Data structures of the HART library.

Coming to the functions that form the HART library, it should be mentioned that they are divided into modules. This division is only conceptual, as C is not an object-oriented programming language and does not support entities or packages. Below, the reader can see all the functions, organized conceptually for better understanding (figure 4-9):


Figure 4-9: General overview of the HART library.

All the functions are divided into five categories. The Main Core comprises two functions.

The first function, Sample(), samples an image given a retina model and its receptive fields, returning a pointer to an imvector variable. The same function is used for the creation of pyramids. Currently, for static images only the PGM (Portable GrayMap, grayscale images) and PPM (Portable PixMap, color images) formats are supported. This can easily be changed, as there are currently many free converters that allow most formats to be read.

The second function, Reconstruct(), takes a sampled image (in the form of an imvector) and the retina model, reconstructs the original image and returns a pointer to it. New functions can be created to allow further manipulation of an imvector variable. Reconstructed images, or any other images, can be viewed with the graphics module and the function Display_Image(), without using any third-party libraries. Display_Image() is based on OpenGL which, in the author's own opinion, is easier to use than Microsoft's DirectX, as it offers a simple and straightforward API. Furthermore, it is platform independent and very well documented, and it is supported by all the major graphics card manufacturers. After the image data are uncompressed they are always the same, no matter the format. In order to use OpenGL, a well-known and tested wrapper was used: GLUT is recognized as the official library for manipulating OpenGL-related functions in an event-driven manner. Its current SDK version is 3.7. However, the official version has some disadvantages. Firstly, once the application's main function enters the main display loop, control can never return to the main function. Secondly, when the user closes the display window, the whole application is forced to terminate as well. These facts place serious limitations on the interactivity of the program and are considered bad design for human-computer interaction. To solve these problems, a modified version was created by Fletcher [40] that overcomes these limitations. The HART library uses this modified version, allowing multiple display windows.

Images are loaded and saved by the I/O module, once again only in the PGM or PPM file formats. For the retina and imvector types, two new file types were created in order to store them on the hard disk: the .RET (for


the retina) and .VEC (for the image vector) file types. Data are saved, again, sequentially and in binary mode. The format of a vector file is actually the same as that of an image, but instead of pixel values it holds retina responses, so the resulting files are far smaller than the original images (for example, 8 times smaller for a 512 x 512 color image). Admittedly, PGM and PPM generate large files in order to remain compatible with any platform, but compression is achieved either way. A simple example: sampling a 512 x 512 grayscale image with an ideal retina of 36,000 nodes. To perform any operation on the raw image we must process 262,144 pixels separately, whereas with the artificial retina there are only 36,000 "pixels" (that is, responses).

The Initialization module includes the constructors for the two basic data types of the library (image and imvector) and an initialization function that must be called before any further calculations with the HART library, as it performs very important set-up of the graphics card and of OpenGL for graphics.

Finally, the auxiliary module includes all the functions that are not directly connected with the HART library but show the possibilities for expanding it. Currently, only one such function has been developed, Camera(), which provides a real video input feed to the library. The camera is manipulated through the OpenCV library (see [38] for more). The Camera() function interacts with the user, as the fixation point can be changed dynamically during execution using the mouse. This demonstrates how a retina can "rove" exactly as a human eye does, in order to focus on different points. Query_Device() is a quite useful function that informs the user about certain capabilities of his graphics card.

All these functions are known as host functions, because they reside and execute on the normal CPU. They make calls, however, to the device functions (also known as kernels) that actually do the entire job of sampling/reconstruction; those functions reside and execute in parallel on the GPU. Below, in figure 4-10, the reader can see how they are categorized:

Figure 4-10: Overview of HART functions executed on the GPU.

The last character of every function name denotes its version, of which there are two: 'G' stands for grayscale data (1 array) and '3' stands for RGB data (3 arrays). Nothing changes in either the design or the implementation between the two versions. So, doSample() performs the sampling of an image and is callable from the host function Sample(), while doReconstructV() and doReconstructG() perform the back projection of an image vector and are both callable from the host function Reconstruct(); the letters 'V' and 'G' stand for the Voronoi and Gaussian methods respectively.


Reflecting the original requirement for simplicity and ease of use, all data types and function names are self-explanatory, and no programmer should have difficulty using the library or understanding the input arguments, even without previous experience of C. For detailed information about the HART library, the code itself as well as a 'Read Me' file are available on the accompanying CD.

4.6.2 On CUDA Programming

Programming with CUDA proved to be easier than expected, but many parts require special attention if maximum performance is to be achieved. In the present section the author records some useful conclusions drawn from his own experience with CUDA.

First of all, it is vital to know the physical properties of the graphics card that is about to be used; the Query_Device() function of the HART library provides all the necessary information. The most important properties are the number of multiprocessors, the amount of global memory, the warp size and the maximum number of threads available per block. The maximum grid and block dimensions and the amount of shared memory are equally important, but those properties remain more or less unchanged between different models. After writing an initial version of the program that executes correctly, it is time to optimize it and test it with various parameters. Nvidia translates optimization into occupancy, which expresses how much of the graphics card's resources are used. Of course, 100% occupancy is best; however, this is almost never possible, due to numerous limitations outside the programmer's control. Some basic guidelines that lead to well-designed code are:

• Total number of blocks should be at least equal to or greater than the number of multiprocessors.

• Total number of threads should be a multiple of the warp size.

• Global memory should be accessed properly, according to paragraph 3.4.4.2.

• Bank conflicts should be avoided, as described in paragraph 3.4.4.1, if shared memory is used.

• Thread synchronization should be avoided when not absolutely needed.

• Control statements should be as few as possible; a graphics processor is not like a CPU, as explained in paragraph 3.4.

To achieve good execution times, one good solution is to benchmark with different thread/block sizes for every kernel. This is a common practice that is thoroughly reliable. Alternatively, Nvidia provides a tool (the CUDA Occupancy Calculator Tool, included in the CUDA SDK) which informs the user about the occupancy of each kernel on the graphics card for varying numbers of threads, registers or amounts of shared memory. For the present project, however, this tool proved less reliable, because execution times depend heavily on the input data and not all threads execute the same operations (in reconstruction, a thread processes a pixel only if it lies inside the field of view). The results from this project’s benchmark are presented in the next chapter.

When it comes to programming, it is useful to apply loop unrolling (via the precompiler directive #pragma unroll ) to loop structures that perform the same operation on every iteration. Loops are used extensively in this project, and loop unrolling helped the compiler allocate registers in a better way. Since version 2.0 of CUDA, the compiler unrolls all loops by default.
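For illustration, a minimal sketch of the directive in use; the constant and function names here are hypothetical, not taken from the library.

    /* Minimal sketch of loop unrolling with #pragma unroll; the constant
       and the function name are hypothetical. */
    #define COEFF_COUNT 8

    __device__ float weightedSum(const float *vals, const float *coeffs)
    {
        float sum = 0.0f;
        #pragma unroll                     /* hint nvcc to unroll fully */
        for (int i = 0; i < COEFF_COUNT; i++)
            sum += vals[i] * coeffs[i];    /* same operation each pass */
        return sum;
    }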


On the other hand, the two main limitations that can cause a kernel to fail to execute are:

• Number of registers: the application launches with so many threads that the available registers are not enough.

• Amount of memory: the kernel allocates more memory than it is allowed.

To deal with these two problems the author made some design choices. For the register count (which is the most important), an optimization procedure was followed in order to minimize the number as much as possible, though numerous tests and several rewritings of the same code were needed before reaching a safe conclusion. As a rule, the author concluded that the number of declared variables has no direct bearing on the number of registers used; on the contrary, declaring more variables sometimes decreases register use. Secondly, it is often thought that use of shared memory leads to fewer registers. This is wrong as well: shared memory only provides faster loading times, and in the end all values have to be loaded into registers to perform a calculation. So, the only way to reduce register use is to simplify the calculations. By this it is meant that calculations have to take place gradually, rather than in large compound expressions as may be usual in normal CPU programming. This is easily explained with an example. Given the following code:

for (int i = start + index; i < end + index * 3; i++)
    /* do calculations */;

It is better to re-write it in the following way:

start += index;
end += index * 3;
for (int i = start; i < end; i++)
    /* do calculations */;

This technique has more effect when used inside loops, and it was applied everywhere. The author managed to reduce the register count by 3 or 4 by following this approach. The compiler has an option that forces compilation with a certain maximum number of registers (--maxrregcount), but it is quite inefficient, because the compiler will spill to local memory instead of registers and the result will be very slow. Reducing the number of registers makes it possible to run more threads, achieving more parallelization. The CUDA 2.0 compiler, even as a beta version, handles registers much better than any previous release.

Regarding limited memory, the most serious constraint comes from shared memory (only 16 KB in size). The author decided not to use shared memory at all, except for storing the kernels’ arguments. The amounts of data that the library handles make the shared memory impractical: using it would put serious limitations on the parallel launching of threads, decreasing performance. Moreover, continuously storing to and loading from it, even if fast, would create complicated code and would require many synchronization points among the threads. A further benefit is that there is no need to worry about bank conflicts. So, as a solution, a simple design was kept: all the data needed reside in global memory, and where useful the cached memory regions (constant and texture) are used, as they provide better performance. Constant memory is used extensively for global variables (for


example, the fixation point or the number of receptive fields) and texture memory for storing the image on the graphics card. Correct access to global memory (as described in paragraph 3.4.4.2) is ensured by aligning (with the __align__ keyword) the new data structures (image , retina and imvector ). Keeping the retina in a linked list is useful because each thread accesses only a contiguous memory space, which complies with the coalescing rules. However, this does not hold for the reconstruction methods, as information from different receptive fields is needed and the kernel accesses memory almost randomly.
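As a sketch of what such an aligned declaration might look like (the field layout here is illustrative, not the library’s exact definition):

    /* Illustrative sketch of an aligned structure; the actual fields of
       the library's image type may differ. */
    struct __align__(16) image
    {
        int width;
        int height;
        int channels;            /* 1 for grayscale, 3 for RGB */
        unsigned char *data;     /* device pointer to the pixel values */
    };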

As the reader will see in chapter 5, execution times are very good, so using global memory does not seriously hurt performance for the needs of the current problem. On the contrary, it enables the library to work with any retina or image size without complications. Putting the rest of the rules into practice, the author created a small algorithm, run before every execution of a kernel, that tries to optimize the [blocks, threads] numbers so that the programmer does not have to worry about those details. As explained in the next chapter, a special optimization procedure was followed for some specific retina sizes; for the rest, the algorithm simply prevents the kernel from crashing.

Finally, following Nvidia’s recommendations about reducing the data transferred between GPU and CPU, an initialization procedure takes place before the invocation of each kernel, and only the data that the kernel actually needs is transferred. Initialization times are small, as can be seen in the next chapter, and performance benefits from this practice. For example, kernel coefficients are not needed for the Voronoi reconstruction, and kernel size information is not needed by the Gaussian reconstruction, as implemented in this project. Moreover, memory space is saved for other possible needs.
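A sketch of this memory placement, with hypothetical variable names that are not the library’s actual symbols, might look as follows:

    /* Sketch of the placement described above: small per-frame parameters
       in cached constant memory, the image bound to a texture. All names
       here are illustrative. */
    __constant__ int d_fixationX;    /* fixation point, x coordinate */
    __constant__ int d_fixationY;    /* fixation point, y coordinate */
    __constant__ int d_numFields;    /* number of receptive fields */

    texture<unsigned char, 2, cudaReadModeElementType> texImage;

    void setFixation(int x, int y)
    {
        /* upload to constant memory before launching the kernel */
        cudaMemcpyToSymbol(d_fixationX, &x, sizeof(int));
        cudaMemcpyToSymbol(d_fixationY, &y, sizeof(int));
    }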

4.7 Results

The only way to visualize an image vector is to reconstruct the image using the Reconstruct() function described above; this projects the image vector into an understandable format, and these reconstructions are the results presented here. All images were produced with the HART library and the standard Lena color image (see figure 5-1) with a retina of 8,192 nodes, unless otherwise stated.

4.7.1 Reconstructed Images

Figure 4-11 below shows the Gaussian reconstructed image generated before applying normalization. Artifacts (as described in section 4.4.2) do not appear because the program was executed in emulation mode, where all threads execute sequentially on the CPU.


Figure 4-11: Gaussian reconstruction without normalization.

The image reconstruction result with normalization applied is shown below:

Figure 4-12: Gaussian reconstruction with normalization.

As can be seen from figure 4-12, the resulting images are of better quality and smoother compared to that of figure 4-11. Moreover, there is no need for value scaling, which was necessary with the previous formula due to the large numbers it generated.

Figure 4-13 shows the same result using the Voronoi region method.


Figure 4-13: Voronoi reconstruction.

Differences may not be very visible because the images have been scaled down to fit the page. Finally, the reader can see the behavior of logz-model retina tessellations (counting 4,096 nodes) below, in figure 4-14.

Figure 4-14: Logz tessellation reconstruction. Both Gaussian (left) and Voronoi (right) based.

It is obvious that the results are not as good as with normal retina tessellations (see figures 4-12 and 4-13). Image detail is low and there is no pixel continuity in the center of the retina. It can also be said that Voronoi reconstruction is not applicable, as the result distorts the original image considerably. Although the author did not have the chance to test logz+α tessellations, they are not expected to perform better.


4.7.2 Reconstructed Pyramid Layers

In this section, the reconstructions of some retina layers are presented using both the Gaussian (figure 4-15) and the Voronoi (figure 4-16) methods. Three layers are presented in total (using 4,096, 2,048 and 1,024 nodes respectively), plus the original reconstruction at the highest resolution (8,192 nodes). Of course, more retina layers have been created, and they are available on the accompanying CD.

Figure 4-15: Pyramid layers, normalized Gaussian reconstruction using 8,192 (upper left), 4,096 (upper right), 2,048 (lower left) and 1,024 (lower right) nodes.


And here is the Voronoi version of the above pictures:

Figure 4-16: Pyramid layers, Voronoi reconstruction using 8,192 (upper left), 4,096 (upper right), 2,048 (lower left) and 1,024 (lower right) nodes.

Any comments made for the reconstructed images apply to the layer reconstructions as well. In both cases it can be noted how quality decreases as the number of nodes decreases, which is entirely expected.

4.8 Discussion & Conclusion

In this chapter, the author explained the basic notions behind the library’s design, as well as the functions and data structures that were used. Some basic implementation problems were also discussed, and the design choices were justified.

As already explained, the general idea behind the sampling and reconstruction functions is to maximize their compatibility, no matter the input data. The design is kept simple, in order to avoid unwanted problems for which a solution might not exist.


CUDA as a development environment is relatively new (first released at the beginning of 2007), and there is not yet a stable version that can promise to be free of complications. Nvidia’s practice is to release beta versions and improve them in cooperation with universities, research groups or even individuals. Moreover, up to now, there is no document explaining how the compiler works or how the memory is organized internally. The initial design was kept throughout the whole development procedure. For example, when sampling the Lena image with an 8,192-node retina, the current design allows programs to run by launching 8,192 threads for sampling and 16,384 threads for reconstruction. This indicates that CUDA thrives on parallelization: the more threads, the better the execution times. If the computations inside the kernels were more complicated (by managing more data), then there would be fewer threads. Moreover, there is a problem of resource management: more complex threads mean more memory and register usage, which can put further limitations on the number of threads per block. Then more blocks must be created, and if the number of blocks exceeds the number of multiprocessors in the graphics card, they will have to wait. Having lightweight kernels executed on many threads is the best practice for CUDA programming, and this is what was attempted in this project, as should be clear from the previous paragraphs. For sampling, one thread is used for each sampling point, and for reconstruction the smallest workable tile size is used. The resource demands of each kernel are presented in the following chapter.
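To make the one-thread-per-sampling-point mapping concrete, a skeleton of such a kernel might look as follows; the names and arguments are illustrative, not the library’s actual ones.

    /* Skeleton of the one-thread-per-receptive-field mapping described
       above; names and arguments are illustrative. */
    __global__ void doSampleSketch(float *responses, int numFields)
    {
        int node = blockIdx.x * blockDim.x + threadIdx.x;
        if (node >= numFields)
            return;                /* guard threads beyond the last node */

        /* each thread computes the response of exactly one receptive
           field and writes a single output element */
        responses[node] = 0.0f;    /* placeholder for the actual sum */
    }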

Results are better for the Gaussian projection compared to Matlab’s (where no normalization rules apply). Moreover, it can be noted that intensity is better distributed. No differences are noted for the Voronoi method, though. Generally speaking, the Gaussian reconstruction is better and more realistic than the Voronoi regions. However, Voronoi regions may sometimes be preferred, as they produce a sharper result. Finally, logz retina models prove to be insufficient both in quality (images have less detail) and in execution time. This is explained by the fact that the sampling points are far apart, so kernels tend to be bigger, which increases execution time and the smoothness of the result.

The advantage of using OpenGL for displaying images is that the user can resize the image and the display area will automatically adjust to the new dimensions, using linear transformations. Furthermore, it adds many interactive features, such as allowing the retina to rove when processing real-time input by simply moving the mouse.

Even though the development time for this project was limited, no problems were found during testing, regardless of the input data. This supports the overall quality and functionality of the library. It is true, though, that the author did not have the time to exhaustively test the current design against alternatives; some theoretically justified assumptions were made to support the current implementation.


Chapter 5

Validation

In the current chapter, all the results from the validation procedure are presented, compared with other implementations and assessed. The author concludes on the correctness of the results and on whether the implementation meets the original demands in terms of data quality and speed. The validation procedure as well as the reference test machine are as described previously in paragraph 2.4.

5.1 Validation Process

5.1.1 Measurable Variables

Upon completion of the development, the library must be tested to ensure its correct functionality and the quality of the produced results. During the testing, a number of variables were defined, in order to know what is measured and how it is assessed, so that the author could reach conclusions with supporting evidence.

Three variables have been defined:

• Speed.

It is measured in milliseconds and indicates how much time is needed to complete a full RC transformation, or any function independently. Elapsed time is easily measured through programmatic high-accuracy timers (measuring in clock cycles) that are available in the majority of high-level programming languages. In this project, in order to be more objective, a timer from the Windows kernel was used (QueryPerformanceCounter , defined in <windows.h> ). If the time is not what is expected, dropped frames will be noticed while processing real-time input. Various optimizations or redesigns of parts of the code were necessary in order to reduce the timings. Of course, measurements vary depending on the data; this is why tests have been organized in groups depending on the size of the retina and the size of the image. To reduce any uncontrolled factors, each function was iterated 10 times and the mean execution time was taken as the final result.

• Response accuracy.

This is a way to check whether the implementation is functionally correct. The only way to ensure this is by comparing our result values with those of the already existing Matlab software implementation [2, 4]. Accuracy is absolutely vital in this project, since the results obtained will later be used for other purposes and processed by other programs or libraries. A mistake in the current phase would simply accumulate in later stages, resulting in unpredictable behavior. Inaccuracies in response values result in visual artifacts and usually appear when there are computational mistakes. This is why the problem was tackled gradually and in steps.


As a measurement standard the Root Mean Squared Error was chosen. The Root Mean Squared Error, as defined in [39], is a means of measuring the differences between values predicted by a model or an estimator and the values actually observed from the thing modeled or estimated. So, given that the estimated data is $\hat{Y}_i$ and the observed data is $Y_i$, the formula is:

$$RMSE(\hat{Y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{Y}_i - Y_i)^2}$$
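In code, the measure reduces to a few lines of C; this sketch assumes two response vectors of equal length n and is not taken from the validation scripts themselves.

    #include <math.h>

    /* Sketch of the RMSE computation above, for two response vectors of
       length n (est = estimated values, obs = observed values). */
    double rmse(const float *est, const float *obs, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double d = (double)est[i] - (double)obs[i];
            sum += d * d;          /* squared error per element */
        }
        return sqrt(sum / n);      /* root of the mean squared error */
    }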

The RMSE is used to calculate the average magnitude of the error and is originally derived from statistics. Graphs based on the simple difference of the values are also provided.

• Image quality.

Since this project involves a lot of image processing, the first assessments were made by empirical evaluation of the projected pictures (especially when reconstructing the original image from the sampled one). Image quality is defined here in terms of smoothness and information continuity throughout the whole image area. Certainly, our own eyes are the quickest way to make the very first judgments on the results we get. Then, we can subtract the retinal image from the image reconstructed by the original Matlab vision system and comment on the differences: if two images are identical, the result should be a totally black picture.

In any case, the reader must bear in mind that the purpose of all tests is not simply to see whether the current implementation works, but also how it copes with large input data (i.e. high-resolution pictures). Also, we must ensure that tests were made using the same input data and the same retina models for both the developed library and the original Matlab vision system.

5.1.2 Experiments Description

The experimental procedure involved two phases. Firstly, the system was tested on static images of various resolutions in order to see whether the results of sampled images are correct. Ten pictures were tested, with resolutions varying from 512 x 512 to 2,048 x 1,364, to ensure good compatibility. Apart from different image sizes, the library must be tested with different retinas as well. A total of 17 different retina tessellations were created (including layers for pyramid creation); detailed times are presented for 10 of them. The generated retinas differ primarily in the number of nodes (from 64 to 16,384), the fovea region (7% to 30%) and the algorithm of creation (based on self-similar networks and logz modeling). The original Matlab code does not allow the creation of tessellations larger than 16,384 nodes due to memory problems. Every test was iterated 10 times before retrieving the result, to reduce unwanted factors that affect the timing. When execution times were measured and compared, the well-known 512 x 512 pixel image of Lena was used (see figure 5-1). Results presented were sampled with an 8,192-node retina, unless otherwise stated. To help the procedure, scripts were written in Matlab that allow the library to communicate with a Matlab vision system. To compare the results with Matlab, the produced retinas were saved into a file and then loaded back into Matlab in order to ensure that the two programs used exactly


the same data. The retina was fixated upon the centre of the image in all cases. Also, two simple programs were written in C, firstly to give an example of how to use the library and secondly to speed up the testing process, as they make it possible to change execution parameters without recompiling the library. The programs perform the RC transformation on a single image and on a series of frames acquired from a video input.

Secondly, with the help of a webcam, the system was tested on its ability to manage real-time input. A third demo program was written for this purpose that takes input from any camera (webcam, firewire or commercial DV), samples the image, projects it back to a new image and displays it. By measuring the frames per second (fps) we were able to judge whether the design meets the original expectations in terms of speed and image quality. Frames per second were counted programmatically with a simple idea: every second, the number of frames sampled and reconstructed successfully is checked, multiplied by 1000 and divided by the total elapsed time. One might think that the elapsed time is always one second (1000 milliseconds), but this is not true, as the processing may take a little longer before the control statement is reached. So:

$$fps = \frac{frames \times 1000}{t_{elapsed}}$$
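A sketch of this counter in C, using the same Windows timer as the rest of the measurements; the sampling/reconstruction call is only indicated by a comment, and the helper name is illustrative.

    #include <stdio.h>
    #include <windows.h>

    /* Sketch of the fps counter described above; the measured elapsed
       time is used instead of assuming exactly 1000 ms. */
    static double nowMs(void)
    {
        LARGE_INTEGER freq, t;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&t);
        return 1000.0 * (double)t.QuadPart / (double)freq.QuadPart;
    }

    void processLoop(void)
    {
        int frames = 0;
        double start = nowMs();

        for (;;) {
            /* sample and reconstruct one frame here */
            frames++;

            double elapsed = nowMs() - start;   /* may exceed 1000 ms */
            if (elapsed >= 1000.0) {
                printf("%.1f fps\n", frames * 1000.0 / elapsed);
                frames = 0;
                start = nowMs();
            }
        }
    }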

Figure 5-1: The standard Lena image that was used (among others) during tests.

The project was developed in a Windows XP environment and was also compiled on Windows Server 2008 to provide a 64-bit version of the library. It should be executable on any platform as long as the suitable libraries are present. The computer system was composed of an Intel Core 2 Quad processor at 2.66 GHz, 4 GB of DDR2 RAM (clocked at 667 MHz), 2 x 250 GB SCSI hard disks and one Nvidia 8800 GTX graphics card, all mounted on a suitable Asus motherboard. Any conventional camera or webcam can be used for video input, as long as appropriate drivers are installed. The library was tested with a Panasonic NV-DS15EG DV camera (resolution 720 x 576), a Creative PD1001 (resolution 320 x 288) and a VF0350 (resolution 320 x 240, 640 x 480) USB webcam, and a Sony DFW-X700 Firewire camera (resolution 1,024 x 768) owned by the Computer Vision and Graphics Research Group at the Department of Computer Science of the University


of Glasgow. All times were measured by the high-precision timers provided by the Windows kernel, which actually measure CPU clocks; they are then converted into milliseconds. Attention must be paid during measurement: when CUDA launches a kernel, control returns immediately to the program on the CPU, which continues to execute while the GPU runs at the same time. That is why the cudaThreadSynchronize() function was used before any measurement, which avoids this. Execution times in Matlab code were measured with the standard commands tic and toc . Having measured the execution times, tmatlab for the sequential Matlab code and thart for the parallel GPU implementation in the HART library, a comparison can be made by calculating the speedup. The speedup S refers to how much faster the GPU implementation is than the Matlab one, and is calculated by the following formula:

$$S = \frac{t_{matlab}}{t_{hart}}$$
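The measurement pattern, including the synchronization point, can be sketched as follows; launchKernel() is a placeholder for any of the library’s kernel invocations, not an actual HART function.

    #include <windows.h>
    #include <cuda_runtime.h>

    void launchKernel(void);           /* placeholder kernel invocation */

    /* Sketch of the timing pattern described above: synchronize with the
       GPU before and after the launch, then convert ticks to ms. */
    double timeKernelMs(void)
    {
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);

        cudaThreadSynchronize();       /* ensure no prior GPU work pending */
        QueryPerformanceCounter(&t0);

        launchKernel();                /* returns immediately to the CPU */
        cudaThreadSynchronize();       /* wait until the GPU has finished */

        QueryPerformanceCounter(&t1);
        return 1000.0 * (double)(t1.QuadPart - t0.QuadPart)
                      / (double)freq.QuadPart;
    }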

During all the tests it must be ensured that any external factors that could influence the procedure and falsify the results are eliminated as far as possible. To achieve this, tests took place on a computer with a fresh copy of Windows Server 2008 (SP1) installed and with any unused hardware disabled, such as sound devices. No additional software was installed, except for absolutely necessary drivers and utilities.

To measure all the above, a set of Matlab functions was developed that allows the calculation of the results. These functions are at the reader’s disposal and are included on the accompanying CD; the results are demonstrated and commented on in the following sections. Feedback and comments are encouraged.

5.2 Retinal Responses

In this section, the reader can see how accurate the image vectors created with the HART library are. The comparison is made with the results of the most trusted software to date, the Matlab vision system created by Balasuriya [1]. Measuring the mean squared error between the two vectors over a series of 10 pictures, the average error was 0.000045. This is quite encouraging and suggests that the initial design of the sampling works correctly. Even this small difference is adequately justified, for two reasons: firstly, as already stated, the HART library uses floats for the Gaussian kernel coefficients in order to save memory space, and secondly, the operating environment is different. So, it can be said that the Matlab version is more accurate in its computations due to higher precision (it uses doubles for kernel coefficients). Differences are also expected in general, as Matlab and C are two completely different programming environments. Figure 5-2 shows the difference of the values on the sampled Lena image, with a retina of 8,192 nodes, in relation to the eccentricity of the retina.


Figure 5-2: Difference in vector values between HART and Matlab.

The difference of the values of a reference vector $V_r$ and an observed vector $V_o$ can be expressed mathematically as:

$$V_r - V_o$$

So, the highest of these values is:

$$\max(V_r - V_o)$$

The highest deviation noted on the contrast scale is 0.0002, which suggests good quality for the image vector. It can be noted that as eccentricity increases, the deviation increases as well. This is normal, if one bears in mind that kernels grow as eccentricity grows: more computations are involved, so errors accumulate and the total error increases. This can be seen more clearly if we plot the difference of the vectors when sampling a completely white image (all pixel values equal to 255). The result is displayed in figure 5-3.


Figure 5-3: Vector attenuation over eccentricity.

Again, the error is very small, but the attenuation of the vector over the eccentricity of the retina is clearer: the error increases as we move away from the fovea region.

5.3 Reconstructed Images

Results on reconstructed images from image vectors are encouraging as well. Ideally, subtracting two images, one generated with the Matlab version and the other with the HART library, should produce a totally black image. But as the vectors are not exactly the same, differences are expected. This does not undermine the validity of the results at all, as the differences are too small for a human eye to notice.

For the Voronoi reconstruction, figure 5-4 shows the difference of the two images. Contrast had to be increased 14x in order to make the differences noticeable.

Figure 5-4: Difference of values between a HART and a Matlab Voronoi based reconstruction.


Although differences were expected, the error proves to be even smaller than anticipated. Region values appear to be exactly the same; differences are noticed only on the borders of regions. It seems that reconstructed images from the HART library are translated by 1.5 pixels to the southeast. This cannot be considered a serious limitation, as back-projection is only used to understand conceptually what an image vector looks like; a real vision system would deal only with the image vector.

As far as the Gaussian method is concerned, results are a little better. Even with the original Matlab vision system it had been noted that the Gaussian reconstruction is better than the Voronoi, because it offers a smoother image due to correct information continuity. Voronoi regions perhaps give a crisper image, but to the eye the final result looks too artificial. Furthermore, with the application of the normalization described in chapter 4, results were even better (compare figure 4-12 with figure 4-11). Subtracting the reconstructed images without normalization, the result was an almost blank picture, as can be seen in figure 5-5a. Subtracting the Gaussian reconstruction from the normalized Gaussian reconstruction reveals the noise that is removed, which caused the final image to look quite mottled in appearance; the result is presented in figure 5-5b. With normalized Gaussian reconstruction, information continuity is much smoother. Again, contrast had to be increased 14 times.

(a) (b)

Figure 5-5: (a) Subtraction of a HART and a Matlab Voronoi reconstructed image. (b) Subtraction of a HART and a Matlab Gaussian reconstructed image.

5.4 Pyramids Evaluation

As with the retinal responses, the responses from the pyramid layers have to be validated as well, as a different approach is used compared to the Matlab system. Defining as Direct the process of creating the pyramid layers by sampling the image with large kernels (implemented in the HART library), and as Indirect the process of creating the pyramid layers by sub-sampling the previous finer image vector (implemented in Matlab by Balasuriya [1]), the author calculated the Root Mean Squared Error (RMSE) and the Mean Difference of Absolute Values (MDAV) of the following combinations: HART Direct vs. Matlab Indirect, and Matlab Direct vs. Matlab Indirect. The HART library, as already described, works


only with the Direct method and Matlab with the Indirect one. However, the retinas used in the HART library were loaded into Matlab in order to provide a Direct version for comparison. The results can be seen in the following figure (5-6):

Pyramid Layers' Validation

Layer's Resolution    H-Direct vs. M-Indirect     M-Direct vs. M-Indirect
(no. of nodes)        RMSE       MDAV             RMSE       MDAV
4,096                 0.281      1.2188           0.281      1.2192
2,048                 0.750      11.586           0.750      11.587
1,024                 1.171      18.234           1.171      N/A
512                   0.407      5.713            0.407      N/A

Figure 5-6: RMSE and MDAV for the retina layers' responses.

From the above chart it is concluded that there are no significant differences between the two methods (Direct and Indirect). As the RMSE is exactly the same for both methods, it can be said that the HART library achieves identical results to the Matlab version, which has already been shown in section 5.2. To confirm it, a comparison of HART Direct vs. Matlab Direct was made as well, and the results justified the initial claim (for the data of the above figure, the results were 0.0000964, 0.0000569, 0.0000432 and 0.0000739, respectively). These results, however, come from the RMSE comparison. After obtaining them, the author also calculated the mean difference of the absolute values, shown in the chart above. It can now be said that differences exist, but they are too small to be taken into account; the HART library achieves slightly better results compared to the Matlab version. The N/A values mean that the test was unable to complete, possibly because the error grew too large to fit in a 32-bit variable. So there is a possibility that the error grows large under certain circumstances, but why this happens is unknown to the author at the time of writing of the present dissertation. An alternative explanation, due to a Matlab bug, is given in section 5.8.

A plausible explanation of why the error increases as the layer’s resolution decreases is the generation of large kernels, which involves more complicated calculations. Reconstructed vector layers can be seen in figures 4-15 and 4-16, and they are of adequate quality. This is expected, as the functions used are the same as for the retina reconstruction and have already been validated in section 5.2.

5.5 Execution Times

Speed, as has been stated numerous times in the present dissertation, is one of the primary priorities. As a general overview of the execution times compared to Matlab, performance has increased considerably, based on the evidence shown in figure 5-7.


Comparison of Execution Times (ms)

Function                         Matlab     HART Library               Speedup (x)
                                            Initialization    Total
Sample Grayscale                 910        43.12             68.65      13.2
Sample Color                     2,080      41.49             68.87      30.2
Reconstruct Voronoi Grayscale    41,270     3.20              446.87     92.3
Reconstruct Voronoi Color        40,030     4.12              537.29     74.5
Reconstruct Gauss Grayscale *    4,320 *    47.57 *           958.42 *   4.5 *
Reconstruct Gauss Color *        5,120 *    47.78 *           975.25 *   5.2 *

Figure 5-7: Execution times and speedup compared to Matlab version.

Certainly the biggest advantage is in the Voronoi reconstruction, where the speedup is greatest. For the Gaussian reconstruction, however, we must bear in mind that the Matlab implementation is not the same as in the HART library, as it has no normalization procedure. This certainly affects the comparison: a fair measurement would require redesigning the algorithm and measuring it under the same conditions. Surprisingly, even with the extra cost of normalization there is still a small performance advantage; if the HART library did not support normalization, the speedup is estimated to be well over 10x. The fact that the HART initialization times are small suggests an adequate design and correct memory access. Sampling is more important than reconstructing, because sampling is what a vision system will primarily use; reconstruction only serves to get a conceptual idea of what an image vector represents. As one may notice, initialization times are provided as well. Those times refer to anything that needs to happen before executing the kernel, i.e. memory allocations and the data transfer from the CPU to the GPU. The sampling and Gaussian reconstruction functions need more time because they transfer the image and the coefficient kernels to the GPU; the Voronoi reconstruction needs nothing but the sampling points. Further time is saved in reconstruction because the new image is created directly on the GPU and only transferred back to the CPU at the end. For all the above measurements a standard deviation was calculated as well (by using Matlab’s built-in function std ), but it was too small (2-3 ms) to make a difference, so it is not reported.

The above times, in terms of real-time input processing, are equivalent to 2 frames/sec when sampling and reconstructing with Voronoi regions and 0.9 frames/sec when sampling and reconstructing with the Gaussian method. Of course, as retina sizes decrease, performance increases. Below, in figure 5-8, is a screenshot of a program using the HART library to process live video from a webcam at a resolution of 352 x 288.


Figure 5-8: Screenshot of a program using the HART library in real-time.

In the next figure (5-9) we can see the execution times varying with the retina size. The graph includes measurements for the logz retina model as well.

Figure 5-9: Execution times varying with retina size. Logz modeled retina is included.

What happens is as expected: as retina nodes increase, execution times rise. The biggest problem is noticed with the 8,192- and 16,384-node retinas; the latter especially reached up to 2,400 ms. This is why those two were chosen for further optimization, as described in the next paragraph. The timings of the sampling functions (grayscale and color) are almost identical everywhere.

In the next figure (5-10), we can see the execution timings while creating various pyramid tessellations.


Figure 5-10: Execution times varying with pyramid size.

As quality increases (moving from a coarser to a finer level), execution times rise as well. A small anomaly occurs in the sampling functions, which the author cannot explain. These execution times must not be confused with the measurements made by varying the retina size in the previous figure: even if the number of nodes is the same, the kernels are not. As nodes are reduced, the kernels grow in order to cover all the required area; that is why the resulting files are bigger as well (on average 20 times bigger). However, no formula could be found to connect the times; they vary from 20x to 100x, and performance can decrease dramatically in some cases. For the same reason, in the previous figure the logz-modeled retinas show higher execution times: they have few nodes covering a large area, which is why the results are not so good. Increasing the number of nodes would increase the sampled area, not the quality.

Speaking of execution times, it is useful to mention here the benefits of binary-mode I/O. In related tests, the time needed to load a 16,384-node retina file is around 224 ms, which theoretically translates to a speed of 240 MB/sec. The same speed applies to the other I/O functions in the library.
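As an illustration of binary-mode I/O, the following sketch reads a block of kernel coefficients in one call; the on-disk layout assumed here (an element count followed by raw floats) is hypothetical, not the library’s actual file format.

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch of binary-mode I/O; the assumed on-disk layout (an element
       count followed by raw floats) is hypothetical. */
    float *loadCoefficients(const char *path, unsigned int *count)
    {
        FILE *f = fopen(path, "rb");   /* binary mode: no text translation */
        if (!f)
            return NULL;

        fread(count, sizeof(unsigned int), 1, f);
        float *buf = (float *)malloc(*count * sizeof(float));
        fread(buf, sizeof(float), *count, f);   /* one bulk read */
        fclose(f);
        return buf;
    }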

5.6 Kernel Optimization

In this section, the results from the code optimization procedure are presented, as described in paragraph 4.6.2. In figure 5-11 we can see the register and shared memory usage for each of the kernels of the library.


Register and Shared Memory Usage per Function Kernel

Kernel             Register Count (no.)    Shared Memory (bytes)
doSampleG          11                      28
doSample3          16                      28
doReconstructVG    17                      32
doReconstructV3    16                      32
doReconstructGG    23                      36
doReconstructG3    23                      36

Figure 5-11: Register and shared memory usage per kernel.

It is possible to compare these results with the predictions that the CUDA Occupancy Calculator makes about the change in performance when the register count or the shared memory usage is altered. These values are presented in the following two diagrams (figures 5-12 and 5-13):

Figure 5-12: GPU occupancy varying register count.


Figure 5-13: Occupancy varying shared memory usage.

The above diagrams confirm the initial claims that registers are by far more important than shared memory for increasing performance. Full optimality is achieved when sampling a grayscale image (kernel doSampleG ), where we get the best results by achieving 84% occupancy. It is also noticeable how important it is to reduce the register count by even one unit. The kernels doSample3 and doReconstructV3 use 16 registers; had no optimization taken place, the register count would have been at least 17, which would have dropped occupancy from the current 64% to 49%. The Gaussian reconstruction methods, due to their increased complexity, require more resources; reducing resource usage elsewhere enables the program to increase parallelization by launching more threads at the same time. However, adjusting the register usage by redesigning the code is not always possible. The author, at the moment of writing, cannot explain why doReconstructVG requires more registers than doReconstructV3 , since the algorithm is the same and the latter handles more data than the former. As Nvidia does not release any details of the nvcc compiler, issues like these cannot be resolved.

The amount of shared memory used corresponds to the memory needed for storing the arguments; identical numbers between different kernels mean that they use exactly the same arguments. Any ideas for taking advantage of shared memory that were previously characterized as inefficient are now proven to be even more so: as can be seen from figure 5-13, in order not to affect performance, shared memory usage must be limited to only 4 KB of the available 16 KB, which of course imposes serious limitations.

As far as the reconstruction kernels are concerned, better times can be achieved by not reconstructing the whole image from first principles, but only the area that is visible to the retina’s receptive fields. This technique was adopted from the original Matlab vision system, by defining a variable that represents the retina’s field of view, or fovea region. So, before any calculations, each thread checks whether its pixel is indeed inside that area (a sketch of this test is given after figure 5-14). The author experimented with various fovea sizes applied to the different retina tessellations that had been created, and concluded that the optimal values per retina model are the following (figure 5-14):


Fovea's Recommended Values

Retinal Nodes    Fovea Size
16               8
64               15
256              36
512              55
1,024            73
2,048            100
4,096            120
8,192            180
16,384           270
logz 2,304       150
logz 4,096       200

Figure 5-14: Fovea’s field of view recommended values.

It is obvious that performance increases, because the algorithm does not have to process the whole image but only a part of it. Measured tests demonstrated that performance can increase by up to 50% for an 8,192-node retina. Note that these values do not apply to the pyramid layers; in that case the fovea size is always the same as the finer level’s fovea, no matter the size of the layer itself.
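The field-of-view test itself reduces to a few lines inside the reconstruction kernel; the following is a sketch with illustrative names, not the library’s actual code.

    /* Sketch of the field-of-view test described above; all names are
       illustrative. Threads whose pixel lies outside the fovea radius
       return immediately, skipping the back-projection entirely. */
    __global__ void reconstructSketch(unsigned char *out, int width, int height,
                                      int fixX, int fixY, int fovea)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height)
            return;

        int dx = x - fixX;
        int dy = y - fixY;
        if (dx * dx + dy * dy > fovea * fovea)
            return;                /* outside the retina's field of view */

        /* ... back-projection computation for pixel (x, y) ... */
    }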

5.7 Retina Optimization

As stated in the previous paragraph, apart from the code optimization, a special optimization procedure took place for retina sizes of 8,192 and 16,384. The optimization was based on exhaustive benchmarking in order to find the optimal number of threads and blocks. As explained in paragraph 4.6.1, a small algorithm was developed that finds a valid [blocks, threads] combination that will not cause the kernel to crash, no matter the input data (both retina and image); this algorithm is based on the observations described in paragraph 4.6.2. In this section, the optimization process is briefly described for the 8,192-node retina. For the 16,384-node retina the process remains the same, so only the results are presented.

Firstly, an investigation was made by altering the number of threads and blocks separately. The number of blocks can be said to denote how parallelizable a kernel is using independent resources, and the number of threads how parallelizable it is using shared resources. In figures 5-15 and 5-16 we can see the results when altering the number of blocks and the number of threads for each kernel, respectively, while keeping all other factors constant.


Figure 5-15: Execution times varying with blocks number. Printed values show optimal performance.

Figure 5-16: Execution times varying with threads number. Printed values show optimal performance.

From the above diagrams, the first observation is that the time deviation is relatively small (excluding some extreme conditions). The biggest noted difference when varying block size is 50 ms, and when varying thread size 40 ms. So any performance gain achieved by optimization will not be tremendous. However, this should not discourage us from optimizing, as useful conclusions are made, and when talking about a real-time system everything matters.

Next, it can be seen that the sampling functions execute more efficiently with more blocks than threads, the Voronoi color reconstruction with more threads than blocks, and the


Gaussian and Voronoi grayscale reconstruction functions prefer a balanced combination of both.

Before continuing with the optimization process, it must be taken into account how the reconstruction method works: the image is split into tiles, and each tile is processed independently. It follows that the optimal tile size needs to be determined before continuing the optimization procedure. Figure 5-17 below shows such a benchmark:

Figure 5-17: Reconstruction times varying with tile size. Printed values show optimal performance. At values (2, 2) and (18, 18) the test did not complete.

The conclusion is that the smaller the tile size, the better. After determining the optimal tile size (which is [4, 4]), the process continued. Since the exact number of threads is known, the only remaining question is how to split them across different blocks. To achieve this, a splitting factor called ‘multiprocessors’ is defined, reflecting the notion that each block runs uniquely on a single multiprocessor. So, once again, benchmarking was performed by adjusting this parameter. The natural maximum of this parameter is the number of multiprocessors on the graphics card (for example, 16 on a GeForce 8800 GTX). However, limiting the range in this way was not efficient, so tests were made with bigger values as well. The results can be seen in the following figure (5-18):


Figure 5-18: Execution times varying with multiprocessors. Printed values show optimal performance.

As with figures 5-15 and 5-16, no big time differences (in terms of standard deviation from the mean) are noticeable (the biggest difference is 13 ms). After checking the final results, it was concluded that the optimal values for the ‘multiprocessors’ parameter are as below (figure 5-19):

Optimal Multiprocessor Values

Kernel             8,192 retina    16,384 retina
doSampleG          64              512
doSample3          24              512
doReconstructVG    4               1,024
doReconstructV3    16              1,024
doReconstructGG    128             128
doReconstructG3    128             128

Figure 5-19: Multiprocessor variable values.

The total performance gain from this optimization practice is at most approximately 1.6% for the 8,192-node retina and 4.5% for the 16,384-node retina. The speedup is not significant, but useful conclusions were drawn during the procedure. It is useful to note at this point that CUDA defines grid and block sizes as vectors with 3 dimensions. This has no importance here at all: the only thing that matters is the total number of threads inside a block or the total number of blocks inside a grid. For example, there is no difference between a vector with dimensions [2, 3, 1] and another with dimensions [3, 2, 1].

If the reader wants to examine the execution times in more detail, they are available, along with all the diagrams, on the accompanying CD in Excel format.


5.8 Discussion & Conclusion On this chapter the author presented the results from the validation procedure that

was described in chapter 2. Some comments were made on the results and interesting conclusions were reached.

First of all, the reference model used was the set of results obtained from the Matlab vision system. The results cannot be compared against an absolute standard, as in reality there is no global, valid and widely accepted reference model. The Matlab system was chosen because it is the most recent research in the field of the artificial retina that is both current and valid. So, when differences occur, it is very difficult to give a justified reason why. As commented in section 5.3, the results that the HART library produced did not deviate much from Matlab’s results. Any existing differences can be justified by the different precision used in the calculations (Matlab uses 8 bytes and the HART library 4 bytes for float values), and perhaps by the computations involved in finding the center of each kernel’s receptive field. This is a crucial point, because not sampling at the correct point of the image results in artifacts, as noticed in Fegan’s [11] results. This is why, in the current project, pictures were passed as textures to the graphics card memory: textures enable sub-pixel precision and automatic boundary handling, so CUDA is responsible for finding the right point at which to sample.

The algorithm created for finding a valid [blocks, threads] combination could easily be replaced by passing two further arguments when calling the function. Even though this is programmatically feasible, it would make the library difficult to use, as many decisions have to be made before choosing a number; for example, blocks and threads are 3-dimensional values with limited capacity, and there is a fixed number of registers per block. Furthermore, there would have been no optimization as described in paragraph 5.7. In any case, as the execution times show, there are no big differences when altering block or thread numbers. Finally, since the results from sampling and reconstructing an image are valid, and the same functions are used for the pyramid layers, the validation of the pyramids justified the initial claims. The results were more accurate compared to the Matlab version, since the sampling is done directly on the image and not on an approximation of the values of the previous finer layer. However, the author was not able to perform all tests on every single layer, since problems arose with Matlab. It seems that version 7.0.1 (R14), which was used during the experiments, has some memory leaks and causes the program to crash when managing large kernels. Comparisons with a newer version of Matlab (2007a is currently available) on a 64-bit machine with a larger memory address space are needed in order to fully justify this claim.

Execution times are very promising for what is possible with GPU acceleration in general. The library may not achieve a performance of 25 fps, but this does not lower the importance of the implementation, for two main reasons. Firstly, as already explained, a real biologically motivated vision system needs only sampling in order to function properly, and the sampling functions in the HART library are much faster than the reconstruction, achieving 15 fps (with no reconstruction at all and sampling with an 8,192-node retina); every further processing step takes place on the image vectors. Secondly, there is the question of what is defined as “real-time” performance. Scientifically, it has been shown that the human eye


cannot perceive any difference when more than 25 fps are displayed. But this is an optimal situation applicable only to video watching. For a real-time computer vision system, which has to deliver specific services in a timely manner, such a rate is almost always impossible (mainly because of the execution cost and the complexity of the algorithms); but this does not mean that it is not working in real time. The HART library can become more “real-time” by lowering the standards: on a 320 x 240 pixel video input, applying a 4,096-node retina, 4 fps can be achieved when sampling and reconstructing with Voronoi regions (2.5 fps with normalized Gaussian). And this is a fairly good execution time for real-time needs.

Bringing this dissertation to an end, the next chapter provides an overview of the work that has been done, as well as some potential ideas for expanding or improving the HART library.


Chapter 6

Final Conclusion

This final chapter includes a summary of what has been achieved and the results obtained in this project. There is also a section describing some possible extensions to the library, to improve either its functionality or its capabilities.

6.1 Overview

This project involved an investigation into mapping an artificial retina to a hardware-accelerated GPU. The software output of this work is a library named Hardware Artificial Retina Transformation (HART). This optimized library integrates the basic functions of an artificial retina into the parallel architecture of recent graphics cards manufactured by Nvidia. The library was created with the Compute Unified Device Architecture (CUDA) - an SDK provided by Nvidia that extends the well-known C programming language. The basic functions supported by the HART library are:

• Handling the artificial retina, no matter its size and type.

• Fast sampling of an image to an image vector.

• Fast reconstruction of a sampled image vector back to the image plane, using two methods (Voronoi regions and normalized Gaussian).

• Fast creation of vector pyramid layers by using a valid artificial retina model.

In addition, other features were added in order to facilitate its usage and improve the overall functionality:

• Basic display support using OpenGL.

• Camera manipulation for real-time input feed.

• Basic I/O for all the newly created data structures (retina, image vector and images).

• Interoperability with any other library or program, using the interface provided by Dynamic Link Libraries in Windows.

Results, compared to the already existing Matlab vision system, are very good. In terms of precision, a root mean squared error of 0.00045 was achieved when comparing 10 vectors sampled with different retinas and image sizes. In terms of speed, which is most important, the library is able to perform a full artificial retina transformation in less than one second (using a 512 x 512 color image, an 8,192-node retina and Voronoi regions for back-projection). The importance of normalization in the Gaussian reconstruction was also noted in this project, producing artifact-free images. Finally, some interesting programming techniques were discussed during


this dissertation that are absolutely vital for any programmer to know when CUDA programming is involved (for example, how to reduce register usage). It was also demonstrated why sampling the image directly is better than sub-sampling the previous responses when generating vector pyramid layers: as we move to coarser layers, the error accumulates and the result deviates from the correct value.

The created library is a useful tool for anyone who wants to test different retina tessellations on different data in a fast, easy and reliable way. It can also be used as an intermediate filter that provides a biologically motivated aspect of vision to most Computer Vision algorithms or systems. This is quite interesting if one bears in mind the advantages of the artificial retina and its potential to become more widely used, now that working with large retinas is no longer a problem. The main advantage of using an artificial retina is the compression achieved on the image data. If the human brain were built to process the whole field of view at uniform resolution, it would have to weigh around 60 kilograms. The same applies in Computer Vision: it is faster to process a few thousand retinal responses than a whole image pixel-by-pixel, which can contain millions of values (the calculation after this paragraph illustrates the scale of this compression). Other researchers have found the artificial retina useful in their own work, such as Tunley and Young [42] when calculating first-order optic flow. The contribution of the present project to artificial retina research is, firstly, that the execution-time bottleneck no longer exists, allowing the retina to run even under near real-time conditions. Furthermore, a normalized version of the Gaussian reconstruction was introduced, which produces better reconstructed images. Finally, it offers better quality when creating pyramid layers and is able to back-project those layers with Voronoi regions (which was not possible before). The fact that the library has been tested with several different retinas ensures compatibility with any other retina tessellation. However, some limitations remain, for example in the accuracy of the calculations, as single bytes are used for pixel values and single-precision floats for all real numbers.
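To illustrate the scale of this compression, the following back-of-envelope calculation compares the pixel count of the 512 x 512 benchmark image with the 8,192 retinal responses it is reduced to; both figures are taken from this dissertation.

// Data reduction achieved by an 8,192-node retina on a 512 x 512 image
#include <stdio.h>

int main( void )
{
    long pixels = 512L * 512L;   // pixels in the benchmark image
    long nodes  = 8192;          // nodes in the retina

    printf( "%ld pixels vs %ld responses: roughly %.0fx fewer samples\n",
            pixels, nodes, (double)pixels / nodes );   // about 32x per channel
    return 0;
}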

The results of the present work have already been handed over to Mr. Gerardo Aragon (contact: [email protected]) [43] and Mr. Indradeo Ram (contact: [email protected]), both PhD students in the Computer Vision and Graphics Group of the Department of Computing Science at the University of Glasgow, for further investigation, as they found the work useful for their own individual projects. Furthermore, in cooperation with my supervisor Dr. Paul Siebert, there are plans to publish the present work in a conference or journal of an appropriate research field, such as Compressive Sensing, High Performance Computing or Computer Vision.

6.2 Further Work

Packaged as a Dynamic Link Library, HART offers a wide variety of options for being called from other programs, so there is no integration problem apart from the fact that the specific data types must be recreated on the caller's side.
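As a minimal sketch of this interoperability, the snippet below loads the library at run time through the standard Windows API and calls one of its exports. The file name hart.dll is an assumption made for illustration; Query_Device is taken from the API listed in Appendix B.

// Sketch: calling HART through its Windows DLL interface
#include <windows.h>
#include <stdio.h>

typedef void (*QueryDeviceFn)( void );

int main( void )
{
    HMODULE lib = LoadLibraryA( "hart.dll" );   // assumed DLL file name
    if ( lib == NULL ) {
        printf( "Could not load the HART DLL\n" );
        return 1;
    }

    // Resolve an exported function by name
    QueryDeviceFn query = (QueryDeviceFn)GetProcAddress( lib, "Query_Device" );
    if ( query != NULL )
        query();    // takes no arguments, per the API in Appendix B

    FreeLibrary( lib );
    return 0;
}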

However, there are some other aspects that can be improved:

• Further optimization

There is scope for further optimization of the algorithms developed here. For example, before each kernel call the image is transferred to the GPU; this is unnecessary when creating pyramids, since all layers sample the same image. A total of 120 ms of execution time can be saved when creating 4 pyramid


layers (the idea is sketched at the end of this item). Also, the retina handling can be changed when allocating new memory: a memory-pooling technique could be introduced so that the data can be transferred to the GPU as a single block of memory, instead of grouping the kernels together before every execution (even if that is fast). An optimization procedure as described in section 5.5, applied to all possible retina sizes and models, would also be quite useful; the author did not achieve large performance improvements with optimization here, but it is not yet determined what will happen with different retina models (such as log(z+α)). Time can also be reduced when reconstructing with the normalized Gaussian method. Currently, an iteration over all of the receptive fields is made in order to gather the kernel coefficients of overlapping fields; this can be avoided by passing the neighbours of each receptive field and iterating only over them, although this requires sending more data to the GPU.
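Here is a minimal, self-contained sketch of the single-transfer idea; it is not HART's actual code, and the kernel body is a placeholder for the real receptive-field sampling. The point is the memory-traffic pattern: the image is copied to the GPU once and every pyramid layer then samples the same device-resident buffer.

// Sketch: one host-to-device image transfer shared by all pyramid layers
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void sampleLayer( const unsigned char * img, float * out, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n )
        out[i] = (float)img[i];   // placeholder for the receptive-field sums
}

int main( void )
{
    const int W = 512, H = 512, LAYERS = 4, NODES = 8192;
    unsigned char * h_img = (unsigned char *)calloc( W * H, 1 );
    unsigned char * d_img;
    float * d_vec[LAYERS];

    cudaMalloc( (void **)&d_img, W * H );
    cudaMemcpy( d_img, h_img, W * H, cudaMemcpyHostToDevice );   // once only

    for ( int l = 0; l < LAYERS; l++ ) {
        cudaMalloc( (void **)&d_vec[l], NODES * sizeof(float) );
        // every layer samples the same device-resident image
        sampleLayer<<< (NODES + 255) / 256, 256 >>>( d_img, d_vec[l], NODES );
    }
    cudaThreadSynchronize();   // wait for all layers to finish

    for ( int l = 0; l < LAYERS; l++ )
        cudaFree( d_vec[l] );
    cudaFree( d_img );
    free( h_img );
    return 0;
}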

• Further interaction

New interaction ideas could be introduced to make the library even easier to use. It would be potentially useful to be able to rove the retina in real time over static images, as is currently possible with camera video input. To extend the capabilities further, it would be interesting to be able to sample the same image with more than one retina and then reconstruct a single final image from the information in all of the samples.

• Further parallelization

A further study could examine how to maximize performance when two or more graphics cards are available. In that case the programmer has to choose which operation is executed on which graphics card. There have been official announcements from Nvidia that future releases of CUDA will allow transferring data between different graphics cards. For example, a robot head could use one GPU for stereo vergence, another for image sampling and another for image reconstruction. Dividing the processing load across functions in this way could result in true real-time performance.


References & Bibliography

[1] Balasuriya L. S., “A Computational Model of Space-Variant Vision Based on a Self-Organised Artificial Retina Tessellation”, Ph.D. Thesis, Department of Computing Science, University of Glasgow, Glasgow, March 2006.

[2] Balasuriya L. S. and Siebert J. P., “An Architecture for Object-based Saccade Generation using a Biologically Inspired Self-organised Retina”, Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, July 2006.

[3] Balasuriya L. S. and Siebert J. P., “Hierarchical Feature Extraction using a Self-Organised Retinal Receptive Field Sampling Tessellation”, Neural Information Processing - Letters & Reviews, Vol. 10, No. 4-6, pp. 83-95, April-June 2006.

[4] Balasuriya L. S., Siebert J. P. and Cockshott P., “Functional and Operational Documentation for Retina-based Indexing Software”, European Project: Integrated Project Research Area CINE (IST-2-511316-IP), D6.3.3, Draft Version 1, 2005.

[5] Boyling T. A. and Siebert J. P., “Foveated Vision for Space-Variant Scene Reconstruction”, Proceedings of the 35th International Symposium on Robotics (ISR), Nord Villepinte, Paris, France, pp. 1-6, March 2004.

[6] Bolduc M. and Levine M. D., “A Real-time Foveated Sensor with Overlapping Receptive Fields”, Real-Time Imaging Journal, Vol. 3, pp. 195-212, Elsevier, 1997.

[7] Brugnot S., Siebert J. P. and Cowan C. W., “Inductive Generation of Icon Trees in Foveated Multi-Resolution Recognition”, Proceedings of the 7th International Conference on Image Processing and Its Applications, Vol. 1, pp. 275-279, July 1999.

[8] Burt P. J. and Adelson E. H., “The Laplacian Pyramid as a Compact Image Code”, IEEE Transactions on Communications, Vol. 31, No. 4, pp. 532-540, 1983.

[9] Campbell N. A. and Reece J. B., “Biology”, 7th International Edition, pp. 1061-1063, Pearson Benjamin Cummings Education International, 2005.

[10] Clippingdale S. and Wilson R., “Self-similar Neural Networks Based on a Kohonen Learning Rule”, Neural Networks Journal, Vol. 9, No. 5, pp. 747-763, July 1996.

[11] Fegan S., “Porting an Artificial Retina from MATLAB”, M.Sc. Thesis, Department of Computing Science, University of Glasgow, Glasgow, September 2007.

[12] Ferrari F., Nielsen J., Questa P. and Sandini G., “Space Variant Imaging”, Sensor Review Journal, Vol. 15, No. 2, pp. 17-20, 1995.

[13] Fung J., Mann S. and Aimone C., “OpenVIDIA: Parallel GPU Computer Vision”, Proceedings of the ACM Multimedia International Conference, Singapore, pp. 849-852, 2005.

[14] Gomes H., “Model Learning in Iconic Vision”, Ph.D. Thesis, University of Edinburgh, Scotland, UK, 2002.

[15] Horn B. K. P., “Robot Vision”, MIT Press, McGraw Hill, 1986.

[16] Labatut P., Keriven R. and Pons J., “A GPU Implementation of Level Set Multiview Stereo”, Proceedings of the 6th International Conference on Computational Science (ICCS), Reading, UK, Springer, May 2006.

[17] McGwin G., Xie A. and Owsley C., “Rate of Eye Injury in the United States”, Archives of Ophthalmology Journal, Vol. 123, No. 7, pp. 970-976, July 2005.

[18] NVIDIA, “NVIDIA CUDA Compute Unified Device Architecture Programming Guide”, Version 1.1, Nvidia Corporation, November 2007.

[19] NVIDIA, “The CUDA Compiler Driver NVCC Manual”, Version 1, Nvidia Corporation, November 2007.

[20] O'Rourke J., “Computational Geometry in C”, Cambridge University Press, New York, 1994.

[21] Perlin K., “Image Synthesis: An Image Synthesizer”, Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques, pp. 287-296, 1985.

[22] Sandini G., Questa P., Scheffer D., Dierickx B. and Mannucci A., “A Retina-Like CMOS Sensor and its Applications”, Proceedings of the IEEE Sensor Array and Multichannel Signal Processing Workshop, March 2000.

[23] Schwartz E. L., “Computational Anatomy and Functional Architecture of the Striate Cortex”, Vision Research Journal, Vol. 20, pp. 645-669, 1980.

[24] Schwartz E. L., “Spatial Mapping in Primate Sensory Projection: Analytic Structure and Relevance to Perception”, Biological Cybernetics Journal, Vol. 25, pp. 181-194, 1977.

[25] Siebert J. P., “The Fast Retino-Cortical Transform”, IE Student Project Support, Release 1.1.1, Turing Institute, December 1991.

[26] Siebert J. P. and Eising I., “Scale-Space Recognition based on the Retino-Cortical Transform”, Proceedings of the IEEE 5th International Conference on Image Processing and its Applications, Edinburgh, Scotland, UK, July 1995.

[27] Siebert J. P., “Digital Image Processing”, Level 4/M Lecture Notes, Department of Computing Science, University of Glasgow, 2008.

[28] Sinha S. N., Frahm J., Pollefeys M. and Genc Y., “GPU-based Video Feature Tracking and Matching”, Workshop on Edge Computing Using New Commodity Architectures (EDGE), Chapel Hill, North Carolina, USA, May 2006.

[29] Tumulty G. and Resler M. M., “Eye Trauma”, The American Journal of Nursing, Vol. 84, No. 6, pp. 740-744, June 1984.

[30] Vassilas N., “Digital Image Processing”, Level 4 Lecture Notes, Technological Educational Institute of Athens, 2007.

[31] American Health Assistance Foundation, anatomyEyeNew.jpg, image of an eye, http://www.ahaf.org/glaucoma/about/AnatomyEye.htm (link last verified 25/8/2008).

[32] Digit-Life, xfx-8800gts-scan-front-small.jpg, image of an Nvidia graphics card, http://www.digit-life.com/articles2/video/g80-2.html (link last verified 25/8/2008).

[33] Tom's Hardware, http://www.tomshardware.com/2008/03/18/nvidia_geforce_9800_gx2_review/page4.html (link last verified 25/8/2008).

[34] Sinauer Associates Inc., “Neuroscience”, Second Edition, ch11f4.gif, structure of the retina, http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=.0gLpPO__22i8C3LhKCEBD4P2ACBFViWcLTIEFT (link last verified 25/8/2008).

[35] Wikipedia, The Free Encyclopedia, “Graphics Processing Unit”, http://en.wikipedia.org/wiki/Graphics_processing_unit (link last verified 25/8/2008).

[36] Duarte M. F., Davenport M. A., Takhar D., Laska J. N., Sun T., Kelly K. F. and Baraniuk R. G., “Single-Pixel Imaging via Compressive Sampling”, IEEE Signal Processing Magazine, March 2008.

[37] Weisstein E. W., “Voronoi Polygon”, MathWorld - A Wolfram Web Resource, http://mathworld.wolfram.com/VoronoiPolygon.html (link last verified 25/8/2008).

[38] Intel Corporation, “Open Computer Vision Library” (OpenCV), http://opencvlibrary.sourceforge.net/, 2001 (link last verified 25/8/2008).

[39] Wikipedia, The Free Encyclopedia, “Root Mean Squared Error”, http://en.wikipedia.org/wiki/Root_mean_square_deviation

[40] Fletcher R. (with permission from Robbins N.), GLUT - The OpenGL Utility Toolkit, http://www-users.york.ac.uk/~rpf1/glut.html (link last verified 25/8/2008).

[41] Moore G. E., “Cramming More Components onto Integrated Circuits” (Moore's Law), Electronics Magazine, 1965.

[42] Tunley H. and Young D., “First Order Optic Flow from Log-Polar Sampled Images”, Third European Conference on Computer Vision (ECCV), Vol. I, pp. 132-137, 1994.

[43] Haitham F., Camarasa A. G. and Siebert J. P., “Towards Binocular Active Vision in a Robot Head System”, IEEE Towards Autonomous Robotic Systems Conference, University of Edinburgh, 1-3 September 2008.


Appendix A Occupancy per Kernel

The results presented in this section were generated with the CUDA Occupancy Calculator and show which number of threads is optimal for each function executed on the GPU. One must not forget that an optimal number of threads does not necessarily yield optimal performance for the program as a whole. As already explained, the benchmarking procedure is the trustworthy measure for that, since execution times depend heavily on the input data and not all threads execute the same operations (two levels of activity can be recognized: finding pixel values or doing nothing). In every diagram, the printed values show where optimal performance is achieved (the larger, the better).

• Kernel doSampleG (max. occupancy: 83%)


• Kernel doReconstructVG (max. occupancy 50%)

• Kernels doReconstructV3, doSample3 (max. occupancy 67%)


• Kernels doReconstructGG, doReconstructG3 (max. occupancy 33%)

It is easy to see why, for example, the Gaussian reconstruction needs more time than the Voronoi one: due to its increased complexity it uses more registers, which prevents the kernels from running with many threads. Also, some kernels have exactly the same diagram. This reflects the crude way occupancy is calculated, requiring no information other than the register count and the amount of shared memory used. This is why the author does not consider occupancy diagrams a trustworthy means of comparison.
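For completeness, here is a hedged worked example of how such a register-bound occupancy figure is derived. The hardware limits assumed below (8,192 registers and 768 resident threads, i.e. 24 warps, per multiprocessor) are those of G80-class devices, and the per-kernel register count is illustrative rather than measured from HART.

// Worked example: occupancy from register pressure alone (G80 limits assumed)
#include <stdio.h>

int main( void )
{
    const int regs_per_sm      = 8192;   // registers per multiprocessor
    const int max_warps_per_sm = 24;     // 768 threads / 32 threads per warp
    const int warp_size        = 32;

    int threads_per_block = 256;   // launch configuration (illustrative)
    int regs_per_thread   = 16;    // kernel register usage (illustrative)

    int blocks = regs_per_sm / (threads_per_block * regs_per_thread);
    int warps  = (blocks * threads_per_block) / warp_size;

    printf( "occupancy = %d/%d warps = %.0f%%\n", warps, max_warps_per_sm,
            100.0 * warps / max_warps_per_sm );   // 16/24 warps = 67%
    return 0;
}

Note how doubling the register usage would halve the number of resident blocks, which is exactly why the register-hungry Gaussian kernels sit at lower occupancy.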


Appendix B Full HART API

Here is a complete list of the functions provided by the HART library, together with their argument lists and return values.

image * Init_Image();

imvector * Init_Vector();

void Load_Image( image * img );

retina * Load_Retina( char * filename, unsigned int * size );

void Load_Vector( imvector * vec, char * filename );

void Save_Image( image * img );

void Save_Vector( imvector * vec );

void Display_Image( image * img );

imvector * Sample( image * img,
                   retina * ret,
                   unsigned int rsize,
                   unsigned int fix_x,
                   unsigned int fix_y );

image * Reconstruct( imvector * sample,
                     retina * ret,
                     unsigned int rsize,
                     unsigned int fix_x,
                     unsigned int fix_y,
                     unsigned int rfovea,
                     char type );

void Query_Device();

void HART_Init( int argc, char * argv[] );

void Camera( retina * ret,
             unsigned int rsize,
             unsigned int fix_x,
             unsigned int fix_y,
             unsigned int fovea,
             int width,
             int height,
             char type );


Appendix C Examples

In this final section, two programming examples are provided to give the reader a first, rough idea of how to use the HART library and of what programming in CUDA looks like.

• Example on using the HART library

This example shows HART's basic capabilities. The program reads an image, reads a retina, samples the image, reconstructs the image using Voronoi regions, and finally saves both the sampled image (in the form of an image vector) and the reconstructed image to the hard disk.

// Includes [file: im_test.c]
#include <stdio.h>
#include <windows.h>
#include <stdlib.h>
#include "hart.h"

int main( int argc, char * argv[] )
{
    image *pic = Init_Image(), *result;
    imvector *sampled = Init_Vector();
    retina *ret;
    unsigned int size, fovea = 180;

    printf( "Initializing...\n" );
    HART_Init(argc, argv);          // make necessary initializations

    printf( "Loading retina...\n" );
    ret = Load_Retina( "C:\\retina\\retina4096.ret", &size );

    pic->filename = "C:\\data\\camera.ppm";
    printf( "Loading image...\n" );
    Load_Image(pic);

    printf( "Sampling image...\n" );
    sampled = Sample( pic,    // image to sample
                      ret,    // retina to sample with
                      size,   // number of retina nodes
                      NULL,   // fixation point, x
                      NULL ); // fixation point, y

    printf( "Reconstructing image...\n" );
    result = Reconstruct( sampled,   // sampled image (image vector)
                          ret,       // retina to sample with
                          size,      // number of retina nodes
                          NULL,      // fixation point, x
                          NULL,      // fixation point, y
                          fovea,     // field of view
                          VORONOI ); // type of reconstruction
                                     // (VORONOI or GAUSS)

    printf( "Displaying image...\n" );
    printf( "  (press Enter or Escape to exit)\n" );
    Display_Image(result);    // display reconstructed image

    printf( "Saving image...\n" );
    Save_Image(result);       // save result image to source's directory

    printf( "Saving vector...\n" );
    Save_Vector(sampled);     // save image vector to source's directory

    printf( "\n* Execution completed successfully! *\n" );
    getchar();
    return 0;
}

• Example on programming with CUDA

Below is one of the many examples the author wrote to gain experience with CUDA programming. The program is very simple: it applies the power (squaring) function to ten randomly generated numbers in parallel. The kernel is stored in a different file from the rest of the program, for clarity; it is usual practice in CUDA programs to keep the code that runs on the GPU in a separate file from the code that runs on the CPU. The program also makes use of shared memory in order to increase performance.

// Includes, system [file: test.cu]
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// Includes, help libraries
#include <cutil.h>

// Includes, kernel
#include <test_kernel.cu>

// Number of elements
#define N 10

// Main program
int main( int argc, char ** argv )
{
    // Initialize device
    CUT_DEVICE_INIT();

    // Setup execution parameters
    dim3 threads(10, 1, 1);
    dim3 grid(1, 1);

    // Variables
    int host_data[N], res_data[N];
    int *devc_data;
    bool passed = true;

    // Initialize values
    for ( int i = 0; i < N; i++ )
        host_data[i] = rand();

    // Allocate memory on the device
    CUDA_SAFE_CALL(cudaMalloc((void **)&devc_data, sizeof(int) * N));

    // Transfer data from host to device
    CUDA_SAFE_CALL(cudaMemcpy(devc_data, host_data, sizeof(int) * N,
                              cudaMemcpyHostToDevice));

    // Execute the kernel on the device
    powercu<<< grid, threads, sizeof(int) * N >>>(devc_data);

    // Check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed");

    // Transfer data from device to host
    CUDA_SAFE_CALL(cudaMemcpy(res_data, devc_data, sizeof(int) * N,
                              cudaMemcpyDeviceToHost));

    // Deallocate memory on the device
    CUDA_SAFE_CALL(cudaFree(devc_data));

    // Check if results are OK
    for ( int i = 0; i < N; i++ )
        if (res_data[i] != (host_data[i] * host_data[i]))
            passed = false;

    if (passed == true)
        printf("Test completed successfully.");
    else
        printf("Test failed to complete.");

    // Shutdown, terminate
    CUT_EXIT(argc, argv);
    return EXIT_SUCCESS;
}

#ifndef _TEST_H_   // [file: test_kernel.cu]
#define _TEST_H_

// Includes
#include <stdio.h>

// Number of elements
#define N 10

__global__ void powercu( int * data )
{
    extern __shared__ int shared[];
    const int tid = threadIdx.x;

    // Copy input to shared memory
    shared[tid] = data[tid];

    // Synchronize to make sure the data is loaded
    __syncthreads();

    // Perform the operation (square each element)
    shared[tid] *= shared[tid];

    // Synchronize to make sure the preceding computation
    // is done before writing the result back
    __syncthreads();

    data[tid] = shared[tid];
}
#endif