
Face Recognition on FPGA Spring Term Report

EECE 501

Final Year Project

Ramzi Madi

200200055

Robin Lahoud

200200271

Bassem Sawan

200200267

Supervisor Prof. Mazen Saghir

May 23, 2006


TABLE OF CONTENTS

1. INTRODUCTION
1.1 Problem Definition
1.2 Applications
1.3 Motivation and Objectives

2. LITERATURE REVIEW
2.1 Still-Image versus Video
2.2 Algorithms for Face Recognition
2.2.1 Principal Component Analysis
2.2.2 Linear Discriminant Analysis
2.2.3 Independent Component Analysis
2.2.4 Neural Networks
2.2.5 Genetic Algorithms
2.3 FPGA Implementation of Face Recognition
2.3.1 FPGA Implementation using PCA
2.3.2 FPGA Implementation Using Composite/Modular PCA
2.3.3 FPGA Implementation using Artificial Neural Networks
2.3.4 FPGA Implementation using Genetic Algorithm
2.3.5 FPGA Implementation using Evolutionary Reconf. Architecture
2.4 Issues with Face Recognition

3. DESIGN
3.1 System Specifications
3.1.1 Algorithm
3.1.2 Inputs and Outputs
3.1.3 Timing Constraints
3.2 System Description
3.3 Hardware FPGA Components
3.3.1 V2MB-1000 Overview
3.3.2 MicroBlaze Processor
3.3.3 OPB Interface
3.3.4 BRAM Controller
3.3.5 External Memory Controller
3.3.6 Ethernet MAC Controller/Driver
3.3.7 UARTLite Controller/Driver
3.3.8 On-board Hardware Multipliers
3.4 Software FPGA Components
3.5 PC Application Components
3.6 Memory Management
3.7 PCA Algorithm
3.7.1 Training Phase
3.7.2 Recognition Phase
3.8 Project Budget

4. IMPLEMENTATION
4.1 Modeling Algorithm in MATLAB
4.1.1 Implementation Details
4.1.2 Implementation Results
4.1.3 Exporting Code to C
4.2 Training Stage Implementation in C
4.2.1 Custom Function Library
4.2.2 Training Stage Calculations
4.2.3 Writing to Binary Files
4.3 Ethernet Implementation in C#
4.4 Recognition Stage Implementation on FPGA
4.4.1 Receiving Ethernet Frames and Storing to DDR
4.4.2 Verification and Testing of Ethernet Interface and DDR
4.4.3 Recognition Phase Implementation on the FPGA
4.4.4 Verification and Testing of Recognition Phase
4.4.5 Implementation of Performance Measurements

5. CRITICAL APPRAISAL
5.1 Researching Face Recognition
5.2 Modeling in MATLAB
5.3 Working with C
5.4 Using the FPGA
5.5 Researching Hardware Multipliers
5.6 Learning to Use Ethernet
5.7 Porting PCA to the FPGA
5.8 Performance Assessment

6. RESULTS
6.1 Methodology Overview
6.2 PC Implementation
6.3 FPGA Implementations
6.4 Overall Performance Analysis

7. EXTERNAL FACTORS AND CONSTRAINTS

8. CONCLUSION

9. REFERENCES

10. APPENDIX
10.1 PCA Code in MATLAB
10.2 PCA Code in C
10.2.1 Matrix Library
10.2.2 Eigenvector Functions
10.2.3 PCA Algorithm
10.3 Ethernet Code in C#
10.4 FPGA Code
10.5 Recognition Results


LIST OF FIGURES

Figure 1: Applications of Face Recognition
Figure 2: Face Recognition Algorithms
Figure 3: System Block Diagram
Figure 4: V2MB-1000 Board
Figure 5: P160 Communications Module
Figure 6: Ethernet CAT5 Cable
Figure 7: RS-232 Serial Cable
Figure 8: Sample Face
Figure 9: Average Face
Figure 10: Eigenfaces
Figure 11: FPGA Comparison
Figure 12: Total Performance

LIST OF TABLES

Table 1: MATLAB Functions Used
Table 2: MATLAB Implementation Results
Table 3: C Functions Used
Table 4: Implementation Descriptions
Table 5: Implementation 1 Results
Table 6: Implementation 2 Results
Table 7: Implementation 3 Results


1. INTRODUCTION

1.1 Problem Definition

Face recognition is a form of biometric identification that relies on data acquired

from the face of an individual. This data, which can be either two-dimensional or three-

dimensional in nature, is compared against a database of individuals. In recent years, face

recognition has gained popularity among researchers all over the world. With

applications ranging from security to entertainment, face recognition is an important

subset of biometrics.

In real world applications, it is desirable to have a stand-alone, embedded face

recognition system. The reason is that such systems provide a higher level of robustness,

hardware optimization, and ease of integration. As such, we have chosen the FPGA as a

reconfigurable platform to carry out our implementation. Ultimately, the stand-alone

system may be implemented on an ASIC, a dedicated processor, or even an FPGA chip,

depending on the trade-offs in speed, portability, and reconfigurability.

1.2 Applications

Face recognition systems have gained a great deal of popularity due to the wide

range of applications in which they have proved useful. Broadly, two main categories

for these applications exist: commercial applications and research applications.

From a commercial standpoint, face recognition is practical in security systems

for law enforcement situations. It is in places like airports and international borders that

the need arises for a face recognition system that identifies individuals. Another

application of face recognition is the protection of privacy, obviating the need for


exchanging sensitive personal information. Instead, a computer-based face recognition

system would provide sufficient identification. For instance, PINs, user IDs, and

passwords would be replaced by face recognition in order to unify personal identification.

Finally, face recognition systems can be used for entertainment purposes in areas like

video games and virtual reality [1].

In research applications, face recognition has opened the door for research in

areas like image and video processing [1]. The approaches used in face recognition are

useful in the general area of pattern recognition and data classification. Research has also

progressed into the realm of neural networks, where the human nervous system is used as

a model to attain higher recognition rates. Lastly, face recognition has paved the way for

advances in the field of computer vision. Any research in face recognition is a step

forward in autonomous vision-based artificial intelligence.

Figure 1: Applications of Face Recognition

[Diagram: face recognition applications divided into Commercial (Security, Unified PIN, Entertainment) and Research (Computer Vision, Pattern Recognition).]


1.3 Motivation and Objectives

After extensive research into the field of face recognition, we have found that

there is ample room for improving upon currently available face recognition systems.

These improvements range from the robustness of the design to the speed and accuracy of

the system. An FPGA can provide us with the necessary resources to achieve such

improvements in face recognition. These resources include various communication

interfaces, memory types, and intellectual property cores, as well as one million logic

gates that allow us to implement custom logic.

As such, the objective of our project is to implement a still-image based face

recognition algorithm on an FPGA. We will use a hardware/software co-design

approach, delegating the more mathematically intensive tasks to the hardware while

controlling the algorithm procedure in software. Our aim is to achieve a speedup in the

process of recognition through the use of multiple parallelized components on the FPGA

while maintaining high accuracy in the results.

2. LITERATURE REVIEW

2.1 Still-Image versus Video

In the literature, two main forms of face recognition exist: still-image-based face

recognition and video-based face recognition. Still-image-based face recognition relies on

classifying an individual based on a single image obtained from a still-shot camera.

Conversely, video-based face recognition relies on a sequence of frames to extract more

information about the face of a subject.


An inherent advantage of using still-image-based face recognition over video-

based systems is that the images are of higher resolution. As a result, current face

recognition algorithms are able to recognize a face more accurately. Furthermore,

still-image-based recognition is useful in controlled environments where pose and

illumination are relatively fixed. One example of such an environment is the taking of a

subject's photograph at an airport check-in [1]. The disadvantages of still-image-based

face recognition occur when such a controlled environment is not easily attainable. An

example of this scenario would be a security camera used to identify a subject in a public

place. In this case, video-based recognition yields better results.

The clear advantage of video-based face recognition occurs in situations where

the image resolution is low and the video feed is continuous. Video-based algorithms

capitalize on both spatial and temporal variations in a subject's face. Nevertheless, a

natural disadvantage is the low resolution of the images being captured [1]. Since an

individual might be located at a distance, the pixels that represent this individual’s face

might not constitute a sufficient information base for the algorithm to operate correctly.

Hence, each of the two approaches is suited to different situations.

2.2 Algorithms for Face Recognition

2.2.1 Principal Component Analysis

Principal Component Analysis (PCA), as applied to face recognition by Turk and Pentland, treats face recognition

as a two-dimensional recognition problem [2]. The correctness of this algorithm relies on

the fact that the faces are uniform in posture and illumination. PCA can handle minor

variations in these two factors, but performance is maximized if such variations are


limited. The algorithm basically involves projecting a face onto a face space, which

captures the maximum variation among faces in a mathematical form.

During the training phase, each face image is represented as a column vector, with

each entry corresponding to an image pixel. These image vectors are then normalized

with respect to the average face. Next, the algorithm finds the eigenvectors of the

covariance matrix of normalized faces by using a speedup technique that reduces the

number of multiplications to be performed. This eigenvector matrix is then multiplied by

each of the face vectors to obtain their corresponding face space projections. Lastly, the

recognition threshold is computed by using the maximum distance between any two face

projections [2].
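
The speedup technique is not spelled out above. A common formulation, assumed here to be the one intended in [2], replaces the large eigenproblem on the covariance matrix with a much smaller one. The symbols below are notation introduced for illustration: A is the N x M matrix whose columns are the M normalized face vectors (N pixels each), and v_i, u_i, and lambda_i are eigenvectors and eigenvalues.

```latex
\begin{align*}
C &= \tfrac{1}{M} A A^{T} \in \mathbb{R}^{N \times N},
\qquad L = A^{T} A \in \mathbb{R}^{M \times M} \\
L v_i = \lambda_i v_i
\;&\Longrightarrow\;
A A^{T} (A v_i) = \lambda_i (A v_i) \\
u_i &= \frac{A v_i}{\lVert A v_i \rVert}
\quad \text{is an eigenvector of } C \text{ with eigenvalue } \lambda_i / M
\end{align*}
```

Since the number of training images M is usually far smaller than the number of pixels N, solving the M x M problem and mapping each v_i back through A requires far fewer multiplications than decomposing C directly.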

In the recognition phase, a subject face is normalized with respect to the average

face and then projected onto face space using the eigenvector matrix. Next, the Euclidean

distance is computed between this projection and all known projections. The minimum

value of these comparisons is selected and compared with the threshold calculated during

the training phase. Based on this, if the value is greater than the threshold, the face is

new. Otherwise, it is a known face [2].
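
The recognition steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's implementation: the function names `project` and `recognize`, the 4-pixel toy images, and the two hand-picked orthonormal eigenfaces are all hypothetical. The threshold follows the rule described above (maximum distance between any two known projections).

```python
import math

def project(face, average_face, eigenfaces):
    """Normalize a face vector against the average face and project it onto face space."""
    centered = [p - m for p, m in zip(face, average_face)]
    return [sum(e * c for e, c in zip(eigenface, centered)) for eigenface in eigenfaces]

def recognize(face, average_face, eigenfaces, known_projections, threshold):
    """Return (index of the closest known face, distance), or (None, distance) for a new face."""
    weights = project(face, average_face, eigenfaces)
    distances = [math.dist(weights, known) for known in known_projections]
    best = min(range(len(distances)), key=distances.__getitem__)
    if distances[best] > threshold:
        return None, distances[best]  # farther than the threshold: treated as a new face
    return best, distances[best]

# Toy data (illustrative values only): 4-pixel "images" and two orthonormal eigenfaces.
average_face = [10.0, 10.0, 10.0, 10.0]
eigenfaces = [[0.5, 0.5, -0.5, -0.5], [0.5, -0.5, 0.5, -0.5]]
known_faces = [[14.0, 12.0, 8.0, 6.0], [6.0, 12.0, 8.0, 14.0]]
known_projections = [project(f, average_face, eigenfaces) for f in known_faces]

# Threshold: maximum distance between any two known face projections.
threshold = max(math.dist(a, b) for a in known_projections for b in known_projections)

probe = [7.0, 12.0, 8.0, 13.0]  # a slightly perturbed copy of known face 1
match, distance = recognize(probe, average_face, eigenfaces, known_projections, threshold)
print(match, round(distance, 3))
```

In a real system the eigenfaces and known projections would come from the training phase rather than being written by hand.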

2.2.2 Linear Discriminant Analysis

Another popular algorithm used in face recognition is LDA. Although this

algorithm was initially developed for data classification, it has been adapted to face

recognition. Whereas PCA focuses on finding the maximum variation within a pool of

images, LDA distinguishes between the differences within an individual and those among

individuals. That is, the face space created in LDA gives higher weight to the variations

between individuals than those of the same individual. As a result, LDA is less sensitive


to lighting, pose, and expression variations [3]. The drawback is that this algorithm is

significantly more complicated than PCA.

As an input, LDA takes in a set of faces with multiple images for each individual.

These images are labeled and divided into within-classes and between-classes. The

former captures variations within the images of the same individual while the latter

captures variation among classes of individuals. LDA thus calculates the within-class

scatter matrix and the between-class scatter matrix, defined by two respective

mathematical formulas. Next, the optimal projection is chosen such that it “maximizes

the ratio of the determinant of the between-class scatter matrix of the projected samples

to the determinant of the within-class scatter matrix of the projected samples” [3]. This

ensures that the between-class variations are assigned higher weight than the within-class

variations. To prevent the within-class scatter matrix from being singular, PCA is usually

applied to the initial image set. Finally, a well-known mathematical formula is used to

determine the class to which the target face belongs. Since the weight of

within-class variation has been reduced, the results will be relatively insensitive to variations within an individual's images.
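
The two scatter matrices and the projection criterion quoted above are commonly written as follows. This is the standard Fisher formulation, given here as a sketch; the symbols (c classes, N_i images in class X_i, class mean mu_i, overall mean mu, projection matrix W) are notation introduced for illustration and are not taken from the report.

```latex
\begin{align*}
S_B &= \sum_{i=1}^{c} N_i \, (\mu_i - \mu)(\mu_i - \mu)^{T} \\
S_W &= \sum_{i=1}^{c} \sum_{x_k \in X_i} (x_k - \mu_i)(x_k - \mu_i)^{T} \\
W_{\mathrm{opt}} &= \arg\max_{W} \frac{\lvert W^{T} S_B W \rvert}{\lvert W^{T} S_W W \rvert}
\end{align*}
```

Maximizing this ratio spreads the class means apart while keeping the images of each individual close together in the projected space.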

2.2.3 Independent Component Analysis

ICA is the third mathematically-based algorithm for face recognition. Whereas

PCA depends on the “pairwise relationships between pixels in the image database,” ICA

strives to exploit “higher-order relationships among pixels” [4]. That is, PCA can only

represent second-order inter-pixel relationships, or relationships that capture the

amplitude spectrum of an image but not its phase spectrum. On the other hand, ICA

algorithms use higher order relationships between the pixels and are capable of capturing


the phase spectrum. Indeed, it is the phase spectrum that contains information which

humans use to identify faces [4].

The ICA implementation of face recognition relies on the infomax algorithm and

represents the input as an n-dimensional random vector. This random vector is then

reduced using PCA, without losing the higher order statistics. Then, the ICA algorithm

finds the covariance matrix of the result and obtains its factorized form. Finally,

whitening, rotation, and normalization are performed to obtain the independent

components that constitute the face space of the individuals. Since the higher order

relationships between pixels are used, ICA is robust in the presence of noise. Thus,

recognition is less sensitive to “lighting conditions, changes in hair, make-up, and facial

expression” [4].

2.2.4 Neural Networks

Unlike the above three algorithms, the neural networks algorithm for face

recognition is biologically inspired and based on the functionality of neurons. The

perceptron is the neural network equivalent of a neuron. Just like a neuron sums the

strengths of all its electric inputs, a perceptron performs a weighted sum on its numerical

inputs. Using these perceptrons as a basic unit, a neural network is formed for each

person in the database. The neural networks usually consist of three or more layers [8].

An input layer takes in a dimensionally reduced (using PCA) image from the database.

An output layer produces a numerical value between -1 and 1. In between these two

layers, there usually exist one or more hidden layers.

For the purposes of face recognition, one hidden layer usually provides a good

balance between complexity and accuracy. Including more than one such layer


exponentially increases the training time, while not including any results in poor

recognition rates. Once this neural network is formed for each person, it must be trained

to recognize that person. The most common training method is the back propagation

algorithm [8]. This algorithm sets the weights of the connections between neurons such

that the neural network exhibits high activity for inputs that belong to the person it

represents and low activity for others. During the recognition phase, a reduced image is

placed at the input of each of these networks, and the network with the highest numerical

output would represent the correct match.
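
The recognition step with one network per person can be sketched as below. This is an illustrative sketch only: the tanh activation is an assumption chosen because it yields outputs between -1 and 1, the weights are hand-set rather than trained by back propagation, and the names `layer`, `network_output`, `best_match`, "alice", and "bob" are hypothetical.

```python
import math

def layer(inputs, weights, biases):
    """Each perceptron takes a weighted sum of its inputs, squashed by tanh into (-1, 1)."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def network_output(reduced_image, net):
    """Forward pass through one hidden layer and a single output perceptron."""
    hidden = layer(reduced_image, net["hidden_w"], net["hidden_b"])
    return layer(hidden, net["out_w"], net["out_b"])[0]

def best_match(reduced_image, networks):
    """Return the person whose network gives the highest output for the input."""
    return max(networks, key=lambda person: network_output(reduced_image, networks[person]))

# Hand-set toy weights (assumed, not trained): two inputs, two hidden units, one output.
networks = {
    "alice": {"hidden_w": [[1.0, -1.0], [0.5, 0.5]], "hidden_b": [0.0, 0.0],
              "out_w": [[2.0, 0.0]], "out_b": [0.0]},
    "bob": {"hidden_w": [[-1.0, 1.0], [0.5, 0.5]], "hidden_b": [0.0, 0.0],
            "out_w": [[2.0, 0.0]], "out_b": [0.0]},
}

print(best_match([0.8, 0.1], networks))
```

Adding a person means training and adding another entry to `networks`, which illustrates why online training is costly compared with computing a single PCA projection.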

The main problem with neural networks is that there is no clear method to find the

initial network topologies. Since training takes a long time, experimenting with such

topologies becomes a difficult task [8]. Another issue that arises when neural networks are

used for face recognition is that of online training. Unlike PCA, where an individual may

be added by computing a projection, a neural network must be trained to recognize an

individual. This is a time consuming task not well suited for real-time applications.

2.2.5 Genetic Algorithms

Another biologically inspired algorithm that is commonly used for face

recognition is the Genetic Algorithm (GA). While neural networks mimic the function of

a neuron, genetic algorithms mimic the function of chromosomes. Like neural networks,

genetic algorithms are only well suited for the recognition of a limited number of

individuals and generally do not scale well.

To start with, the images are divided into two classes: those that belong to the

target person and those that belong to other people. Each of these images is transformed

into a binary coded truth table. Within each of the above mentioned classes, the images


are further subdivided into F-tables and T-tables, where each image occupies a row in the

table. Initially, the rows in the F-tables and T-tables do not match. However, by gradually

changing some of the F-table values to don’t-cares, some rows end up matching with

each other. Hence, the F-table acquires the ability to generalize. The evolution process

ensures that the modified F-table includes as many rows in the T-table as possible. Once

evolution is complete, the modifications that result in the best fitness are chosen for each

category (target person and unknown people) and applied to the F-table [9].

During the recognition phase, the input image is passed through the tables that

correspond to both categories. Two counters keep track of the number of pixel matches in

each of the categories and the counter with the highest value classifies the input face as

belonging to the corresponding category [9]. The obvious drawback of this algorithm is

that entire tables have to be created whenever a new individual is to be detected. As in

neural networks, the scalability of this algorithm is hindered by the exponential

complexity involved when training for multiple target faces.
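
The role of don't-care entries and the two counters can be illustrated with a small sketch. It is a simplification under stated assumptions, not the scheme of [9]: tables are lists of strings over '0', '1', and '-' (don't-care), matches are counted per table row rather than per pixel, ties go to the "other" category, and the names `row_matches` and `classify` are hypothetical.

```python
def row_matches(row, bits):
    """A table row matches when every position is equal or is a don't-care ('-')."""
    return all(r == "-" or r == b for r, b in zip(row, bits))

def classify(bits, target_table, others_table):
    """Count matching rows in each category and pick the category with the higher count."""
    target_count = sum(row_matches(row, bits) for row in target_table)
    others_count = sum(row_matches(row, bits) for row in others_table)
    return "target" if target_count > others_count else "other"

# Toy evolved tables (illustrative values only) for 4-bit binary-coded inputs.
target_table = ["1-0-", "11--"]
others_table = ["0-1-", "00--"]

print(classify("1100", target_table, others_table))
print(classify("0010", target_table, others_table))
```

The don't-care positions are what let a row match inputs it was never trained on, which is the generalization ability described above.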

Figure 2: Face Recognition Algorithms

[Diagram: the five face recognition algorithms reviewed: Principal Component Analysis, Linear Discriminant Analysis, Independent Component Analysis, Neural Networks, and Genetic Algorithms.]


2.3 FPGA Implementation of Face Recognition

2.3.1 FPGA Implementation using PCA

One instance of a PCA implementation of a face detection/recognition system on

an FPGA board was done by H. Ando, N. Fuchigami, M. Sasaki, and A. Iwata [5]. The

first stages of implementation were tested on prototype software running on a PC. This

software reads in an RGB image, reduced in size to 100 x 100 pixels. Face detection is

then performed by detecting skin color, and the PCA algorithm is applied to the face area

detected. The prototype software was designed on a XEON-based multi-CPU system

using Visual C++ as the programming language. The input to the system was a USB

camera device connected to the PC [5].

The next step was to implement the face recognition system on an FPGA. As

such, custom hardware blocks were designed in order to carry out the functionality of the

PCA algorithm discussed previously. Specifically, the database of images was

preprocessed, storing the average face and eigenvectors on the FPGA board memory. For

face recognition, an input image vector is fed into the subtraction unit for normalization

with respect to the average face. The processed image is then passed to a

multiplier/accumulator unit that reads the eigenvectors from memory and performs the

projection required by the algorithm. The next stage involves passing this Eigenspace

projection into the matching circuit, which contains an evaluation block that reads the

Eigenspace projections of known faces from memory and performs the necessary

Euclidean distance calculations. Finally, a decision unit reads in these distances and

makes the face recognition decision based on the requirements of the algorithm.


The face recognition system achieved a recognition time of 212 µs. The image

size was 20 by 20 pixels and the FPGA board used was a Xilinx Virtex-II Pro (XC2VP7)

clocked at 100 MHz. The system made use of approximately 18% of the gates available

on the FPGA. At a more detailed level, the bit width was 8 bits for the input face, 7 bits

for the Eigenface and 18 bits for the Eigenspace [5].
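
To put the reported figures in perspective, the recognition time can be converted into clock cycles. This is a back-of-the-envelope calculation that assumes the whole 212 µs is spent at the 100 MHz clock; the per-pixel breakdown is an illustration and is not a figure reported in [5].

```latex
\begin{align*}
212\ \mu\text{s} \times 100\ \text{MHz} &= 21{,}200\ \text{clock cycles} \\
\frac{21{,}200\ \text{cycles}}{20 \times 20\ \text{pixels}} &= 53\ \text{cycles per pixel}
\end{align*}
```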

2.3.2 FPGA Implementation Using Composite/Modular PCA

A second type of FPGA implementation relies on a variant of the PCA

algorithm, called Composite or Modular PCA. The standard PCA algorithm “considers

the global information of each face image and represents them with a set of weights” [6].

It is because of this fact that the PCA algorithm does not function well when the images

of the test subjects vary in expression, illumination, and pose. Under these conditions, the

weights calculated for the subject will vary more than those stored in the training

database. The composite PCA algorithm divides the face into smaller regions and

computes the weight functions based on those regions. As such, in conditions where pose,

light, or expression varies, only specific regions of the face will vary and therefore only

specific weight functions will change [6].
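
The region split that Composite PCA relies on can be sketched as follows. This is a minimal illustration, not the implementation from [6]: the function name `split_into_regions`, the equal-size non-overlapping regions, and the row-major ordering are assumptions. Each sub-image would then be treated like a small face image with its own set of weights.

```python
def split_into_regions(image, region_rows, region_cols):
    """Split a 2-D image (a list of pixel rows) into non-overlapping sub-images, row-major."""
    height, width = len(image), len(image[0])
    if height % region_rows or width % region_cols:
        raise ValueError("image dimensions must be multiples of the region size")
    regions = []
    for top in range(0, height, region_rows):
        for left in range(0, width, region_cols):
            regions.append([row[left:left + region_cols]
                            for row in image[top:top + region_rows]])
    return regions

# Toy 4 x 4 "image" with pixel values 0..15, split into four 2 x 2 regions.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
regions = split_into_regions(image, 2, 2)
print(len(regions), regions[0])
```

A change in lighting or expression that affects only one corner of the face then perturbs the weights of one region and leaves the others intact.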

One possible FPGA implementation of the Composite PCA algorithm utilizes the

inherent parallel nature involved with pixel calculations. The first hardware block

contains four separate processing elements that are used in the recognition phase. A

processing element consists of 20 separate processing lanes, each lane corresponding to a

single eigenvector from the training phase. This parallel processing unit is used to

compute the Eigenspace projections required by the PCA algorithm prior to performing

the distance calculations. The second hardware block is a classification module that


serves two main functions. The first function is that of an accumulator that reads results

from the processing elements and accumulates the results in registers (one register for

each of the twenty processing lanes). Secondly, the classification block finds the face

with the minimum distance to the face under test and stores its index [7].

The above-mentioned system was implemented on an Altera FPGA, using the Quartus design software, and

clocked at 91 MHz. It was able to recognize a face from a database of 1000 images in 11

milliseconds. The performance of this implementation can be attributed to the parallel

hardware blocks used in performing the necessary calculations for the algorithm. Further

to this, the design can be scaled for larger databases by simply adding more processing

elements in parallel. This will yield an even higher throughput of data and improved

performance for larger sized databases [7].

Another FPGA implementation strategy that yields good performance

with Composite PCA relies on two processing blocks, 16 pairs of which are

connected in parallel for high throughput calculations. The first block reads in the

eigenvectors and the test image and performs the necessary multiplications. This result is

then passed to the second processing block, which computes the distance using a reduced

formula designed to simplify the hardware implementation of distance calculations. All

16 blocks are connected to a distance grouper and a comparator, used to eliminate all

redundant distance calculations and find the smallest distance, respectively [6].

The above hardware design was implemented, using Altera Quartus II, on an FPGA

(clocked at 100 MHz) and was able to perform face recognition on a database of 10 faces

in 3.88 milliseconds. A total of 7,820 logic elements were used, 2,348 of which were

flip-flops. Again, performance can be attributed to the highly parallel nature of the hardware

design and the composite algorithm used.

2.3.3 FPGA Implementation using Artificial Neural Networks

Neural network algorithms for face recognition have been applied extensively on

FPGA boards. Li and Areibi undertook such an implementation using both a soft-core

processor and a hardware module, referred to as a co-design approach. The face database

used consisted of 20 individuals, each having a set of 32 images that vary in expression

and direction. The images were grayscale and 8-bit, having dimensions of 120 by 128

pixels. The neural network used consisted of three layers, with the pixel intensity (0

to 255) used as the input.

The face recognition system was implemented in two phases, a training phase and

a testing phase. In the training phase, image data is sent to a Target Generator, which

encodes the images and feeds them to the Learning System. The output of the Learning

System is then compared to the output of the Target Generator, and the difference is “fed

back to the Learning System to further decrease error.” These systems jointly implement

the forward, backward, and updating calculations typical of neural network training. The

testing phase of the implementation simply consists of an Image Sender, a trained

Learning System, and an Output Interpreter. The Image Sender sends the image to the

Learning System, whose output is passed through the Output Interpreter. This interpreter

extracts data from the Learning System to classify the individual [8].

For the implementation, a Xilinx Virtex-II XC2V2000 FPGA board was used.

The feed-forward and backward calculations were implemented on a MicroBlaze core,

while the updating calculations were delegated to a Hardware Update Module (HUM).

Both the C program and the training images were stored on BRAM, and the system

included peripherals such as OPB UART and the OPB GPIO bus. In order to increase the

speed of the neuron updating process, the HUM contains 4 parallel update units that are

capable of updating 4 neurons at a time. A finite state machine controls the floating-point

multipliers and the output is stored on 4 local registers. The MicroBlaze then reads off

these registers continuously until the update process is complete [8].

The results of the experiment showed that the HUM occupied 42% of the FPGA

and the MicroBlaze occupied 7%. Feed-forward and backward computations took around

20 ms to complete, MicroBlaze software updating took 173 ms, and HUM hardware

updating took 1.4 ms. The speed-up in updating was over 10x while the speed-up over a

software implementation was around 1.7x. This demonstrates that the algorithm contains

inherent parallelism which cannot be exploited with a general-purpose processor [8].

2.3.4 FPGA Implementation using Genetic Algorithm

Yasunaga, Nakamura, and Yoshihara implemented a genetic algorithm on an

FPGA chip with the intention of creating a personal identification system. After applying

the chromosome evolution technique to the F-tables of the target individual and other

people, the two tables were synthesized on an AND gate plane. That is, for each category,

the input image is fed into an AND gate grid with both connected and disconnected

nodes, representing bits and don’t-cares, respectively. The output of each AND gate

category is then fed into a counter unit which is designed to keep track of the number of

activated AND gates. Finally, a maximum detector unit selects the counter output with

the highest value and classifies the face as belonging either to the target person or to an

unknown person. Since the input image is fed to all the AND gates simultaneously, the

matching process is carried out in parallel [9].

During the training process, 8-bit images were used to represent the faces. To

implement the chromosome evolution technique, the byte that represents each pixel was

manipulated at 8 different levels. The first level replaces the least significant bit with a

don’t-care, and each level gradually adds a don’t-care to the next least significant bit. The

last level consists of all eight bits replaced by don’t-cares. Moreover, to test the

implementation, a database of 100 images was used, representing 5 individuals in 20

different poses. The dimensions of the original images were 240 by 240 pixels, but they

were preprocessed and reduced to 8 by 8 pixels. Also, the F-tables and T-tables were

each assigned 10 of the 20 poses for the individuals [9].
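The don't-care levels described above can be pictured as bit masks. The following C sketch is only an illustration and is not taken from [9]; the function names `level_mask` and `pixel_matches` are hypothetical, and the sketch assumes that level k (1 to 8) treats the k least significant bits of a pixel byte as don't-cares.

```c
#include <assert.h>
#include <stdint.h>

/* Mask that keeps only the bits still compared at a given level.
   Assumption: level k (1..8) ignores the k least significant bits. */
uint8_t level_mask(int level)
{
    assert(level >= 1 && level <= 8);
    return (uint8_t)(0xFFu << level);
}

/* A stored pattern matches an input pixel when every bit that is
   not a don't-care agrees. This mirrors one AND gate input group. */
int pixel_matches(uint8_t pattern, uint8_t pixel, int level)
{
    uint8_t mask = level_mask(level);
    return (pattern & mask) == (pixel & mask);
}
```

At level 8 the mask is zero, so every pixel value matches, which corresponds to the last level in which all eight bits are don't-cares.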

To synthesize the circuits, a logic synthesizer was employed. The average number

of gates required for each person, including the counter and maximum detector units,

amounted to 1,334. The presence of don’t-cares allowed the number of gates to be

collapsed by less than 1/10. Using an FPGA board with a Xilinx XC4010 chip, the

identification accuracy of the system was 97.2% and identification took place within 1

µs. This is due to the intrinsic hardware parallelism found in the AND gate planes.

Furthermore, fault tolerance tests were made on the system. Random stuck-at-0 or stuck-at-1 faults

were injected at the outputs of the AND gates, and an accuracy upwards of 90% was

maintained even with a stuck-at faulty gate ratio of 18%. Additionally, the system

exhibited graceful degradation as more stuck-at faults were introduced [9].

2.3.5 FPGA implementation using Evolutionary Reconfigurable Architecture

A final implementation strategy for face recognition algorithms on an FPGA

involves using an evolutionary reconfigurable architecture. This approach is used in order

to enhance the functionality of face recognition in situations where the environment

varies. Under this architecture, there are three main stages to face recognition: a

reconfigurable filter module (RFM), a reconfigurable feature space module (RFSM), and

an evolutionary module (EM) [10].

The RFM is first used in order to enhance the quality of the image. This module

consists of four different filters that operate on the image. The first filter is a median filter

that is used in order to remove impulse noise in the image. Next, a histogram equalization

filter is used to improve the contrast of an image. In the third filtering stage, a

homomorphic filter is used in order to improve the reflectance effect of an image and

reduce the effect of lighting. It does so by “reducing brightness and emphasizing contrast

in a frequency domain” [10]. In the final filtering stage, an illumination compensation

filter is used to improve the brightness of the image. The RFSM module uses a Gabor

wavelet in order to reduce redundancy and noise in the image. The EM module finally

uses a survival-of-the-fittest concept to search through the known faces and perform

face recognition using a genetic algorithm [10].

In hardware, only the RFM module and the EM module were implemented on an

FPGA while the RFSM module was implemented on a host computer. The RFM was

implemented on an RC1000-PP board, based on the Xilinx Virtex E-2000 FPGA. The

filters were coded using C and then moved to hardware, where they were implemented in

parallel in order to improve processing time. This was made possible by the existence of

4 separate SRAM banks that can be accessed simultaneously by the FPGA. As such, 4

different images could be processed in parallel. The EM module was implemented by a

“hybrid parallel genetic algorithm processor” [10].

For testing, 386 images of 39 people were stored in the database. Each image

comprised 128 by 128 grayscale pixels. It took approximately 1,000 iterations on the

images in order for the optimal filter combination to be obtained. After evolution, face

recognition rates increased by 63.4% when using images with poor illumination and

noise. When noise was added to the image, the rate increased by 36.5%. These figures

demonstrate the robustness of the system to changes [10].

2.4 Issues with Face Recognition

Although face recognition systems have advanced remarkably over the past few

years, there still exist some major obstacles that need to be overcome. In general, still-

image face recognition accuracy degrades as image variations increase. The main

image variations are illumination levels, pose variation, and changes in facial expression.

Moreover, the problem of face detection, or the extraction of a face from an image, is a

required first step for face recognition.

The illumination problem occurs in an uncontrolled environment where “the same

face appears different due to a change in lighting” [1]. The problem is emphasized when

the variations in lighting are greater than the variations between people. One solution to

this problem involves preprocessing the images and introducing contrast normalization

and compensation. Another approach attempts to reconstruct all possible lighting

variations from a selection of training images for each individual. A third method relies

on creating a separate linear illumination subspace. This is similar to the space created to

capture face variations, except that it captures lighting variations [1].

Pose variation also impairs the face recognition process. Pose variation becomes

especially pronounced when it is combined with illumination changes. One solution to

the pose variation problem involves obtaining images with multiple views of an

individual. In this case, multiple poses are available during both training and recognition.

During the recognition process, each pose is aligned with a similar pose in the database to

achieve correct classification. The obvious drawback is that multiple views of an

individual are not always available. A more popular solution involves using multiple

poses during training but only a single pose during recognition. One such implementation

creates an Eigenspace for each pose to achieve pose-invariant recognition [1].

The problem of facial expression variation is also common in the literature. If

only one image of an individual is available, recognition accuracy drops considerably.

However, if many images are available, algorithms like PCA can absorb these changes. It

is important to note that during expression changes, parts of the face remain largely

unchanged. As a result, algorithms that segment the face are more robust to these

variations [1]. Many databases available today contain training images with multiple

expressions, and face recognition systems have been capable of making accurate image

classifications despite expression variations.

Lastly, it is important to discuss face detection in the context of the face

recognition problem. The need for face detection arises when one or more faces must be

extracted from an image. Furthermore, face detection and extraction is essential to reduce

external factors that might hinder the recognition process. One common method of face

detection relies on the use of Haar classifiers. These classifiers sweep through the image

and apply several filters to detect the presence of a face. Another method, mentioned

earlier, relies on skin color to detect a face.

As such, face recognition is a growing field with potential applications in security,

entertainment, and personal identification. The recognition algorithms can be grouped

into mathematical/statistical (PCA, ICA, LDA) algorithms and biological (NN, GA)

algorithms. Many of these algorithms have been implemented by several researchers on

FPGA boards with high recognition rates and recognition times within the margin of real-

time applications. However, long training times and the scalability of face recognition

have been recurring concerns in all of these implementations. Finally, common face

recognition problems include illumination changes, pose variations, and the issue of face

detection and extraction.

3. DESIGN

3.1 System Specifications

3.1.1 Algorithm

Having researched the various algorithms for face recognition, we found that the

two most popular hardware implementations are PCA and Neural Networks. As stated

before, the advantages of PCA are its robustness, parallelizability, and relative simplicity.

Its disadvantages are its sensitivity to lighting and pose variations. On the other hand, the

Neural Networks approach provides strong accuracy but limits the number of individuals

that can be included in the database due to the long training periods involved.

We have chosen to adopt the PCA algorithm for face recognition for several

reasons. Firstly, the environment that will be used to obtain the individual face images is

controlled and hence lighting and pose variation effects can be minimized. Secondly,

since a face can be subdivided into multiple regions, pattern recognition can be applied in

parallel, resulting in faster face recognition. Lastly, PCA allows us to quickly add

individuals to the face database, making it better suited for real-time applications.

3.1.2 Inputs and Outputs

The inputs of our system consist of bit streams representing the image to be

analyzed, an average face, an Eigenvector matrix, and a set of projections. The image to

be analyzed, as well as the average face, will consist of 150 × 125 = 18,750 pixels, each

being 8-bit grayscale (2^8 = 256 shades of gray, ranging from 0 to 255). These figures

were chosen because they provide a good balance between size and accuracy.

Additionally, these values were used successfully by many research groups. In a database

with M faces, the Eigenvector matrix will be of size 18,750 by M. Finally, the set of

projections will consist of M vectors, each having M values.

The outputs of our system will consist of the face ID with the closest match, as

well as a value representing how close this match is (a distance value). Furthermore, the

system outputs execution times to gauge the speed of the system, as well as each of the

functions involved in the recognition stage. All this information will be displayed on the

HyperTerminal of the PC.

The inputs of our system will be transmitted to the FPGA through an Ethernet

link. The choice to use Ethernet over RS232 was motivated by the difference in transfer

rates. While serial connections operate at less than 20,000 baud (although some systems

currently exceed this limit), Ethernet connections operate at 10 Mbits/second. Naturally,

this will imply faster system initialization times. The output of our system will be sent to

the HyperTerminal over a serial interface, primarily because there is no need for high

speeds in order to display character sequences on the PC.

The functionality of the system will closely follow the Principal Component Analysis

(PCA) algorithm for face recognition. This involves normalizing the image, projecting it

onto face space, and computing the Euclidean distance to all M projections. The

projection phase will be implemented using the on-board multiplier, whereas

normalization and comparison will be implemented on the MicroBlaze core.

3.1.3 Timing Constraints

The time constraints of our system are bounded by the time constraints required

by real-time face recognition. That is, in applications where the recognition of multiple

faces is required, the process must not take more than 2 seconds for each person.

However, this figure includes the overhead of obtaining the image, performing face

recognition, and displaying the results. Normally, the face recognition process is on the

order of milliseconds, as mentioned in the literature review.

3.2 System Description

From a top-level perspective, the face recognition system consists of an FPGA

end and a PC end that communicate via an Ethernet interface. On the PC end, a C

program runs the training stage of the algorithm and produces binary data files, which are

relayed to the C# Ethernet program. This program encapsulates the data files and sends

them to the FPGA over Ethernet.

Figure 3: System Block Diagram

On the FPGA end, the data files are received, parsed, and stored in DDR memory.

Then, the recognition stage runs on the MicroBlaze core with the assistance of on-board

multipliers and the results are displayed on the HyperTerminal through a serial interface.

3.3 Hardware FPGA Components

3.3.1 V2MB-1000 Overview

The board we will be using to implement our system is the Memec V2MB-1000

Development Board. The V2 is derived from the name of the Xilinx FPGA on the board

(the Virtex-II XC2V1000), the MB from the MicroBlaze processor core, and the 1000

from the fact that the FPGA consists of 1,000,000 programmable logic gates. The Virtex-

II XC2V1000 FPGA is designed for high performance applications in the fields of

networking, telecommunications and digital signal processing, among others.

In addition, it supports various I/O standards including Low Voltage Differential

Signaling (LVDS), Peripheral Component Interconnect (PCI) and Double Data Rate (DDR).

LVDS is a high-speed type of signaling that uses twisted-pair copper cables. PCI is a bus

standard that allows for high-speed communication between peripheral devices and a

central processor. DDR allows for quick memory access by means of transferring data on

both the rising and falling edge of the clock. The XC2V1000 FPGA contains 90KB of

Block Select RAM (BRAM) memory, which can be used for fast-memory access

operations. In addition the V2MB-1000 board contains 32MB of external DDR memory

called the ZDT memory.

The board additionally contains two 7-segment LED displays as well as a single

LED display. It also has four push buttons that can generate an active low signal and

eight DIP switches that can generate both an active high and an active low signal.

Moreover, the V2MB-1000 board contains an RS232 port that allows for serial

communications, and a JTAG port, which is connected to the parallel port of a PC so that

bit stream configurations can be downloaded to the FPGA.

Figure 4: V2MB1000 Board

The board we are using comes with the P160 Communications Module-2

expansion. This module provides us with several different functions, but our use of the

board is restricted to Ethernet. This function consists of a Broadcom chip and an RJ45

connector to which the Ethernet cable is hooked up.

Figure 5: P160 Communications Module

3.3.2 MicroBlaze Processor

The MicroBlaze processor is at the heart of the face recognition system. It runs

the main algorithm, communicates with peripherals, and delegates computationally

intensive operations to the custom multipliers/accumulators. The processor core

interfaces with an OPB interface, a BRAM Interface/Controller, an External Memory controller,

an Ethernet driver, and the custom multipliers.

3.3.3 OPB Interface

The OPB interface is used for interfacing with general purpose I/O devices,

interrupts, and timers. I/O devices might be used as indicators or as switches to enhance

the functionality of the system. One example of this might include blinking an LED to

indicate the status of certain operations. The inputs of this unit are supplied by the

MicroBlaze core as well as by input devices or interrupts. The outputs go to the

MicroBlaze processor and to any output indicators that may be used.

3.3.4 BRAM Controller

The BRAM interface controller is used by the MicroBlaze processor to implement

instruction and data memory banks. Since the MicroBlaze writes data to and reads

instructions and data from the BRAM memory, the BRAM controller is essential to

provide the interface needed.

3.3.5 External Memory Controller

The External Memory Controller is used to interface with external memory on the

V2MB1000 board. It provides a means for the MicroBlaze processor to exchange data

with the ZDT external memory.

3.3.6 Ethernet MAC Controller/Driver

The Ethernet MAC controller is the peripheral lying on the P160 Communications

Module that allows communication with the Ethernet port. This module contains an

Ethernet port and a Broadcom chip that converts the stream of bits into bytes that are

accessible by the MicroBlaze. It contains 2 FIFO queues, one that will be used to store

the incoming packets, and the other used to queue outgoing packets. These buffers can

each occupy up to 32KB of the BRAM. All matrices will be sent to the board through the

Ethernet port.

The Ethernet MAC driver provides us with Xilinx C functions used to initialize

the EMAC and send and receive frames. We have familiarized ourselves with these

functions for later use. An example of such a function is the XEmac_FifoRecv function

that receives a frame from the Ethernet port and stores it in a specified memory location

on the board. In addition, we were able to add the EMAC core to the system using the

corresponding pin constraints and correct signal matching.

3.3.7 UARTLite Controller/Driver

The UARTLite controller provides an interface to the serial port and RS232 cable.

Serial communications in our project will involve sending text information to the

HyperTerminal of the connected PC concerning the status of operations on the

V2MB1000 board. The UARTLite driver contains several useful functions such as the

XUartLite_SendByte function that sends a byte of data (for example a character).

3.3.8 On-board Hardware Multipliers

Utilizing the hardware multipliers requires some knowledge of projecting an

image to face space. Such a projection requires multiplying an M × 18,750 matrix with an

18,750 × 1 matrix. Again, M stands for the number of faces in the database. This implies

that there will be a total of M × 18,750 multiplication operations.

Our design uses the on-board dedicated hardware integer multipliers to implement

this matrix multiplication. We found that the multipliers are 18 bit × 18 bit, and each is

associated with an 18 Kbit block of BRAM. Since there are 40 such blocks, up to 40 on-

board multipliers can be used. Furthermore, software techniques are used in order to

accumulate the results within each stage of matrix multiplication. The rationale behind

this decision comes from the fact that the overhead incurred by hardware accumulation

results in longer execution time for the algorithm, as it involves writing to and reading

back from DDR memory. This concept will be discussed further in the implementation

section of this report.

3.4 Software FPGA Components

The software component on the FPGA board consists of the C code that runs on

the MicroBlaze. This code implements the face recognition algorithm and communicates

with the Xilinx and custom IP peripherals. The software component consists of several

phases, namely receiving a face vector, normalizing it, projecting it onto face space,

computing the distance to known faces, and finding the minimum distance.

To receive a 150 × 125 pixel face image vector, the code on the MicroBlaze must

receive a stream of 150 × 125 × 8 bits (or 18,750 bytes) sent by the application through

the Ethernet cable. As a result, the code must be able to communicate with the Ethernet

driver, which interfaces to the Ethernet port. Once the face is received, it is stored in

memory and normalized. Normalization consists of subtracting the vector of the average

face from the vector of the face in question. Thus, it involves the subtraction of two

vectors with 18,750 8-bit entries, both of which reside in DDR memory.

After the face is normalized, it must be projected onto face space. This involves

multiplying the Eigenvector matrix (M × 18,750) with the normalized face vector (18,750

× 1). To speed up processing, this operation is sent to the multipliers. The result of the

multipliers, a 32-bit vector of size M, is sent back to the MicroBlaze. During this process,

the elements of the Eigenvectors matrix are fetched sequentially from DDR memory and

the resulting projection is stored in BRAM for fast access.
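The projection step can be sketched in C as follows. This is a minimal illustration rather than the project's actual code: the function name `project_face`, the row-major layout of the Eigenvector matrix, and the use of `int32_t` for every operand are assumptions, and the sketch assumes the values are scaled so that each accumulated sum fits in 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Project a normalized face onto face space.
   eig:   M eigenvectors of length n, stored row by row (M x n).
   face:  normalized face vector of length n.
   omega: output projection of length M.
   Assumption: products and sums do not overflow 32 bits. */
void project_face(const int32_t *eig, const int32_t *face,
                  size_t M, size_t n, int32_t *omega)
{
    assert(eig != NULL && face != NULL && omega != NULL);
    for (size_t i = 0; i < M; ++i) {
        int32_t acc = 0;                      /* accumulated in software */
        for (size_t k = 0; k < n; ++k)
            acc += eig[i * n + k] * face[k];  /* one multiply per entry */
        omega[i] = acc;
    }
}
```

For the dimensions used in this report the call would take M = 51 and n = 18,750, giving M × n multiply operations in total. In the actual system each multiplication is delegated to a hardware multiplier while the accumulation stays in software.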

Next, the face projection must be compared with every projection on the

projection database. This requires finding the Euclidean distance between the projection

of the current face and the projection of each of the faces. In mathematical terms, the

magnitude of the difference between each pair of size-M vectors must be computed.

Although the operation is mainly subtraction, we did not design a custom hardware

comparator since we realized that calculating the distance to all the face projections is not

a bottleneck. For this operation, the face projection is stored in BRAM whereas the list of

projections is found in DDR memory, where they were dumped during the initialization

phase of the system. Lastly, the code on the MicroBlaze must then transmit the above

results to the HyperTerminal through the serial interface for the user to see. As mentioned

earlier, these results include the projection distances and the ID of the recognized face.

3.5 PC Application Components

The application of the system mainly deals with executing the training stage and

initializing the FPGA. Specifically, a C program runs the training stage of the PCA

algorithm and produces an average face, an Eigenvectors matrix, and a projections

matrix. These three portions of data, along with the test face to be recognized, are written

to binary files in the same C application.

At this point, a C# application takes over. This program reads each of the four

binary files produced by the previous program and encapsulates the data into Ethernet

frames. It then sends these Ethernet frames to the FPGA for initialization and displays

status messages confirming that the sending operation took place.

3.6 Memory Management

We used two types of memory on the V2MB1000: the BRAM memory and the

ZDT External Memory. BRAM memory contains 40 blocks of size 18Kbits each for a

total of 720Kbits or approximately 90KB. It is located on the Virtex-II FPGA itself, and

thus has the fastest access time compared to all other types of memory on the board.

External memory is essentially a 16M × 16 DDR memory that provides us with 32MB of

storage space. This memory lies on the board external to the FPGA, and thus has a longer

access time. Ideally, we would have opted to store all data in BRAM memory, but due to

the constraint in size, we are forced to store the data initially in External memory. Below

are the memory requirements assuming a 150 × 125 pixel image:

Target face:

Number of entries = 150 × 125

Bits per entry = 32 (since of type Xuint32)

Total memory for target face = 150 × 125 × 32 = 600,000 bits = 75 KB

Average face:

Number of entries = 150 × 125

Bits per entry = 32

Total memory for average face = 150 × 125 × 32 = 600,000 bits = 75 KB

Projections Matrix:

Number of entries = 51 × 51 (assuming database contains 51 individuals)

Bits per entry = 32

Total memory for projections = 51 × 51 × 32 = 83,232 bits ≈ 10 KB

Eigenvector matrix:

Number of entries = 51 × 150 × 125 (assuming database contains 51 individuals)

Bits per entry = 32

Total memory for Eigenvectors = 51 × 150 × 125 × 32 = 30,600,000 bits = 3,825 KB

We therefore conclude that a total of 75 + 75 + 10 + 3,825 = 3,985 KB

(approximately 4 MB) of external memory must be used. As data enters the board

through the Ethernet port, it is stored in consecutive external memory locations.

Subsequently, these elements are fetched from memory back to BRAM as required by the

PCA algorithm. Although this incurs an overhead of data transfer, it is a price that must

inevitably be paid in order to ensure the functionality of the system.
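The memory budget above can be checked with a few lines of C. This is a sketch under the same assumptions as the text (every value stored as 32 bits, 1 KB taken as 1,000 bytes); the function names `bytes_needed` and `total_bytes` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Storage in bytes for `entries` values of `bits_per_entry` bits each. */
uint32_t bytes_needed(uint32_t entries, uint32_t bits_per_entry)
{
    assert(bits_per_entry % 8 == 0);
    return entries * (bits_per_entry / 8);
}

/* Total external memory in bytes for `faces` individuals and images
   of `pixels` pixels, with every value stored as a 32-bit word. */
uint32_t total_bytes(uint32_t faces, uint32_t pixels)
{
    uint32_t target      = bytes_needed(pixels, 32);
    uint32_t average     = bytes_needed(pixels, 32);
    uint32_t projections = bytes_needed(faces * faces, 32);
    uint32_t eigvecs     = bytes_needed(faces * pixels, 32);
    return target + average + projections + eigvecs;
}
```

Calling `total_bytes(51, 150 * 125)` gives 3,985,404 bytes, which agrees with the 3,985 KB figure once the projections matrix (10,404 bytes) is rounded to 10 KB. The Eigenvector matrix dominates the total, so the requirement grows almost linearly with the number of individuals.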

3.7 PCA Algorithm

3.7.1 Training Phase

1. Each face in the database is represented as a column in a matrix A. The values in

each of these columns represent the pixels of the image and range from 0 to 255 for

an 8-bit grayscale image:

\[
A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix}
\]

2. Next, the matrix is normalized by subtracting from each column a column that

represents the average face (the mean of all the faces):

\[
A = \begin{bmatrix} a_{11} - m_1 & \cdots & a_{1n} - m_1 \\ \vdots & \ddots & \vdots \\ a_{m1} - m_m & \cdots & a_{mn} - m_m \end{bmatrix}
\]

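3. We then want to compute the covariance matrix of A, which is A × A^T. Because this

m × m matrix has one row and column per pixel, the operation is very computationally

intensive, so we use a shortcut and work with the much smaller n × n matrix:

L = A^T × A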

4. To obtain U, the matrix of covariance eigenvectors, we find V, the matrix of

eigenvectors of L, and calculate:

U = A × V.

5. Each face is then projected to face space:

Ω = U^T × A

6. We next compute the threshold value for comparison:

θ = ½ × max || Ω_i – Ω_j ||, for i, j = 1…n.
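The threshold in step 6 can be sketched in C as shown below. This is an illustrative sketch, not the project's library code: the function names are hypothetical, the projections are assumed to be stored row by row, and the sketch returns the squared threshold θ² so that no square root is needed (squared distances can then be compared against θ² directly).

```c
#include <assert.h>
#include <stddef.h>

/* Squared Euclidean distance between two vectors of length len. */
double squared_distance(const float *a, const float *b, size_t len)
{
    double sum = 0.0;
    for (size_t k = 0; k < len; ++k) {
        double d = (double)a[k] - (double)b[k];
        sum += d * d;
    }
    return sum;
}

/* Returns theta^2, where theta = 1/2 * max ||Omega_i - Omega_j||.
   proj holds n projections of length len, one per row (row-major). */
double threshold_squared(const float *proj, size_t n, size_t len)
{
    assert(proj != NULL);
    double max_sq = 0.0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = i + 1; j < n; ++j) {
            double d = squared_distance(proj + i * len, proj + j * len, len);
            if (d > max_sq)
                max_sq = d;
        }
    return 0.25 * max_sq;   /* (max / 2)^2 = max^2 / 4 */
}
```

Only pairs with j > i are visited because the distance is symmetric and the distance of a projection to itself is zero.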

3.7.2 Recognition Phase

1. We represent the target face as a column vector:

\[
r = \begin{bmatrix} r_1 \\ \vdots \\ r_m \end{bmatrix}
\]

2. The target face is then normalized:

\[
r = \begin{bmatrix} r_1 - m_1 \\ \vdots \\ r_m - m_m \end{bmatrix}
\]

3. Next, the face is projected to face space:

Ω = U^T × r

4. We then find the Euclidean distance between the target projection and each of the

projections in the database:

ε^2 = || Ω – Ω_i ||^2, for i = 1…n

5. Finally, we decide if the face is known or not by selecting the smallest distance and

comparing it to the threshold θ. If it is greater, then the face is new. Otherwise, the

face is a match.
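The recognition steps above can be put together in a short C sketch. It is a simplified floating-point illustration under stated assumptions, not the code that runs on the MicroBlaze: the function name `recognize`, the argument layout, and the row-major storage of U^T and of the projection database are all hypothetical, and the threshold is passed in squared form so that squared distances can be compared without a square root.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the closest database face, or -1 if the face is new.
   r:     target face of length m (overwritten with r - mean).
   U:     n eigenvectors of length m, one per row (that is, U^T).
   db:    n stored projections of length n, one per row.
   omega: caller-provided workspace of length n. */
int recognize(float *r, const float *mean, const float *U, const float *db,
              float *omega, size_t m, size_t n, double theta_sq)
{
    assert(r && mean && U && db && omega && n > 0);

    for (size_t k = 0; k < m; ++k)          /* step 2: normalize */
        r[k] -= mean[k];

    for (size_t i = 0; i < n; ++i) {        /* step 3: project */
        float acc = 0.0f;
        for (size_t k = 0; k < m; ++k)
            acc += U[i * m + k] * r[k];
        omega[i] = acc;
    }

    size_t best = 0;
    double best_eps2 = -1.0;                /* negative means "not set yet" */
    for (size_t i = 0; i < n; ++i) {        /* step 4: squared distances */
        double eps2 = 0.0;
        for (size_t j = 0; j < n; ++j) {
            double d = (double)omega[j] - (double)db[i * n + j];
            eps2 += d * d;
        }
        if (best_eps2 < 0.0 || eps2 < best_eps2) {
            best_eps2 = eps2;
            best = i;
        }
    }
    return (best_eps2 > theta_sq) ? -1 : (int)best;   /* step 5: decide */
}
```

A return value of -1 corresponds to a new face (smallest distance greater than θ), while any other value is the index of the matching database entry.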

3.8 Project Budget

Below is a list of hardware components that we will be using to implement our

face recognition system. The prices reflect an approximation of the current market price

of the components.

Ethernet CAT5 Cable (10 ft): $5

Figure 6: Ethernet CAT5 Cable

RS-232 Serial Cable: $5

Figure 7: RS-232 Serial Cable

V2MB-1000 Development Kit (+ P160 Comm. Module-2) with ISE Foundation and

JTAG cable: $2995.00 (This product was provided by the American University of

Beirut.)

4. IMPLEMENTATION

4.1 Modeling Algorithm in MATLAB

4.1.1 Implementation Details

A free database of faces, non-faces, and new faces was used as a means to test the

implementation developed. The database of faces consists of 51 images each having

dimensions of 150 × 125 pixels represented as row vectors. Each pixel contains an 8-bit

grayscale value representing 1 of 256 possible shades of gray. The MATLAB

implementation followed the algorithm details outlined above and used built-in

MATLAB functions to achieve functionality. Some of these functions are outlined in the

table below.

MATLAB Function    Description

mean (A)           Calculates the mean of matrix A

A'                 Calculates the transpose of matrix A

eigs (A, k)        Determines the first k eigenvectors and eigenvalues of A

dist (A, B)        Determines the Euclidean distance between matrices A and B

Table 1: MATLAB Functions Used

In addition, we used functions to visualize the Eigenfaces as well as the average

of the faces in the database. Below is a sample of the images produced.

Figure 8: Sample Face

Figure 9: Average Face

Figure 10: Eigenfaces

4.1.2 Implementation Results

When trying to recognize a face image that already exists in the database, the

projection distance calculated for that specific image is zero. When an image of a known

person is used, but that image is not the exact one in the database, the distance turns out

to be the smallest of all the computed distances. This is consistent with the algorithm outlined

above. Specifically, the following table illustrates a sample of the distance calculations

when using the third face from the database as the test face for recognition. We can also

see that the next closest distance calculations correspond to the other images of the same

person in the database.

Face Index    Person #    Distance to Face Index 3

1             1           2.8476 × 10^7

2             1           2.7966 × 10^7

3             2           0.0000 × 10^7

4             2           0.1591 × 10^7

5             2           0.4335 × 10^7

6             3           1.9659 × 10^7

7             3           2.0871 × 10^7

8             3           2.1734 × 10^7

Table 2: MATLAB Implementation Results

4.1.3 Exporting Code to C

Since the MicroBlaze core only has a C compiler, the MATLAB code above has

to be exported to C. In order to achieve this, we used the “mcc” function in MATLAB,

which exports m-files to C-files. However, we noticed that the resulting code was 1.13

MB, which is too large to fit on the BRAM of the FPGA board. Additionally, the code

contained unwanted libraries and header files that result in a very high overhead. Lastly,

we encountered several linking problems when attempting to execute this conversion. As

a result, we chose to write a custom matrix library in C and implement the algorithm

manually using our own library functions.

4.1.4 Exporting the Database

The C implementation of the algorithm required that we use the same database

that was used when modeling the algorithm with MATLAB. This was done to ensure that

the results of our C implementation coincided with the results we had achieved from our

first modeling attempt.

In order for our C code to read the database, we had to first export the three

different elements of our database, faces, non-faces, and new faces, to three respective

binary files. The following MATLAB code illustrates writing a matrix m_towrite to a

binary file, where m_towrite represents any one of the database matrices faces, non-faces,

and new-faces.

% first create the matrix m_towrite containing the database

fid = fopen('database.txt','wb')

x = fwrite(fid,m_towrite,'float32')

fclose(fid)

The first line of the code creates a file ID that we will write to in ‘wb’ mode, or

write binary mode. Next, we write the matrix m_towrite to the file ID created in the


previous line. The matrix is written using a 32-bit floating-point representation. Finally,

the last line of the code closes the file.

4.2 Training Stage Implementation in C

4.2.1 Custom Function Library

Having verified the correctness of our MATLAB implementation, we then

proceeded by coding the training stage of the PCA algorithm in C. However, prior to

doing so, we had to code a library containing a set of functions to facilitate the

implementation of the PCA algorithm. The matrix library consists of several functions

pertaining to the requirements of the PCA algorithm. The data used for all the functions is

of type float*. Below is a summary of the functions:

C Function          Description
Matrix Transpose    Calculates the transpose of a matrix
Matrix Average      Finds the average row in a matrix
Matrix Multiply     Multiplies two matrices
Matrix Subtract     Subtracts a row from every matrix row
Vector Distance     Calculates the Euclidean vector distance
Eigenvectors        Finds the Eigenvectors/Eigenvalues of a matrix
Read Images         Reads image database binary file
Read Test Face      Reads test face binary file

Table 3: C Functions Used

The eigenvectors function was obtained from a freely available internet source

and it is modeled after the algorithm outlined in Numerical Recipes in C [11]. This

algorithm computes the Eigenvalues and Eigenvectors of a real symmetric matrix using

Jacobi rotations. Once we completed the implementation of the library of functions, we

could then proceed with the implementation of the training stage itself.


4.2.2 Training Stage Calculations

The first task was to declare several single and double subscripted arrays of type

float to accommodate various initial, intermediate, and final matrices involved in the

PCA algorithm. Next, we allocated memory for all the matrices using the malloc

function, and then proceeded with the calculation of the matrices. Prior to this, however,

we used the read_images function to read all the images in the database and store them in

an array. This function opens the file “database.txt” and reads all the binary data from it

using the read binary option. It then stores all the bytes in an array called db which it

returns to the main function as shown below:

read_images(database,NUMFACES, FACESIZE);
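As an illustration of the reading step, the following is a minimal sketch of loading 32-bit floating point values from a binary file. The function name read_floats is hypothetical and is not the read_images function from the project library. The sketch assumes that the host stores a float as a 32-bit IEEE 754 value with the same byte order as the machine that wrote the file. Note also that MATLAB's fwrite writes matrix elements in column order, so the reader has to index the flat array accordingly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: reads `count` 32-bit floats from a raw binary file
   into dst. Returns the number of floats actually read, which is smaller
   than count if the file is missing or too short. */
size_t read_floats(const char *path, float *dst, size_t count)
{
    FILE *f;
    size_t got;

    assert(path != NULL && dst != NULL);

    f = fopen(path, "rb");   /* "rb": binary mode, no newline translation */
    if (f == NULL)
        return 0;

    got = fread(dst, sizeof(float), count, f);
    fclose(f);
    return got;
}
```

A caller would compare the return value with the expected element count (for example NUMFACES * FACESIZE) before using the data.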

At this stage, all the binary data concerning the faces in the database are available

for use and are stored in an array called "database" as shown in the function above. The

first matrix we had to calculate was the average matrix. To do so, we used the

matrix_average function. This finds the average pixel values by adding all the pixels of

the 51 faces in one position and dividing them by 51. The function takes in the "database"

array and returns "average" which is a single vector of size FACESIZE.

matrix_average(database,NUMFACES,FACESIZE,average);

Once we obtain the average we must normalize the entire database by subtracting

the vector "average" from every face vector in the "database" array. Each normalized vector thus

describes how the corresponding face differs from the average face. The

function call is shown below:

matrix_subtract(database,NUMFACES,FACESIZE,average,database);
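The averaging and normalization steps can be sketched as follows. This is a simplified illustration, not the library code from the appendix: the names mean_row and subtract_row are hypothetical, and the matrix is assumed to be stored as one flat array, row by row, with one face per row.

```c
#include <assert.h>
#include <stddef.h>

/* m is a rows x cols matrix stored row by row (one face per row).
   avg receives the average row, i.e. the mean of every pixel position. */
void mean_row(const float *m, size_t rows, size_t cols, float *avg)
{
    size_t r, c;

    assert(m != NULL && avg != NULL && rows > 0);

    for (c = 0; c < cols; c++) {
        float sum = 0.0f;
        for (r = 0; r < rows; r++)
            sum += m[r * cols + c];
        avg[c] = sum / (float)rows;
    }
}

/* Subtracts the row vector v from every row of m, in place. */
void subtract_row(float *m, size_t rows, size_t cols, const float *v)
{
    size_t r, c;

    assert(m != NULL && v != NULL);

    for (r = 0; r < rows; r++)
        for (c = 0; c < cols; c++)
            m[r * cols + c] -= v[c];
}
```

After both calls, every column of the matrix sums to zero, which is the property the later covariance calculation relies on.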


The "database" array now contains all the normalized face vectors. From this

point on, we will use these normalized vectors and not the original ones. The next step in

the algorithm is to find database_transpose × database, the FACESIZE × FACESIZE covariance matrix. Since this would require a huge

number of multiplications and a huge array, a trick is used in which we compute the much smaller

product database × database_transpose instead. First, in order to transpose the "database" array,

we created a simple function that turns every row into a column. The

function call is shown below:

matrix_transpose(database,NUMFACES,FACESIZE,database_trans);

The details of all of these functions we used are available in the appendix section

for further reference. Once we obtain the transpose, we may now perform the above

multiplication operation. Below is the function call. The function takes database as the

first operand and database_trans as the second, and stores the corresponding product in

matrix L.

matrix_multiply(database,NUMFACES,FACESIZE,database_trans,FACESIZE,NUMFACES,L);

Matrix L is of size NUMFACES × NUMFACES (51 × 51) rather than

FACESIZE × FACESIZE (18750 × 18750). This saves a lot of memory and is almost just

as accurate. The next step is to compute the eigenvectors of matrix L. We do so using the

Eigenvector function we created. This function results in an array that contains the

Eigenvectors of L as shown below.

eig(L,NUMFACES,eigenvalues,eigenvectors);

The above two operations are part of the trick used to avoid the huge

array that would result from multiplying database_transpose × database. They are

intermediate operations that lead to obtaining the Eigenvectors of the original matrix


"database". The last step of this alternative method is to compute the Eigenvectors of the

original matrix by multiplying database_transpose by the Eigenvectors (obtained in the

above operation). This will result in the eigenvectors of database. The function is shown

below:

matrix_multiply(database_trans,FACESIZE,NUMFACES,eigenvectors,NUMFACES,NUMFACES,eigenvectors_orig);
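The reason this shortcut works can be stated in a few lines. The notation below is introduced here for illustration only: A denotes the normalized "database" matrix (NUMFACES × FACESIZE), v_i an eigenvector of the small matrix L with eigenvalue λ_i, and u_i the corresponding column of eigenvectors_orig.

```latex
\begin{aligned}
L &= A A^{T}, \qquad L\,v_i = \lambda_i v_i \\
(A^{T}A)\,(A^{T}v_i) &= A^{T}(A A^{T})\,v_i = A^{T}(\lambda_i v_i) = \lambda_i\,(A^{T}v_i) \\
u_i &= A^{T}v_i \quad \text{is an eigenvector of } A^{T}A \text{ with eigenvalue } \lambda_i \text{ (for } \lambda_i \neq 0\text{)}
\end{aligned}
```

The vectors u_i obtained this way are generally not of unit length, so a normalization step may be needed if unit eigenvectors are required.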

The result is stored in eigenvectors_orig which is of size FACESIZE ×

NUMFACES. Now that we have the Eigenvectors matrix we can determine the

projections of each face onto the face space by multiplying the "database" matrix by the

eigenvectors_orig array that we obtained above.

matrix_multiply(database,NUMFACES,FACESIZE,eigenvectors_orig,FACESIZE,NUMFACES,projections);

This operation in effect highlights the key features of every face by projecting it

onto the face space. As such, when a new face is brought in to the system for recognition,

determining whether it is a match would take a much smaller amount of time. Once we

completed the calculation of the average, eigenvectors_orig, and projections matrices, we

decided to truncate the elements of the arrays. That is, up until this point, all the arrays

were of type float. However, since floating point calculations take significantly longer

than integer calculations, truncating the digits after the decimal point and changing the

type to integer saves a lot of computation. Before doing so, we compared the values

obtained in both the integer and floating point cases and found that the error due to

truncation is negligible (less than 0.01%). To perform the truncation, we simply created new integer

arrays and copied the floating point values into them. This automatically truncates anything after

the decimal point. This process is shown below.


int *average_int;

int **eigenvectors_orig_int;

int **projections_int;

/* the integer arrays must be allocated (e.g. with malloc) before use */

for(i=0;i<FACESIZE;i++)
{
    average_int[i] = average[i];

    for(j=0;j<NUMFACES;j++)
        eigenvectors_orig_int[i][j] = eigenvectors_orig[i][j];
}

for(i=0;i<NUMFACES;i++)
    for(j=0;j<NUMFACES;j++)
        projections_int[i][j] = projections[i][j];
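The behaviour that the truncation relies on can be shown in isolation. The sketch below uses a hypothetical helper, truncate_to_int, that is not part of the project code. In C, converting a float to an int discards the fractional part (truncation toward zero); the conversion is only defined when the value fits in the range of int, which the sketch assumes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copies src into dst, converting each float to int.
   The conversion drops the fractional part (truncation toward zero).
   Assumes every value fits in the range of int. */
void truncate_to_int(const float *src, int *dst, size_t n)
{
    size_t i;

    assert(src != NULL && dst != NULL);

    for (i = 0; i < n; i++)
        dst[i] = (int)src[i];
}
```

Because the fraction is dropped rather than rounded, the absolute error per element is below 1, which is small relative to values of the magnitude that appear in the projections.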

At this point, the above three arrays are ready to be written to a file and sent

through the Ethernet port to the FPGA.

4.2.3 Writing to Binary Files

Having obtained the truncated values for the intermediate data, we proceeded by

writing them to binary files. To do this, we first opened a file in write binary mode. We

also performed error detection by checking the value returned by the fopen function. The

code snippet below illustrates this:

if (!(f_testface = fopen("b_testface_int", "wb")))

return 1;

We next write to the binary file by invoking the fwrite function. However, for the

cases of the projection and Eigenvector matrices, caution was exercised to ensure that the

indexing of the matrices corresponds with the storage format. That is, the binary files will

store the data in a linear manner and consistency must be maintained when un-wrapping

two dimensional data into a linear space. By maintaining this indexing consistency at the

receiving end, we were able to reproduce these two dimensional matrices without the loss

or corruption of data. The following illustrates one example of this:



for(i=0;i<FACESIZE;i++)
    fwrite(eigenvectors_orig_int[i], NUMFACES*sizeof(int),1, f_eigenvectors);
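The indexing convention described above can be sketched as follows. The helper name write_rows is hypothetical. The sketch writes a two dimensional matrix row by row, so that element (r, c) ends up at flat position r * cols + c in the file, which is the convention the receiving end must also use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes a rows x cols matrix, given as an array of
   row pointers, to f one row at a time. Element (r, c) lands at flat
   index r * cols + c. Returns 0 on success and -1 on a short write. */
int write_rows(FILE *f, int **m, size_t rows, size_t cols)
{
    size_t r;

    assert(f != NULL && m != NULL);

    for (r = 0; r < rows; r++)
        if (fwrite(m[r], sizeof(int), cols, f) != cols)
            return -1;
    return 0;
}
```

Reading the file back into a flat array and indexing it with r * cols + c reproduces the original matrix.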

Finally, the file is closed and the process is repeated for all 4 segments of binary

data that are required by the recognition stage.

4.3 Ethernet Implementation in C#

As mentioned previously, we chose to use Ethernet to send data from the PC end

(training stage) to the FPGA end (recognition stage). However, in order to do so, we first

had to write a program to send Ethernet packets. To facilitate our work, we obtained an

open source program in C# containing functions for writing packet bytes to the Ethernet

adapter. We then customized the code to suit our needs.

Specifically, the C# program describes a class called RawEthernet for sending

raw Ethernet packets. Furthermore, it describes a method that retrieves all the network

devices of the system, thereby prompting the user to select the relevant network device

for the sending operation. Finally, the program details a DoWrite operation that actually

writes the packets to the Ethernet adapter.

In order to send customized frames containing data from the training stage, we

first create a variable called packet of type byte array. Our packet size was 1014 bytes: 14 bytes for the

Ethernet header, leaving 1000 bytes for data. Below is the declaration statement:

int TOTAL_PACKET_SIZE = 1014;

int DATA_SIZE = TOTAL_PACKET_SIZE - 14;

byte[] packet = new byte[TOTAL_PACKET_SIZE];

The next step involved reading the 4 binary files that were produced by the C

code and sequentially formatting and storing them into packets. In C#, opening and

closing a file stream for reading amounts to the following statements:


FileStream fs = File.OpenRead("b_testface_int");

BinaryReader br = new BinaryReader(fs);

br.Close();

fs.Close();

We started by initializing the packet to be sent with the appropriate destination

address, source address, and frame length. The destination address was made to match the

hardware address allocated on the FPGA board. The source address was also made to

match the address that the FPGA uses for filtering packets. It is important to make sure

that the consistency is maintained in both ends so that important packets do not get lost

and unwanted packets do not get through.
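The 14-byte header layout can be sketched in C as follows (the sending program itself was written in C#). The function name build_header and the source address used in any call are hypothetical. The sketch assumes the standard Ethernet layout: 6 bytes of destination address, 6 bytes of source address, and a 2-byte length field sent most significant byte first. How the length field was actually encoded in the project is not stated in this report.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN    6
#define HEADER_LEN 14

/* Hypothetical helper: fills the first HEADER_LEN bytes of frame with the
   destination address, the source address, and the 2-byte length field
   (most significant byte first). */
void build_header(uint8_t *frame, const uint8_t *dst, const uint8_t *src,
                  uint16_t payload_len)
{
    assert(frame != NULL && dst != NULL && src != NULL);

    memcpy(frame, dst, MAC_LEN);
    memcpy(frame + MAC_LEN, src, MAC_LEN);
    frame[12] = (uint8_t)(payload_len >> 8);
    frame[13] = (uint8_t)(payload_len & 0xFF);
}
```

The payload then starts at offset HEADER_LEN, which matches the 1014 - 14 = 1000 data bytes per packet used above.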

Having performed the necessary initializations, we proceed by reading the first

binary file. We loop from 0 to 75000, representing the number of bytes in the file, and

pick off each byte to store it in the previously defined packet. This operation continues

modulo 1000, so that every time 1000 bytes are written to the packet, the packet is sent

using the previously mentioned DoWrite function.

Once the binary file has been read, we repeat a similar process for each of the

other 3 binary files. For the last file, of size 10404 bytes, we faced the problem that the

binary file size was not a multiple of 1000 bytes. For this case, a statement was added to

the loop that checks for the last iteration and writes the last 404 bytes of the file prior to

breaking from the loop.
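The packet arithmetic behind this special case is a division with remainder. The sketch below is in C rather than C#, and the name split_payload is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: splits `total` bytes into `*full` packets of
   `chunk` bytes each plus a final partial packet of `*rest` bytes
   (0 if the file size is an exact multiple of chunk). */
void split_payload(size_t total, size_t chunk, size_t *full, size_t *rest)
{
    assert(chunk > 0 && full != NULL && rest != NULL);

    *full = total / chunk;
    *rest = total % chunk;
}
```

For the 75000-byte file this gives 75 full packets and no remainder; for the 10404-byte file it gives 10 full packets followed by one packet carrying the last 404 bytes.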

In terms of display, the program first prompts the user to select the desired

network device. Once that is done, the packets are sent in sequence, with a notification

message appearing on the screen after each packet is sent. Once all the packets have been

sent, the program displays a message confirming that all the desired packets were sent.


4.4 Recognition Stage Implementation on FPGA

4.4.1 Receiving Ethernet frames and Storing to DDR

Once the C# program on the sending end was fully functional, we moved on to

implement the C program that runs on the MicroBlaze. The first part of this program

deals with the reception of the data frames and their storage in DDR memory. We

began by adding the basic Xilinx header files (xparameters.h, xbasic_types.h). We also

included the header file (xemac_l.h) in order to use its functions to initialize the EMAC

controller and to perform other important functions such as receiving frames. We then

set the MAC address of the FPGA using the following function:

XEmac_mSetMacAddress(EMAC_BASEADDR, LocalAddress);

The parameter EMAC_BASEADDR has a value of 0x40c00000 and is the base

address of this device in memory. The LocalAddress parameter is the MAC address that

we previously assigned to the FPGA and has a hexadecimal value of 01 06 07 08 09 04.

This address is stored in an integer array of size 6 as shown below:

static Xuint8 LocalAddress[MAC_ADDR_SIZE] =
{
    0x01, 0x06, 0x07, 0x08, 0x09, 0x04
};

In order to receive frames into the FPGA, we also needed to create a reception

buffer that we called RxFrameBuf. RxFrameBuf is an array with size 1500 (which is the

maximum size in bytes of the frames we will be sending). After all the necessary

initializations are made, the program enters a while loop in which it will receive frames.

Since we know exactly how many bytes we need to send from the PC to the board, we

used this value as a limit for looping. One important task that we had to incorporate into

the program was to filter out certain frames that do not contain data from the training


phase. That is, the Windows operating system on the PC randomly sends broadcast

frames across the Ethernet port. We had to ensure that such broadcast frames were not

confused with data frames. We will explain how this was done shortly, but for now it is

important to note that the variable GoodFrameCount in the code represents the number of

actual training phase data frames received. Upon entering the loop, we used the

XEmac_RecvFrameSS function to receive a data frame as follows:

Length = XEmac_RecvFrameSS(EMAC_BASEADDR, (Xuint8 *)RxFrameBuf);

The parameters of this function are the base address of the device and the

corresponding buffer where the frames would be stored. The XEmac_RecvFrameSS that

we used is a modification of the function XEmac_RecvFrame available in the xemac_l.h

library. The XEmac_RecvFrameSS function begins by checking if the receive buffer is

empty. If it is not, then there is a frame in the buffer ready for retrieval:

check = XEmac_mIsRxEmpty(BaseAddress);

while (check==XTRUE)
    check = XEmac_mIsRxEmpty(BaseAddress);   /* keep polling until a frame arrives */


Next, it finds the length of the received frame by checking the address location

of the last byte, and using the base address of the device to calculate the difference.

Finally, the function filters out the broadcast frames that were mentioned previously. It

does so by checking that the destination MAC address of the frame matches the MAC

address of the FPGA. In the case of a broadcast frame, the destination address is FF FF

FF FF FF FF. If the frame's destination address does not match the MAC address of the

FPGA, then the function will return a length of -1 and the frame will be discarded in the

main function.
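The address check can be sketched as a comparison of the first six bytes of the frame with the local address. This is an illustration only: frame_is_for_us is a hypothetical name, and the sketch uses standard C types rather than the Xilinx types used in XEmac_RecvFrameSS.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the destination address of the frame
   (its first 6 bytes) equals local_mac, and 0 otherwise. */
int frame_is_for_us(const uint8_t *frame, const uint8_t *local_mac)
{
    assert(frame != NULL && local_mac != NULL);

    return memcmp(frame, local_mac, 6) == 0;
}
```

A broadcast frame starts with FF FF FF FF FF FF, so it fails this comparison and can be dropped by the caller.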


After the XEmac_RecvFrameSS function, the program goes into a loop in which

it reads the individual bytes of the frame from the receive buffer in order to store the

frame in DDR memory. Below is the piece of code responsible for retrieving the bytes

and storing them in memory.

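for (i=14; i<1014; i+=4)
{
    /* read the next four payload bytes */
    rec1 = (Xuint32) RxFrameBuf[i];
    rec2 = (Xuint32) RxFrameBuf[i+1];
    rec3 = (Xuint32) RxFrameBuf[i+2];
    rec4 = (Xuint32) RxFrameBuf[i+3];

    /* move each byte to its position within the 32-bit word */
    rec2 <<= 8;
    rec3 <<= 16;
    rec4 <<= 24;

    word = 0;
    word = word | rec1 | rec2 | rec3 | rec4;

    XDdr_mWriteReg (MEM_BASEADDR, memcount*4, word);
    memcount++;
}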

We begin by reading the 15th byte present in the buffer, since the first 14 bytes

represent the destination MAC address, the source MAC address, and the length of the frame.

The actual data starts on the 15th byte and ends on the 1014th byte. When reading the

bytes from the receive buffer, we do so four at a time, since every word that will be stored

in memory is 32 bits (i.e., 4 bytes) wide. We begin by reading RxFrameBuf[i],

RxFrameBuf[i+1], RxFrameBuf[i+2], and RxFrameBuf[i+3] and storing them in variables.

We then have to concatenate these four bytes into one word. This is done by

shifting the most significant byte (RxFrameBuf[i+3]) left by 24 places, shifting the next

most significant byte left by 16 places, and shifting the third byte left by 8 places; the least

significant byte is not shifted. The vacated bit positions are filled with zeros. After shifting,

we perform an OR operation on the four values to obtain one 32-bit word.


We then store the word in memory using the XDdr_mWriteReg

(MEM_BASEADDR, memcount*4, word) function which stores the 32 bit word variable

in the memory address that has an offset of 4*memcount from the base address. We then

increment memcount. The program then goes back through the loop again to receive the

frames. We also made the necessary adjustments to the code when receiving the last

frame since it was of size 404 bytes instead of 1000 bytes.

4.4.2 Verification and Testing of Ethernet interface and DDR

In order to verify that all the data needed was transferred across the Ethernet

interface, we began by sending a single frame from the PC to the FPGA. We printed the

contents of this frame on the PC end by using printf statements, and on the FPGA side,

we accessed all the consecutive memory locations where the bytes were stored and sent

them through the serial port of the FPGA to observe on the HyperTerminal. We then

compared all the values of the bytes at both ends and they were identical.

After validating the correct reception of one frame, we moved on to send

multiple frames. We first attempted to send an entire file of size approximately 1 MB.

When attempting to send the consecutive frames, we noticed that after a few frames were

received, frame reception was blocked. This was due to the fact that the frames were

being sent at a much faster rate than the relatively small reception buffer could handle.

Since the maximum size of BRAM is limited, we could not expand these buffers to take

more of the BRAM space. Instead, we slowed down the sending end by incorporating

a wait statement between the transmissions of consecutive packets. After doing this, we

were able to receive all the frames. Once again, we compared the data on both ends, and


they matched. At this point, we were certain that all the data needed was being sent and

correctly stored in memory.

4.4.3 Recognition Phase Implementation on the FPGA

As discussed previously, the recognition stage of the PCA algorithm involves 3

major stages: normalization, projection, and distance calculations. The first stage of the

recognition phase involves reading the test face that we had received over Ethernet and

stored in memory and normalizing it with respect to the average face that is also located

in memory. The following code illustrates this stage:

// normalization stage

for (i=0; i<TESTFACE_SIZE; i+=4)
{
    r1 = XDdr_mReadReg (MEM_BASEADDR, TESTFACE_BASEADDR+i);
    r2 = XDdr_mReadReg (MEM_BASEADDR, AVGFACE_BASEADDR+i);
    r1 = r1 - r2;
    XDdr_mWriteReg (MEM_BASEADDR, TESTFACE_BASEADDR+i, r1);
}

The loop steps through the test face four bytes at a time up to TESTFACE_SIZE, which is defined as the size

of the test face in bytes, or 75,000 bytes. In the for loop, r1 and r2 are the respective

values of the test face and the average face, read from the memory locations

TESTFACE_BASEADDR and AVGFACE_BASEADDR offset by the iteration value, i,

relative to the base address of memory, MEM_BASEADDR. TESTFACE_BASEADDR and

AVGFACE_BASEADDR are the memory addresses (relative to MEM_BASEADDR) of the first

elements of the test face and the average face, respectively. Finally, the result is stored in

place of the original test face in memory.

In the projection stage of the algorithm, we are multiplying the normalized test

face by the matrix of covariance eigenvectors. In order to accomplish this matrix

multiplication, the outer loop iterates over the number of faces in the database, and the

inner loop iterates over the size of the test face stored. During each inner loop iteration,


we obtain the corresponding values for the Eigenvector matrix and the test face from

DDR memory using the indexing illustrated below:

r1 = XDdr_mReadReg (MEM_BASEADDR, EIG_BASEADDR + i*4 + j*NUMFACES);

r2 = XDdr_mReadReg (MEM_BASEADDR, TESTFACE_BASEADDR+j);

We then multiply r1 with r2 and accumulate the product. Once the inner loop runs

to completion, the cumulative product is stored in a corresponding array location and set

back to 0 for the next outer loop iteration.
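The multiply and accumulate structure of the projection stage can be sketched on plain arrays instead of DDR register reads. The function name project is hypothetical. The sketch assumes the eigenvector matrix is stored row by row with face_size rows and num_faces columns, matching the size of eigenvectors_orig, and it uses a 64-bit accumulator as a precaution against overflow, which is an assumption and not necessarily what the FPGA code does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* eig:  face_size x num_faces matrix, stored row by row.
   face: normalized test face with face_size elements.
   proj: receives num_faces values, where
         proj[i] = sum over j of face[j] * eig[j * num_faces + i]. */
void project(const int32_t *eig, const int32_t *face,
             size_t face_size, size_t num_faces, int64_t *proj)
{
    size_t i, j;

    assert(eig != NULL && face != NULL && proj != NULL);

    for (i = 0; i < num_faces; i++) {        /* outer loop: one per eigenvector */
        int64_t acc = 0;                     /* reset for every outer iteration */
        for (j = 0; j < face_size; j++)      /* inner loop: one per pixel */
            acc += (int64_t)face[j] * eig[j * num_faces + i];
        proj[i] = acc;
    }
}
```

The inner loop performs face_size multiplications for each of the num_faces outputs, which is why this stage dominates the run time and is the natural target for hardware multipliers.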

In the final stage of the PCA algorithm recognition phase, the Euclidean distance

between the projection computed earlier and each of the projections stored in the

projections matrix has to be calculated. Since the projections matrix has size

NUMFACES × NUMFACES, both the outer and inner loops iterate over NUMFACES.

Inside, a stored projection value is read from memory, and from this value we subtract the

corresponding value in the projections array, which holds the projection of the test face. Next,

this difference is cast to a floating point value, squared, and accumulated. As an illustration

of the distance calculation, below is a sample of the inner loop code that finds the squared

values for the Euclidean distance:

r1 = XDdr_mReadReg (MEM_BASEADDR, PROJ_BASEADDR+(i*NUMFACES+j)*4);

int1 = r1 - projections[j];

ftest = (Xfloat32) int1;

ftemp += ftest*ftest;

After this is done, the Euclidean distance is found by calling the sqrt function to

calculate the square root of the temporary accumulated value. We then compare this

distance to the smallest distance already calculated. If it is smaller, we set it as the new

minimum distance and proceed. Finally, we display this distance value and the

corresponding face index, which represents which face the smallest distance value

belongs to. Below is the code for finding the minimum distance:


if (i==0)
{
    min = fdist;
    imark = i;
}
else if (fdist < min)
{
    min = fdist;
    imark = i;
}
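The distance and minimum search can be combined into one function, sketched below on plain float arrays. The name nearest_face is hypothetical. Unlike the code above, this sketch compares squared distances and leaves out the square root: the square root is monotonic, so the face with the smallest squared distance is also the face with the smallest distance.

```c
#include <assert.h>
#include <stddef.h>

/* proj: num_faces x dim matrix of stored projections, stored row by row.
   test: projection of the test face, dim elements.
   Returns the index of the closest stored projection and, if min_sq is
   not NULL, stores the squared Euclidean distance to it. */
size_t nearest_face(const float *proj, size_t num_faces, size_t dim,
                    const float *test, float *min_sq)
{
    size_t i, j, best = 0;
    float best_sq = 0.0f;

    assert(proj != NULL && test != NULL && num_faces > 0);

    for (i = 0; i < num_faces; i++) {
        float sq = 0.0f;
        for (j = 0; j < dim; j++) {
            float d = proj[i * dim + j] - test[j];
            sq += d * d;
        }
        if (i == 0 || sq < best_sq) {   /* first face, or a new minimum */
            best_sq = sq;
            best = i;
        }
    }
    if (min_sq != NULL)
        *min_sq = best_sq;
    return best;
}
```

If the actual distance is needed for display, the square root only has to be taken once, on the final minimum.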

4.4.4 Verification and Testing of Recognition Phase

After completing the recognition phase implementation on the FPGA, we moved

to testing our design by running the algorithm for numerous faces and comparing the

results obtained with those of the purely software implementation. Once we ensured that

the results of both implementations were the same, i.e., a test face produced the same

final result on both implementations, we moved on to checking the values after every stage

of the recognition process. To do so, we printed the values of the various stages of the

software implementation onto the screen output of the PC and simultaneously used the

hardware debugger available on the FPGA to investigate the memory contents. Please

refer to the appendix for the projection distances of both implementations.

4.4.5 Implementation of Performance Measurements

In order to measure the amount of time it takes to execute each stage of the

algorithm on the FPGA board, and hence decide on which area to focus on for hardware

optimization, the OPB Timer module had to be added to the MicroBlaze based system.

Adding this timer module provided us with a software interface that can calculate the

number of processor execution cycles. In the initialization stages of our C code, the OPB

Timer module instance timer is declared and initialized.


XTmrCtr timer;

XTmrCtr_Initialize(&timer,XPAR_OPB_TIMER_0_DEVICE_ID);

In order to measure the number of execution cycles a specific operation takes, the

timer has to be reset prior to starting it. Following the completion of the operations, the

timer is stopped and the value is read. The following code illustrates this:

XTmrCtr_Reset(&timer,0);

XTmrCtr_Start(&timer,0);

// operations to be measured

XTmrCtr_Stop(&timer,0);

cycles = XTmrCtr_GetValue(&timer,0);

On the PC end, measuring the time involved obtaining a header file from an open-source

project and using its timing functions to count the clock ticks elapsed

during runtime. Below are the commands that we used to measure the execution time of

the corresponding stages on the PC:

QueryPerformanceCounter(&start_ticks);

// RECOGNITION STAGE CODE GOES HERE

QueryPerformanceCounter(&end_ticks);

cputime.QuadPart = end_ticks.QuadPart- start_ticks.QuadPart;

printf ("\tElapsed CPU time test: %.9f sec\n",

((float)cputime.QuadPart/(float)ticksPerSecond.QuadPart));

5. CRITICAL APPRAISAL

5.1 Researching Face Recognition

All throughout this project, we faced several decisions to ensure that our face

recognition system design would work out as planned. In the early stages of the project,

we only had a vague idea of what we wanted to implement. We started gaining some

experience in the field of face recognition by reading and researching numerous

publications.


Some publications detailed algorithms for performing face recognition, spanning

from the statistical to the biologically inspired. We also reviewed publications that

described hardware implementations of face recognition. After having gone through all

this information, we had a better feel for the subject.

The first major decision involved choosing which algorithm to adopt. Since each

of the algorithms has its own merits and drawbacks, we had to assign weights to these

factors relative to our needs. We decided that the most important needs involved

simplicity of implementation, accuracy, and the ability to gain from hardware

optimizations. After convening for several days, we decided to choose the PCA

algorithm, which was outlined earlier in this report. It proved to be simple to implement,

relatively accurate, and ideal for hardware optimization, as it involved a great deal of

matrix multiplication.

5.2 Modeling in MATLAB

We then studied the PCA algorithm in more detail by reviewing the relevant

papers and learning the necessary mathematical concepts. After discussing the matter, we

decided that the most convenient starting point to test our understanding of the PCA

algorithm was to model it in MATLAB. The reason for our decision was that MATLAB

provides several functions that bypass low-level details, thereby allowing us to focus on

the algorithm itself.

After obtaining a free face database from the Internet, we began coding the

algorithm in MATLAB. Since MATLAB provides a rich set of visualization tools, we

were able to check the code at several stages and obtain a deeper conceptual

understanding of how face recognition works. As a result, we learned a valuable lesson


that it is better to simulate a system using high-level tools before delving into details.

Indeed, the MATLAB simulation made our transition to C very smooth and strengthened

our understanding of the concepts at hand.

5.3 Working with C

When coding the training stage in C, we also faced some difficulties with low-

level issues inherent to the language. For example, the handling of large matrices for face

recognition involved a great deal of memory allocation, initialization, and pointer

referencing. Also, at first we were overwhelmed by the number of functions that had to

be written in C. Predefined math libraries could not be used because of the limited space

available on the FPGA. As a result, we had to write customized functions in C. Although

we were facing some difficulties dealing with all these functions, we adopted a modular

approach, which greatly simplified our work.

One particularly problematic function was the function that calculates the

Eigenvectors of a matrix. Since finding the Eigenvectors involves advanced numerical

methods, we chose to use the function defined in the book “Numerical Recipes in C.”

However, after several days of working with the functions, we were still not getting

correct results. As a result, we had to follow the function step by step, until we finally

spotted the problem. This taught us the lesson that to overcome a problem, it is important

to narrow down within the code until the problem is limited to only a few lines. This

method for debugging proved to be very useful with future problems.

After we had coded and tested our functions, we then proceeded to code the

actual PCA algorithm. Now that we had developed a library of custom functions,


implementing the algorithm essentially followed a path similar to MATLAB. Because we

had tested every component for correctness, we were quickly able to get the algorithm to

produce results identical to MATLAB. Indeed, we learned that it is vital to test each of

the components extensively before integrating them.

The code in C simulated both the training stage of the algorithm and the

recognition stage. However, ultimately, we wanted to perform the training stage on the

host PC, store the intermediate data as binary files, transmit the binary files to the FPGA,

and perform the recognition stage on the MicroBlaze. As a result, we had to write

functions that translated matrix variables into byte format and wrote them to binary files.

In order to test the correctness of the binary file, we opened it in a HEX viewer and

compared some sequences of data with the original variables. After finding some errors

in the data, we discovered that the problem was in the way we were writing the data.

It turned out that binary mode (the "wb" option) must be specified when opening a file for binary output,

as opposed to regular text files.

5.4 Using the FPGA

On the FPGA end, we faced more difficulties, primarily because the platform was

relatively new to us. Xilinx Platform Studio has a very steep learning curve, making it

difficult to progress. However, thanks to the course “Embedded System Design,” we

were able to gather the necessary tools to work with FPGA boards. Nevertheless, we still

faced several difficulties, especially in areas that required adding a new core to the

system, writing some VHDL code, and interfacing with IP cores like Ethernet.

Our first task on the FPGA end involved being able to read and write from DDR

memory. This was crucial to our project because all the data generated in the training


stage must be stored in external memory, as the working memory of the FPGA is not

sufficiently large in size to accommodate the requisite data. One of the problems that we

faced was that DDR is word-addressable while the binary files are byte-addressable. As a

result, we had to concatenate every 4 bytes into a 32-bit word, with proper endian-ness,

and write it to DDR.

Another problem that we were having on the FPGA involved the use of the float

data type. Initially, we had planned on executing the entire PCA algorithm using floats,

so naturally, we tried to write some experimental C code on the FPGA that adds and

prints two floating-point numbers. Unfortunately, we were not obtaining correct results,

mainly because the xil_printf statement, which prints on the HyperTerminal, does not

support the printing of floats. As a result, we had to resort to printing the float values as

decimals, converting them to hexadecimal, and using an online utility to obtain the

corresponding floating point value.

Still, we obtained wrong results. After several days of wrestling with the problem,

we discovered that the decimal values represented the first 32 bits of a 64 bit

representation rather than a standalone 32-bit floating point representation. This taught us

a major lesson, that not everything comes easily, and that several things require spending

time with the code and experimenting with different methods.
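One way to inspect a float without a float-capable print routine is to copy its bit pattern into a 32-bit integer and print that integer in hexadecimal. The sketch below uses a hypothetical helper, float_bits, and assumes that float is a 32-bit IEEE 754 value. As background, C converts a float argument to double when it is passed to a variadic function such as printf, which is one plausible explanation for seeing part of a 64-bit representation; whether this was the exact cause in our case is not confirmed here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the raw 32-bit pattern of a float.
   memcpy is used instead of a pointer cast to avoid aliasing problems.
   Assumes float is a 32-bit IEEE 754 value. */
uint32_t float_bits(float x)
{
    uint32_t bits;

    assert(sizeof(float) == sizeof(uint32_t));

    memcpy(&bits, &x, sizeof bits);
    return bits;
}
```

The returned integer can then be printed with an integer-only routine and decoded by hand or with a converter; for example, 1.0f corresponds to the pattern 0x3F800000.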

5.5 Researching Hardware Multipliers

We faced even more problems when discovering how to interface with the on-

board hardware multipliers through the MicroBlaze processor. Our preliminary analysis

using our C code implementation had shown us that the most computationally

demanding phase of the algorithm was the projection stage that involved matrix

multiplication. By using the documentation available with the board and some application

notes, we learned how to instantiate the hardware multipliers through VHDL

hardware description code. We then proceeded to learn how to add a custom IP core on the

board and interface it with C code. After overcoming many problems related to satisfying

the timing constraints of the on-board clock and hardware synthesis, we were able to

utilize the core to perform fast multiplication.

After even more research into the issue of using hardware multipliers, we

discovered that by altering the parameters of the MicroBlaze instance, we could route all

multiplication instructions in C to the on-board hardware multipliers. This eliminated the

need for the custom IP core and drastically improved our initial performance analysis.

This is because we had removed the overhead of transferring data to and from the custom

core in addition to delays incurred by the control signals needed to regulate the multiply

and accumulate process.

5.6 Learning to Use Ethernet

Having established our ability to work with DDR memory, floats, and hardware

multipliers on the FPGA, we then had to learn how to deal with Ethernet. As previously

mentioned, our choice to use an Ethernet interface instead of a serial interface stemmed

from the fact that the data files to be transmitted were relatively large (greater than 3

MB). Working with Ethernet was more difficult than working with DDR, mainly because

it involves two parties, the sending side and the receiving side, both of which must work

correctly.

Thus, we first had to find a way to send Ethernet frames from the PC end. At first

we tried searching for a utility that sends raw Ethernet packets, but we were unable to

find a freely available one. We were basically looking for a program that sends frames at

the Ethernet level and not the IP level. After some more searching, we found a C# source

file containing functions for sending raw Ethernet packets. We then modified the code to

send data from the training stage.

At the FPGA end, we wrote the code to receive frames and specified the

necessary Ethernet settings. When we connected the two ends together, we used print

statements and the built-in hardware debugger on the FPGA end to make sure we were

receiving packets. At first, no packets were appearing on the receiving end. After reviewing

our initial Ethernet settings, we discovered that the receiver was being kept in reset

mode. After resolving this issue, we discovered that we were receiving incorrect data in

the packets. By using the hardware debugger, we found that all packets we were

receiving had the broadcast address as their destination address. The reason for this

anomaly was that Windows was sending broadcast packets through Ethernet. When we

pinpointed the problem, we simply coded a filter that discards all unwanted packets.
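A filter of this kind can be sketched as below. The sketch assumes that frames are filtered on the destination address, which occupies the first six bytes of an Ethernet frame, and that frames shorter than the 14-byte header are discarded. The function name frame_is_for_us and the constants are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ADDR_LEN   6   /* bytes in a MAC address */
#define ETH_HEADER_LEN 14  /* destination + source + type/length */

/* Return 1 if the frame is addressed to our_mac, 0 otherwise.
   Broadcast frames (destination FF:FF:FF:FF:FF:FF) and truncated
   frames are rejected because they do not match our_mac. */
int frame_is_for_us(const uint8_t *frame, size_t len,
                    const uint8_t our_mac[ETH_ADDR_LEN])
{
    assert(our_mac != NULL);
    if (frame == NULL || len < ETH_HEADER_LEN)
        return 0;
    return memcmp(frame, our_mac, ETH_ADDR_LEN) == 0;
}
```

The receive loop would call this check on every incoming frame and process the payload only when it returns 1.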

5.7 Porting PCA to the FPGA

Having written to binary files, sent them over Ethernet, and stored them on DDR,

we were then able to proceed to the next stage. This involved actually coding the PCA

algorithm on the FPGA. Since we had already implemented the recognition stage using C

on the host computer, the task at hand was to modify the functions we already created in

order to accommodate the way the matrices are stored in DDR memory and the way

floating-point values are accessed.

For example, the normalization phase, which was written as a function earlier,

was now modified to read the corresponding memory locations, perform the

subtraction, and write the results back to memory. Similarly, the projection and

distance calculations involved several modifications for memory access. Throughout,

most of the hurdles we faced were related to indexing problems. This is due to the fact

that matrix multiplication involves two-dimensional entities which are stored linearly in

memory. As a result, we had to make sure that the indexing for matrix element access

was correct. To do so, we had to run the code several times and print out the results until

they matched the values obtained on the PC.
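The indexing issue can be illustrated with a matrix multiplication over flat arrays. This is a sketch only: it assumes row-major storage, which may differ from the layout actually used in DDR, and the names flat_index and flat_multiply are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Linear index of element (row, col) in a row-major matrix
   that has `width` columns. */
size_t flat_index(size_t row, size_t col, size_t width)
{
    assert(col < width);
    return row * width + col;
}

/* c (h1 x w2) = a (h1 x w1) * b (w1 x w2), all stored as flat
   row-major arrays. */
void flat_multiply(const float *a, size_t h1, size_t w1,
                   const float *b, size_t w2, float *c)
{
    size_t i, j, k;

    for (i = 0; i < h1; i++) {
        for (j = 0; j < w2; j++) {
            float sum = 0.0f;
            for (k = 0; k < w1; k++)
                sum += a[flat_index(i, k, w1)] * b[flat_index(k, j, w2)];
            c[flat_index(i, j, w2)] = sum;
        }
    }
}
```

Keeping the index computation in a single helper means that a change of layout, for example to column-major, only has to be made in one place.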

Other issues that we faced in the PCA implementation on the FPGA included the

data types that we were using. For example, the function that retrieves data from memory

stores the word in a variable of type Xuint32, which represents an unsigned 32-bit

integer. However, after performing subtraction (such as in the distance calculation), the

data type would lead to wrong values. We then corrected this deviation by assigning the

subtraction operation to Xint32 instead of Xuint32. We also faced other problems related

to the data types, such as float and long int. These problems taught us much about the

importance of understanding the nature of the data used in the system.
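The signedness problem can be reproduced in a few lines. The sketch uses the standard uint32_t and int32_t types as stand-ins for Xuint32 and Xint32, and it assumes that the stored values fit in the non-negative range of a signed 32-bit integer. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned subtraction wraps modulo 2^32: 3 - 5 becomes 4294967294. */
uint32_t unsigned_diff(uint32_t a, uint32_t b)
{
    return a - b;
}

/* Signed subtraction gives the expected negative result, provided
   both operands fit in the non-negative range of int32_t. */
int32_t signed_diff(uint32_t a, uint32_t b)
{
    assert(a <= INT32_MAX && b <= INT32_MAX);
    return (int32_t)a - (int32_t)b;
}
```

With a = 3 and b = 5, unsigned_diff returns 4294967294 whereas signed_diff returns -2, which is the value a distance calculation needs before squaring.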

5.8 Performance Assessment

In our final testing stages, we had to compare two different implementations of

the algorithm on the FPGA along with our first simulation on a host PC. The first

implementation inherently utilized the on board hardware multipliers in order to perform

the required multiplication operation. Our other implementation relied on programmable

gates in order to implement the multipliers. When we moved to download this

implementation onto the FPGA, we found that the code was too big to fit in the BRAM

blocks. As such, we had to try and reduce the size of our code without losing

functionality. This proved to be impossible. As such, we resorted to testing each phase of the

algorithm individually and measure the performance. Through this process, we learnt the

importance and value of our limited memory resources and how to optimize our

implementation so as to use these resources efficiently.

When trying to measure the performance of each of our implementations, we

faced some difficulties in finding the correct tools for this purpose. On the FPGA side,

we had to add the OPB Timer module to count the execution cycles. This further

limited our memory resources and forced us to run the simulation on each of the

algorithm stages individually. On the host PC side, the regular C libraries did not provide

us with measurements that were accurate enough. Therefore, we had to resort to using

some open source functions and libraries to find accurate measurements for each of the

stages of the algorithm.

To sum up, it is evident that the past two semesters have been extremely fruitful

in terms of the amount of knowledge and experience acquired. Although at first we were

overwhelmed by the magnitude of this project, we discovered that breaking down our

problems into smaller pieces yielded quick and effective solutions. Moreover, we learned

a great deal about hardware implementations and FPGA programming, thereby widening

the scope of our applied knowledge.

6. RESULTS

6.1 Methodology Overview

After completing the implementation phase of our project, we moved on to the

analysis and performance assessment of our results. This involved obtaining several

execution time performance metrics and using them to interpret the relative efficiency of

our system.

As stated earlier, speed is of prime importance when it comes to the process of

recognizing a face. As such, the next most reasonable step involved obtaining a temporal

breakdown of the recognition phase. Specifically, the recognition phase can be broken

down into the following components:

• Normalization

• Projection

• Distance Calculation
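The three components listed above can be sketched on flat arrays as follows. This is an illustrative outline rather than the code that ran on the board: the function names are hypothetical, the eigenvector matrix is assumed to be stored row-major with one eigenvector per column, and the squared distance is returned instead of the Euclidean distance because it yields the same ranking of candidate faces.

```c
#include <assert.h>
#include <stddef.h>

/* Normalization: subtract the average face from the test face. */
void normalize(const float *face, const float *average, size_t n, float *out)
{
    size_t i;

    for (i = 0; i < n; i++)
        out[i] = face[i] - average[i];
}

/* Projection: proj (1 x k) = norm (1 x n) * eigvecs (n x k). */
void project(const float *norm, const float *eigvecs,
             size_t n, size_t k, float *proj)
{
    size_t i, j;

    assert(k > 0);
    for (j = 0; j < k; j++) {
        float sum = 0.0f;
        for (i = 0; i < n; i++)
            sum += norm[i] * eigvecs[i * k + j];
        proj[j] = sum;
    }
}

/* Distance calculation: squared Euclidean distance between the
   projection of the test face and a stored projection. */
float squared_distance(const float *p, const float *q, size_t k)
{
    size_t j;
    float sum = 0.0f;

    for (j = 0; j < k; j++)
        sum += (p[j] - q[j]) * (p[j] - q[j]);
    return sum;
}
```

The projection step contains the only doubly nested loop over the full face size, which is consistent with the expectation that it dominates the execution time.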

Our performance measurements were based on two FPGA implementations

benchmarked against a purely software implementation in C. All three candidate

implementations are outlined below:

              Implementation 1    Implementation 2         Implementation 3
Device        Acer Laptop         Virtex-II Board          Virtex-II Board
Processor     1.7 GHz Centrino    100 MHz MicroBlaze       100 MHz MicroBlaze
Environment   MS Visual C++       Xilinx Platform Studio   Xilinx Platform Studio
Multiplier    Software            Programmable Gates       Dedicated Hardware

Table 4: Implementation Descriptions

Prior to obtaining any measurements, we developed a hypothesis that the

projection phase would consume the largest portion of time. This rationale stemmed from

the fact that this phase involves high computational demand in the form of a matrix

multiplication operation.

6.2 PC Implementation

In order to test our hypothesis, we first timed the recognition stage on

Implementation 1. This entailed importing additional libraries, along with predefined

time functions. We then started/ended the timer before/after each of the 3 stages of our

algorithm, and obtained the following results:

Implementation 1
Phase                  Execution Time         Clock Cycles Elapsed
Normalization          0.235 milliseconds     399,500 clock cycles
Projection             49.5 milliseconds      84,150,000 clock cycles
Distance Calculation   3.32 milliseconds      5,644,000 clock cycles
TOTAL                  53.055 milliseconds    90,193,500 clock cycles

Table 5: Implementation 1 Results

In the above table, the number of clock cycles elapsed was obtained by

multiplying the execution time by 1.7 GHz, representing the speed of the Centrino

processor. Although some of these clock cycles are probably spent sustaining operating

system functions rather than the recognition stage itself, they must still be

included. The reason for this is that in reality, a system implemented on a PC would have

to run on an operating system and incur the overhead of OS calls.

6.3 FPGA Implementations

We next moved on to measure the timings involved in Implementations 2 and 3.

The only method of accomplishing this was to add an opb_timer, which is an IP core that

must be added to the project. As a result, the entire system had to be regenerated since the

IP core interfaces with the OPB bus on the FPGA. Once regeneration was completed, we

measured the execution of each of the 3 stages for recognition. It is important to note that

unlike in Implementation 1, the opb_timer measures execution in terms of clock cycles as

opposed to units of time. Thus, we had to divide the number of clock cycles by 100 MHz

to obtain the execution time.
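Both conversions amount to a single multiplication or division by the clock frequency. The helper names below are hypothetical and serve only to make the arithmetic explicit.

```c
#include <assert.h>

/* Clock cycles elapsed during `seconds` on a clock of `hz` hertz. */
double cycles_from_time(double seconds, double hz)
{
    assert(hz > 0.0);
    return seconds * hz;
}

/* Execution time in seconds of `cycles` on a clock of `hz` hertz. */
double time_from_cycles(double cycles, double hz)
{
    assert(hz > 0.0);
    return cycles / hz;
}
```

For example, 49.5 milliseconds at 1.7 GHz corresponds to 84,150,000 cycles, and 77,361,175 cycles at 100 MHz correspond to roughly 0.774 seconds.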

Implementation 2
Phase                  Clock Cycles Elapsed         Execution Time
Normalization          1,558,397 clock cycles       15.5 milliseconds
Projection             211,744,171 clock cycles     2.12 seconds
Distance Calculation   1,474,767 clock cycles       14.7 milliseconds
TOTAL                  214,777,335 clock cycles     2.15 seconds

Implementation 3
Phase                  Clock Cycles Elapsed         Execution Time
Normalization          1,550,129 clock cycles       15.5 milliseconds
Projection             77,361,175 clock cycles      774 milliseconds
Distance Calculation   1,152,310 clock cycles       11.5 milliseconds
TOTAL                  80,063,614 clock cycles      801 milliseconds


Firstly, comparing the results of implementations 2 and 3, we notice that using the

hardware dedicated multipliers in implementation 3 resulted in a significant speed-up in

time. The number of clock cycles of the projection phase in implementation 3 is about

63% lower than in implementation 2. As expected, the normalization phases in both FPGA

implementations were practically identical due to the fact that no multiplications take

place in this phase.

Lastly, there was approximately a 22% speed-up in the distance calculations in

implementation 3 over implementation 2 since this phase inherently involves squaring

values (i.e. multiplying values by themselves). The speed-up was not as high as in the

projection phase since the distance calculation phase is not purely multiplication-

intensive.
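The percentages quoted above follow from the measured cycle counts. The helper below, whose name is hypothetical, computes the relative reduction of one count with respect to a baseline.

```c
#include <assert.h>

/* Percentage by which `improved` is lower than `baseline`. */
double percent_reduction(double baseline, double improved)
{
    assert(baseline > 0.0);
    return 100.0 * (baseline - improved) / baseline;
}
```

Applied to the projection phase (211,744,171 versus 77,361,175 cycles) this gives about 63.5%, and applied to the distance calculation (1,474,767 versus 1,152,310 cycles) it gives about 21.9%.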

[Bar chart "FPGA Comparison": cumulative clock cycles (0 to 250,000,000) per stage (Normalization, Projection, Distance) for Impl. 2 and Impl. 3]

Figure 11: FPGA Comparison

6.4 Overall Performance Analysis

Looking closely at the results of the first and third implementations, we notice

that the execution time on the FPGA is longer than in the case of the software implementation.

This is primarily due to the fact that the PC we ran the software implementation on has a

1.7 GHz processor versus a 100 MHz processor running on the FPGA. However, if we

take the number of clock cycles in an absolute sense, the third hardware implementation

took approximately 11% fewer clock cycles to execute than the software implementation.

As such, comparing the performance in terms of clock cycles shows that the third

implementation is the fastest, as shown in the graph below.

[Bar chart "Total Performance": clock cycles (0 to 250,000,000) versus implementation number (1, 2, 3)]

Figure 12: Total Performance

The justification for clock-cycle-based comparison stems from the fact that in

reality, multiple hardware units would be used in parallel to run the face recognition

algorithm. Furthermore, in a real implementation, an FPGA board with a faster processor

core would be used to speed up the algorithm. Lastly, the algorithm could be implemented

on an ASIC, resulting in a further increase in performance.

Thus far, our performance measurements have shown us that by utilizing the on-

board hardware multipliers, we can greatly improve the performance of our system. The

device utilization summary below reveals that there is ample room to make use of

the available hardware multiplier units (the MULT18X18s):

Device Utilization Summary:

Number of MULT18X18s 10 out of 40 25%

Therefore, due to the availability of these multipliers and the nature of matrix

multiplication, future work in this field could be centered on trying to utilize this resource

in order to parallelize the process of matrix multiplication and further improve the

performance of this application.

7. EXTERNAL FACTORS AND CONSTRAINTS

We embarked upon the project of creating a face recognition system in hardware

in order to promote the quality as well as the safety of human life. From a security

perspective, face recognition systems are a vital means of identifying people and ensuring

that only the appropriate people have access to certain areas. Post-September 11,

numerous international airports have tightened security in order to identify certain black-

listed potential terrorists, many of whom disguise themselves in an attempt to hide their

identity. Face recognition systems are needed more than ever in such airports, and act as

a tool in determining the identity of passengers and crew members alike.

Moving to smaller-scale applications, face recognition has been introduced into

the home security industry, and the market for such security systems has been growing

ever since. The accuracy with which such systems can detect features in a human face not only

ensures correct functionality, but also allows homeowners to leave behind valuable

belongings without the constant burden of worrying about potential burglaries.

As face recognition systems are becoming more and more popular by the day,

there is a need for research into ways of designing systems with low cost, high reliability

and high accuracy and speed. The aim of our project is to use the hardware resources of

an FPGA in order to speed up the recognition process while maintaining a high level of

accuracy. Whether these systems are used in airports, offices or homes, speed is a very

important constraint. Passengers should not have to wait extra hours in line in order for

authorities to identify them. Rather, it should be a walk-through process free of waiting

time. The same applies to applications in the home security industry. Fast face

recognition systems would allow for instant detection and entrance into the home.

In addition to investigating ways to speed up the recognition process, we also

focused on creating a system that is sustainable and upgradeable. Faces can be added to

the existing database with great ease, and re-computing the new data is instantaneous.

The fact that we used a Field Programmable Gate Array to implement our system is one

of the key advantages over other systems. It provides for easy upgrading of the system

simply through the modification of code that runs on the core processor. Moreover, new

cores and new features would cost very little since the system we created leaves available

a huge amount of programmable gates for future modifications.

On the ethical side, face recognition treads on sensitive territory regarding the

privacy of individuals. Many individuals prefer to have more discreet forms of

identification and detection that do not rely on such direct biometric measurements.

However, the privacy of an individual can be sustained by ensuring that the process is

automated and that the images captured are stored securely on a server. In this manner,

we can capitalize on the benefits of face recognition while preserving individual privacy.

Finally, from an economic vantage point, our project is an investment into

research that could result in millions of dollars of savings. Automation and speed-up will

lead to a lower need for human intervention, thereby cutting costs across several

frontiers. Nevertheless, the start-up cost for such a project is quite steep as it involves

revamping the entire security infrastructure that permeates modern life.

8. CONCLUSION

Over the course of the past two terms, we have researched the field of face

recognition, familiarized ourselves with the FPGA, and modeled the PCA algorithm in

both MATLAB and C. We next developed the system requirements of our intended

design and created a block diagram depicting the interconnection among the various

components of our system. Lastly, we implemented the algorithm on the FPGA, complete

with Ethernet, DDR Memory, and on-board hardware multipliers. Profiling the code

revealed that matrix multiplication was the most time consuming aspect of the algorithm

and that the on-board hardware multipliers yield the fastest operation.

Our system can be further enhanced in several different ways. For example, a

friendly user interface can be created to improve software usability. Performance can be

further enhanced by employing hardware multipliers running in parallel and by

improving the clock speed of the soft core processor on the FPGA board. Having pieced

together the face recognition system over several months of milestones and setbacks, we

learned some valuable lessons. We hope that this system provides some additional insight

into the field of face recognition and contributes to the development of the field.

9. REFERENCES

[1] W. Zhao, R. Chellappa, P.J. Phillips, A. Rosenfeld, “Face Recognition: A

Literature Survey,” ACM Computing Surveys, Vol. 35, No. 4, December 2003,

pp. 399-458

[2] M.A. Turk, A.P. Pentland. “Face Recognition Using Eigenfaces,” IEEE

Conference on Computer Vision and Pattern Recognition, pp.586--591, 1991.

[3] P. N. Belhumeur, J. P. Hespanha, D. J. Kriegman, “Eigenfaces vs. Fisherfaces:

Recognition using class specific linear projection,” IEEE Trans. Pattern Anal.

Machine Intell., vol. 19, pp. 711–720, May 1997.

[4] M.S. Bartlett, J.R. Movellan, T.J. Sejnowski, “Face Recognition by Independent

Component Analysis”, IEEE Trans. on Neural Networks, Vol. 13, No. 6,

November 2002, pp. 1450-1464

[5] H. Ando, N. Fuchigami, M. Sasaki, A. Iwata, “A Prototype Software System for

Multi-object Recognition and its FPGA Implementation,” Proc. Third Hiroshima

International Workshop on Nano-electronics for Terra-Bit Information

Processing, 2004.

[6] Gottumukkal R., and Asari K.V., “System Level Design of Real Time Face

Recognition Architecture Based on Composite PCA,” Proc. GLSVLSI 2003,

2003, pp. 157-160.

[7] Hau T. Ngo, Rajkiran Gottumukkal, Vijayan K. Asari, “A Flexible and Efficient

Hardware Architecture for Real-Time Face Recognition Based on Eigenface,”

Proc. IEEE Computer Society Annual Symposium on VLSI: New Frontiers in

VLSI Design (ISVLSI'05), 2005, pp. 280-281.

[8] X. Li and S. Areibi, “A Hardware/Software Co-design Approach for Face

Recognition,” Proc. 16th International Conference on Microelectronics, Tunis,

Tunisia, Dec 2004.

[9] Moritoshi Yasunaga, Taro Nakamura, and Ikuo Yoshihara, “A Fault-tolerant

Evolvable Face Identification Chip,” Proc. Int. Conf. on Neural Information

Processing, pp.125-130, Perth, November 1999.

[10] In Ja Jeon, Boung Mo Choi, Phill Kyu Rhee, “Evolutionary Reconfigurable

Architecture for Robust Face Recognition,” Proc. International Parallel

and Distributed Processing Symposium (IPDPS'03), 2003, p. 192a.

[11] Press, William H., Brian P. Flannery, Saul A. Teukolsky, and William T.

Vetterling. Numerical Recipes in C: The Art of Scientific Computing. 2nd ed.:

Cambridge University Press, 1992.

10. APPENDIX

10.1 PCA Code in MATLAB

function [distances] = pca(A,test_face,k)

fprintf(1,'Computing average face...\n');

average_face = mean(A);

num_of_faces = size(A,1);

fprintf(1,'Computing vector differences...\n');

for i = 1:num_of_faces

faces_diff(i,:) = A(i,:) - average_face;

end;

fprintf(1,'Computing L matrix...\n');

L = faces_diff * faces_diff';

fprintf(1,'Computing Eigenvectors of L...\n');

[V,D] = eigs(L,k);

fprintf(1,'Extracting Eigenvectors of covariance matrix...\n');

eigenvec_u = faces_diff' * V;

%fprintf(1,'Normalizing eigenvectors...\n');

%z = sum(eigenvec_u,1);

%eigenvec_u = eigenvec_u ./ z (ones(size(eigenvec_u,1), 1) ,:);

%eigenvectors = eigenvec_u;

fprintf(1,'Computing face projections...\n');

projections = faces_diff * eigenvec_u;

fprintf(1,'Testing a face...\n');

%test_face = B(3,:);

test_norm = test_face - average_face;

test_proj = test_norm * eigenvec_u;

distances = dist(projections, test_proj');

10.2 PCA Code in C

10.2.1 Matrix Library

#include <stdio.h>

#include <stdlib.h>

#include <math.h>

#include "eig.c"

/* Print a height x width matrix, one row per line. */
void matrix_print(float** matrix, int height, int width)
{
    int i,j;

    for(i=0;i<height;i++)
    {
        for(j=0;j<width;j++)
            printf ("%f\t",matrix[i][j]);
        printf("\n");
    }
    printf("\n");
}

/* Print a vector of the given size on one line. */
void vector_print(float* vector, int size)
{
    int i;

    for(i=0;i<size;i++)
        printf ("%f\t",vector[i]);
    printf("\n\n");
}

/* matrix_out (width_in x height_in) = transpose of matrix_in. */
void matrix_transpose(float** matrix_in, int height_in, int width_in, float** matrix_out)
{
    int i,j;

    for(i=0;i<height_in;i++)
        for(j=0;j<width_in;j++)
            matrix_out[j][i] = matrix_in[i][j];
}

/* matrix_out[i] = average of element i over all vector_num row vectors. */
void matrix_average(float** matrix_in, int vector_num, int vector_size, float* matrix_out)
{
    int i,j;
    float temp;

    for(i=0;i<vector_size;i++)
    {
        temp = 0;
        for(j=0;j<vector_num;j++)
            temp += matrix_in[j][i];
        matrix_out[i] = temp/vector_num;
    }
}

/* matrix_out (height1 x width2) = matrix1 * matrix2; width1 must equal height2. */
void matrix_multiply(float** matrix1, int height1, int width1,
                     float** matrix2, int height2, int width2,
                     float** matrix_out)
{
    int i,j,k;
    float temp;

    for(i=0;i<height1;i++)
    {
        for(j=0;j<width2;j++)
        {
            temp = 0;
            for(k=0;k<width1;k++)
                temp = temp + matrix1[i][k]*matrix2[k][j];
            matrix_out[i][j] = temp;
        }
    }
}

/* Subtract the same vector from every row of matrix_in. */
void matrix_subtract(float** matrix_in, int vector_num, int vector_size,
                     float* vector, float** matrix_out)
{
    int i,j;

    for(i=0;i<vector_num;i++)
        for(j=0;j<vector_size;j++)
            matrix_out[i][j] = matrix_in[i][j] - vector[j];
}

/* Euclidean distance between two vectors of the same size. */
float vector_distance(float* vector1, float* vector2, int size)
{
    int i;
    float temp = 0;

    for(i=0;i<size;i++)
        temp = temp + (vector1[i] - vector2[i]) * (vector1[i] - vector2[i]);
    return (float)sqrt(temp);
}

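/* Eigenvalues and eigenvectors of the symmetric n x n matrix mat,
   sorted by eigsrt. Wraps the 1-based routines of eig.c. */
void eig(float** mat,int n, float* eval, float** evec)
{
    double** mnew;
    int* iterations;
    double** vectors;
    double* values;
    int i,j;

    /* jacobi and eigsrt index from 1, so each array gets one extra
       element and index 0 is left unused */
    iterations = malloc(4*sizeof(int));
    mnew = malloc((n+1)*sizeof(double*));
    for(i=0;i<(n+1);i++)
        mnew[i] = malloc((n+1)*sizeof(double));
    vectors = malloc((n+1)*sizeof(double*));
    for(i=0;i<(n+1);i++)
        vectors[i] = malloc((n+1)*sizeof(double));
    values = malloc((n+1)*sizeof(double));

    for(i=1;i<=n;i++)
        for(j=1;j<=n;j++)
            mnew[i][j] = (double) mat[i-1][j-1];

    jacobi(mnew,n,values,vectors,iterations);
    eigsrt(values,vectors,n);

    for(i=1;i<=n;i++)
        for(j=1;j<=n;j++)
            evec[i-1][j-1] = (float) vectors[i][j];
    for(i=1;i<=n;i++)
        eval[i-1] = (float) values[i];

    /* release the temporary 1-based copies */
    for(i=0;i<(n+1);i++)
    {
        free(mnew[i]);
        free(vectors[i]);
    }
    free(mnew);
    free(vectors);
    free(values);
    free(iterations);
}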

void read_images(float** db, int height, int width)
{
    FILE * pFile;
    long lSize;
    float* buffer;
    int i,j;

    // open file
    pFile = fopen ( "database.txt" , "rb" );
    if (pFile == NULL)
    {
        perror ("database.txt");
        return;
    }

    // obtain file size
    fseek (pFile , 0 , SEEK_END);
    lSize = ftell (pFile);
    rewind (pFile);

    // allocate memory to contain the whole file
    buffer = (float*) malloc (lSize);

    // copy the file into the buffer.
    fread (buffer,1,lSize,pFile);

    // element (i,j) is read from position j*height+i of the file
    for(i=0;i<height;i++)
        for(j=0;j<width;j++)
            db[i][j] = buffer[j*height+i];

    // close file and release the buffer
    fclose (pFile);
    free (buffer);
}

void read_testface(float* face, int size)
{
    FILE * pFile;
    long lSize;
    float* buffer;
    int i;

    // open file
    pFile = fopen ( "testface.txt" , "rb" );
    if (pFile == NULL)
    {
        perror ("testface.txt");
        return;
    }

    // obtain file size
    fseek (pFile , 0 , SEEK_END);
    lSize = ftell (pFile);
    rewind (pFile);

    // allocate memory to contain the whole file
    buffer = (float*) malloc (lSize);

    // copy the file into the buffer.
    fread (buffer,1,lSize,pFile);

    for(i=0;i<size;i++)
        face[i] = buffer[i];

    // close file and release the buffer
    fclose (pFile);
    free (buffer);
}

10.2.2 Eigenvector Functions

#include <stdio.h>

#include <math.h>

#include <stdlib.h>

static double sqrarg;

#define SQR(a) ((sqrarg=(a)) == 0.0 ? 0.0 : sqrarg * sqrarg)

#define SIGN(a,b) ((b) >= 0.0 ? fabs(a) : -fabs(a))

#define ROTATE(a,i,j,k,l) g=a[i][j];h=a[k][l];a[i][j]=g-s*(h+g*tau);a[k][l]=h+s*(g-h*tau)

double pythag(double a, double b)
/* Computes sqrt(a^2 + b^2) without destructive underflow or
   overflow */
{
    double absa, absb;

    absa = fabs(a);
    absb = fabs(b);
    if (absa > absb)
        return (absa * sqrt(1.0 + SQR(absb / absa)));
    else
        return (absb == 0.0 ? 0.0 : absb * sqrt(1.0 + SQR(absa/absb)));
}

void jacobi (double **a, int n, double *d, double **v, int *nrot)
/* Computes all eigenvalues and eigenvectors of a real symmetric
   matrix a[1..n][1..n]. On output, elements of a above the
   diagonal are destroyed. d[1..n] returns the eigenvalues of
   a. v[1..n][1..n] is a matrix whose columns contain, on output,
   the normalized eigenvectors of a. nrot returns the number of
   Jacobi rotations that were required. */
{
    int j, iq, ip, i;
    double tresh, theta, tau, t, sm, s, h, g, c, *b, *z;

    b = (double *) calloc (n, sizeof(double));
    if (b == NULL)
    {
        perror ("calloc b in jacobi()");
        return;
    }
    b--;    /* shift the pointer so that b[1..n] is valid */
    z = (double *) calloc (n, sizeof(double));
    if (z == NULL)
    {
        perror ("calloc z in jacobi()");
        free(++b);
        return;
    }
    z--;    /* shift the pointer so that z[1..n] is valid */

    /* Initialize to the identity matrix */
    for (ip = 1; ip <= n; ip++)
    {
        for (iq = 1; iq <= n; iq++)
            v[ip][iq] = 0.0;
        v[ip][ip] = 1.0;
    }

    /* Initialize b and d to the diagonal of a. This vector will
       accumulate terms of the form ta_pq as in equation (11.1.14). */
    for (ip = 1; ip <= n; ip++)
    {
        b[ip] = d[ip] = a[ip][ip];
        z[ip] = 0.0;
    }
    *nrot = 0;

    for (i=1;i<=50;i++)
    {
        sm = 0.0;
        /* Sum off-diagonal elements */
        for (ip = 1; ip <= n-1; ip++)
        {
            for (iq = ip + 1; iq <= n; iq++)
                sm += fabs(a[ip][iq]);
        }

        /* The normal return, which relies on quadratic convergence to
           machine underflow */
        if (sm == 0.0)
        {
            free(++z);
            free(++b);
            return;
        }

        if (i < 4)
            tresh = 0.2 * sm / (n*n);   /* on the first three sweeps */
        else
            tresh = 0.0;                /* thereafter */

        for (ip=1; ip<=n-1; ip++)
        {
            for (iq=ip+1 ; iq<=n; iq++)
            {
                g = 100.0 * fabs(a[ip][iq]);

                /* After four sweeps, skip the rotation if the off-diagonal
                   element is small. */
                if (i > 4 && (double) (fabs(d[ip]) + g) == (double) fabs(d[ip])
                          && (double) (fabs(d[iq]) + g) == (double) fabs(d[iq]))
                    a[ip][iq] = 0.0;
                else if (fabs(a[ip][iq]) > tresh)
                {
                    h = d[iq] - d[ip];
                    if ((double) (fabs(h) + g) == (double) fabs(h))
                        t = (a[ip][iq]) / h;    /* t = 1/(2*theta) */
                    else
                    {
                        theta = 0.5 * h / (a[ip][iq]);  /* equation 11.1.10 */
                        t = 1.0 / (fabs(theta) + sqrt(1.0 + theta * theta));
                        if (theta < 0.0)
                            t = -t;
                    }
                    c = 1.0 / sqrt(1+t*t);
                    s = t * c;
                    tau = s / (1.0 + c);
                    h = t * a[ip][iq];
                    z[ip] -= h;
                    z[iq] += h;
                    d[ip] -= h;
                    d[iq] += h;
                    a[ip][iq] = 0.0;

                    /* ROTATE expands to several statements, so each
                       loop body needs its own braces */
                    for (j = 1; j <= ip - 1; j++)   /* Case of rotations 1 <= j < p */
                    {
                        ROTATE (a, j, ip, j, iq);
                    }
                    for (j = ip + 1; j <= iq - 1; j++)  /* Case of rotations p < j < q */
                    {
                        ROTATE (a, ip, j, j, iq);
                    }
                    for (j = iq + 1; j <= n; j++)   /* Case of rotations q < j <= n */
                    {
                        ROTATE (a, ip, j, iq, j);
                    }
                    for (j = 1; j <= n; j++)
                    {
                        ROTATE (v, j, ip, j, iq);
                    }
                    ++(*nrot);
                }
            }
        }

        /* Update d with the sum of ta_pq and reinitialize z */
        for (ip = 1; ip <= n; ip++)
        {
            b[ip] += z[ip];
            d[ip] = b[ip];
            z[ip] = 0.0;
        }
    }

    fprintf (stderr, "Too many iterations in routine jacobi\n");
    free(++z);
    free(++b);
}

void eigsrt (double *d, double **v, int n)
/* Given the eigenvalues d[1..n] and eigenvectors v[1..n][1..n]
   as output from jacobi (section 11.1) or tqli (section 11.3),
   this routine sorts the eigenvalues into descending order of
   absolute value, and rearranges the columns of v correspondingly.
   The method is straight insertion. */
{
    int k, j, i;
    double p;

    for (i = 1; i < n; i++)
    {
        p = d[k=i];
        for (j = i + 1; j <= n; j++)
            if (fabs(d[j]) >= fabs(p))
                p = d[k=j];
        if (k != i)
        {
            d[k] = d[i];
            d[i] = p;
            for (j = 1; j <= n; j++)
            {
                p = v[j][i];
                v[j][i] = v[j][k];
                v[j][k] = p;
            }
        }
    }
}

void tred2(double **a, int n, double *d, double *e)
/* Householder reduction of a real, symmetric matrix
   a[1..n][1..n]. On output, a is replaced by the orthogonal
   matrix Q effecting the transformation. d[1..n] returns the
   diagonal elements of the tridiagonal matrix, and e[1..n] the
   off-diagonal elements, with e[1] = 0. Several statements, as
   noted in comments, can be omitted if only eigenvalues are to
   be found, in which case a contains no useful information on
   output. Otherwise they are to be included. */
{
    int l, k, j, i;
    double scale, hh, h, g, f;

    for (i = n; i>= 2; i--)
    {
        l = i - 1;
        h = scale = 0.0;
        if (l > 1)
        {
            for (k = 1; k <= l; k++)
                scale += fabs(a[i][k]);
            if (scale == 0.0)           /* skip transformation */
                e[i] = a[i][l];
            else
            {
                for (k = 1; k <= l; k++)
                {
                    a[i][k] /= scale;           /* use scaled a's for transformation */
                    h += a[i][k] * a[i][k];     /* form sigma in h */
                }
                f = a[i][l];
                g = (f >= 0.0 ? -sqrt(h) : sqrt(h));
                e[i] = scale * g;
                h -= f * g;             /* Now h is equation (11.2.4) */
                a[i][l] = f-g;          /* Store u in the ith row of a. */
                f = 0.0;
                for (j = 1; j <= l; j++)
                {
                    /* Next statement can be omitted if eigenvectors not wanted */
                    a[j][i] = a[i][j] / h;  /* Store u/H in ith column of a. */
                    g = 0.0;                /* Form an element of Au in g. */
                    for (k = 1; k <= j; k++)
                        g += a[j][k] * a[i][k];
                    for (k = j+1; k <= l; k++)
                        g += a[k][j] * a[i][k];
                    e[j] = g/h;     /* Form element of p in temporarily
                                       unused element of e */
                    f += e[j] * a[i][j];
                }
                hh = f / (h + h);   /* Form K, equation (11.2.11). */
                for (j = 1; j <= l; j++)    /* Form q and store in e overwriting p */
                {
                    f = a[i][j];    /* Note that e[l] = e[i-1] survives */
                    e[j] = g = e[j] - hh * f;
                    for (k = 1; k <= j; k++)    /* Reduce a, equation (11.2.13) */
                        a[j][k] -= (f * e[k] + g * a[i][k]);
                }
            }
        }
        else
            e[i] = a[i][l];
        d[i] = h;
    }

    /* Next statement can be omitted if eigenvectors not wanted */
    d[1] = 0.0;
    e[1] = 0.0;

    /* Contents of this loop can be omitted if eigenvectors not wanted
       except for statement d[i] = a[i][i]; */
    for (i = 1; i <= n; i++)    /* Begin accumulation of transformation matrices */
    {
        l = i - 1;
        if (d[i])               /* This block skipped when i = 1 */
        {
            for (j = 1; j <= l; j++)
            {
                g = 0.0;
                for (k = 1; k <= l; k++)    /* Use u and u/H stored in a to form PQ */
                    g += a[i][k] * a[k][j];
                for (k = 1; k <= l ; k++)
                    a[k][j] -= g * a[k][i];
            }
        }
        d[i] = a[i][i];         /* This statement remains */
        a[i][i] = 1.0;          /* Reset row and column of a to identity
                                   matrix for next iteration */
        for (j = 1; j <= l; j++)
            a[j][i] = a[i][j] = 0.0;
    }
}

void
tqli(double *d, double *e, int n, double **z)
/* QL algorithm with implicit shifts, to determine the
   eigenvalues and eigenvectors of a real, symmetric, tridiagonal
   matrix, or of a real, symmetric matrix previously reduced by
   tred2 (section 11.2). On input, d[1..n] contains the diagonal
   elements of the tridiagonal matrix. On output, it returns the
   eigenvalues. The vector e[1..n] inputs the subdiagonal
   elements of the tridiagonal matrix, with e[1] arbitrary. On
   output, e is destroyed. When finding only the eigenvalues,
   several lines may be omitted, as noted in the comments. If
   the eigenvectors of a tridiagonal matrix are desired, the
   matrix z[1..n][1..n] is input as the identity matrix. If the
   eigenvectors of a matrix that has been reduced by tred2 are
   required, then z is input as the matrix output by tred2. In
   either case, the kth column of z returns the normalized
   eigenvector corresponding to d[k]. */
{
    double pythag (double a, double b);
    int m, l, iter, i, k;
    double s, r, p, g, f, dd, c, b;

    /* Convenient to renumber the elements of e */
    for (i = 2; i <= n; i++)
        e[i-1] = e[i];
    e[n] = 0.0;

    for (l = 1; l <= n; l++)
    {
        iter = 0;
        do
        {
            /* Look for a single small subdiagonal element to split the
               matrix */
            for (m = l; m <= n - 1; m++)
            {
                dd = fabs(d[m]) + fabs(d[m+1]);
                if ((double) (fabs(e[m]) + dd) == dd)
                    break;
            }
            if (m != l)
            {
                if (iter++ == 30)
                    fprintf (stderr, "Too many iterations in tqli\n");
                g = (d[l+1] - d[l]) / (2.0 * e[l]); /* Form shift */
                r = pythag(g, 1.0);
                g = d[m] - d[l] + e[l] / (g + SIGN(r,g)); /* this is d_m - k_s */
                s = c = 1.0;
                p = 0.0;
                /* Page 2 */
                /* A plane rotation as in the original QL, followed by Givens
                   rotations to restore tridiagonal form. */
                for (i = m-1; i >= l; i--)
                {
                    f = s * e[i];
                    b = c * e[i];
                    e[i+1] = (r = pythag(f,g));
                    /* recover from underflow */
                    if (r == 0.0)
                    {
                        d[i+1] -= p;
                        e[m] = 0.0;
                        break;
                    }
                    s = f/r;
                    c = g/r;
                    g = d[i+1] - p;
                    r = (d[i] - g) * s + 2.0 * c * b;
                    d[i+1] = g + (p = s * r);
                    g = c * r - b;
                    /* Next loop can be omitted if eigenvectors not wanted */
                    /* Form eigenvectors */
                    for (k = 1; k <= n; k++)
                    {
                        f = z[k][i+1];
                        z[k][i+1] = s * z[k][i] + c * f;
                        z[k][i] = c * z[k][i] - s * f;
                    }
                }
                if (r == 0.0 && i >= l)
                    continue;
                d[l] -= p;
                e[l] = g;
                e[m] = 0.0;
            }
        }
        while (m != l);
    }
}
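
The header comment of tqli describes arrays indexed from 1 (d[1..n], e[1..n], z[1..n][1..n]), whereas C arrays start at 0. One way to prepare such arguments, sketched here on the assumption that leaving element 0 unused is acceptable, is to allocate n+1 entries per dimension. The helper names vector1, identity1 and free_matrix1 are hypothetical and do not appear in the report's code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a zeroed vector usable as v[1..n].
   Element v[0] is allocated but unused. Returns NULL on failure. */
double *vector1(int n)
{
    assert(n >= 0);
    return calloc((size_t)n + 1, sizeof(double));
}

/* Hypothetical helper: allocate a matrix usable as m[1..n][1..n] and
   set it to the identity, the input tqli expects for z when the
   matrix is already tridiagonal. Returns NULL on failure. */
double **identity1(int n)
{
    int i;
    double **m;

    assert(n >= 0);
    m = calloc((size_t)n + 1, sizeof(double *));
    if (m == NULL)
        return NULL;
    for (i = 0; i <= n; i++)
    {
        m[i] = calloc((size_t)n + 1, sizeof(double));
        if (m[i] == NULL)
        {
            while (i-- > 0)     /* release the rows allocated so far */
                free(m[i]);
            free(m);
            return NULL;
        }
    }
    for (i = 1; i <= n; i++)
        m[i][i] = 1.0;
    return m;
}

/* Release a matrix obtained from identity1. */
void free_matrix1(double **m, int n)
{
    int i;

    if (m == NULL)
        return;
    for (i = 0; i <= n; i++)
        free(m[i]);
    free(m);
}
```

With these helpers, a caller would fill d[1..n] and e[2..n], obtain z from identity1(n), and pass all three to tqli unchanged.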


10.2.3 PCA Algorithm

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "matrix.c"

#define NUMFACES 51
#define FACESIZE 18750

int main(void)
{
    /*****declarations*****/
    int i,i_mark;
    float** database;
    float** database_trans;
    float* average;
    float** L;
    float* eigenvalues;
    float** eigenvectors;
    float** eigenvectors_orig;
    float** projections;
    float** test_face;
    float** test_projection;
    float min, temp_min;

    /***initializations****/
    database = malloc(NUMFACES*sizeof(float*));
    for(i=0;i<NUMFACES;i++)
        database[i] = malloc(FACESIZE*sizeof(float));
    average = malloc(FACESIZE*sizeof(float));
    L = malloc(NUMFACES*sizeof(float*));
    for(i=0;i<NUMFACES;i++)
        L[i] = malloc(NUMFACES*sizeof(float));
    database_trans = malloc(FACESIZE*sizeof(float*));
    for(i=0;i<FACESIZE;i++)
        database_trans[i] = malloc(NUMFACES*sizeof(float));
    eigenvalues = malloc(NUMFACES*sizeof(float));
    eigenvectors = malloc(NUMFACES*sizeof(float*));
    for(i=0;i<NUMFACES;i++)
        eigenvectors[i] = malloc(NUMFACES*sizeof(float));
    eigenvectors_orig = malloc(FACESIZE*sizeof(float*));
    for(i=0;i<FACESIZE;i++)
        eigenvectors_orig[i] = malloc(NUMFACES*sizeof(float));
    projections = malloc(NUMFACES*sizeof(float*));
    for(i=0;i<NUMFACES;i++)
        projections[i] = malloc(NUMFACES*sizeof(float));
    test_face = malloc(sizeof(float*));
    for(i=0;i<1;i++)
        test_face[i] = malloc(FACESIZE*sizeof(float));
    test_projection = malloc(sizeof(float*));
    for(i=0;i<1;i++)
        test_projection[i] = malloc(NUMFACES*sizeof(float));

    /*****pca training*****/
    // obtain database
    read_images(database,NUMFACES, FACESIZE);
    // find average face
    matrix_average(database,NUMFACES,FACESIZE,average);
    // normalize database
    matrix_subtract(database,NUMFACES,FACESIZE,average,database);
    // compute L matrix
    matrix_transpose(database,NUMFACES,FACESIZE,database_trans);
    matrix_multiply(database,NUMFACES,FACESIZE,database_trans,FACESIZE,NUMFACES,L);
    // compute eigenvectors of L
    eig(L,NUMFACES,eigenvalues,eigenvectors);
    // derive eigenvectors of original matrix
    matrix_multiply(database_trans,FACESIZE,NUMFACES,eigenvectors,NUMFACES,NUMFACES,eigenvectors_orig);
    // compute face projections
    matrix_multiply(database,NUMFACES,FACESIZE,eigenvectors_orig,FACESIZE,NUMFACES,projections);

    /***pca recognition****/
    // obtain test face
    read_testface(test_face[0],FACESIZE);
    // normalize test face
    matrix_subtract(test_face,1,FACESIZE,average,test_face);
    // project test face
    matrix_multiply(test_face,1,FACESIZE,eigenvectors_orig,FACESIZE,NUMFACES,test_projection);
    // compute minimum distance
    for(i=0;i<NUMFACES;i++)
    {
        if(i==0)
        {
            min = vector_distance(test_projection[0],projections[i],NUMFACES);
            i_mark = i;
        }
        else
        {
            temp_min = vector_distance(test_projection[0],projections[i],NUMFACES);
            if(temp_min < min)
            {
                min = temp_min;
                i_mark = i;
            }
        }
    }
    printf("The minimum distance belongs to face %i and has a value of %f\n",i_mark+1, min);
    return 0;
}
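
The "compute L matrix" step multiplies the normalized database (NUMFACES x FACESIZE) by its transpose, so the eigenproblem is solved on a NUMFACES x NUMFACES matrix instead of a FACESIZE x FACESIZE one. The following is a minimal sketch of that product, assuming the same float** row layout as the listing. The function name gram_matrix is hypothetical, and unlike a general matrix multiply it exploits the symmetry of L.

```c
#include <assert.h>

/* Hypothetical helper: l = a * a^T for a matrix a with `rows` rows and
   `cols` columns. l must provide rows x rows entries. */
void gram_matrix(float **a, int rows, int cols, float **l)
{
    int i, j, k;

    assert(rows > 0 && cols > 0);
    for (i = 0; i < rows; i++)
    {
        for (j = 0; j <= i; j++)
        {
            double sum = 0.0;   /* accumulate in double to limit round-off */
            for (k = 0; k < cols; k++)
                sum += (double)a[i][k] * a[j][k];
            l[i][j] = l[j][i] = (float)sum;   /* L is symmetric */
        }
    }
}
```

With NUMFACES = 51 and FACESIZE = 18750, L has 51 x 51 = 2601 entries, whereas the full FACESIZE x FACESIZE matrix would have 351,562,500.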

10.3 Ethernet Code in C#

// read binary file
FileStream fs = File.OpenRead("b_testface_int");
BinaryReader br = new BinaryReader(fs);

// destination mac
packet[0] = 0x01;
packet[1] = 0x06;
packet[2] = 0x07;
packet[3] = 0x08;
packet[4] = 0x09;
packet[5] = 0x04;

// source mac
packet[6] = 0x00;
packet[7] = 0x56;
packet[8] = 0x00;
packet[9] = 0xFF;
packet[10] = 0x02;
packet[11] = 0xC5;

// length of data bytes
packet[12] = 0x03;
packet[13] = 0xE8;

for (i = 0; i < 75000; i++)
{
    packet[14 + i % DATA_SIZE] = br.ReadByte();
    if (i % DATA_SIZE == 999)
    {
        rawether.DoWrite(packet);
        count++;
        for (j = 0; j < 150000; j++) ;
    }
}
Console.WriteLine(count.ToString());
br.Close();
fs.Close();
count = 0;

// read binary file
fs = File.OpenRead("b_avg_int");
br = new BinaryReader(fs);

// destination mac
packet[0] = 0x01;
packet[1] = 0x06;
packet[2] = 0x07;
packet[3] = 0x08;
packet[4] = 0x09;
packet[5] = 0x04;

// source mac
packet[6] = 0x00;
packet[7] = 0x56;
packet[8] = 0x00;
packet[9] = 0xFF;
packet[10] = 0x02;
packet[11] = 0xC5;

// length of data bytes
packet[12] = 0x03;
packet[13] = 0xE8;

for (i = 0; i < 75000; i++)
{
    packet[14 + i % DATA_SIZE] = br.ReadByte();
    if (i % DATA_SIZE == 999)
    {
        rawether.DoWrite(packet);
        count++;
        for (j = 0; j < 150000; j++) ;
    }
}
Console.WriteLine(count.ToString());
br.Close();
fs.Close();
count = 0;

// read binary file
fs = File.OpenRead("b_eigen_int");
br = new BinaryReader(fs);

// destination mac
packet[0] = 0x01;
packet[1] = 0x06;
packet[2] = 0x07;
packet[3] = 0x08;
packet[4] = 0x09;
packet[5] = 0x04;

// source mac
packet[6] = 0x00;
packet[7] = 0x56;
packet[8] = 0x00;
packet[9] = 0xFF;
packet[10] = 0x02;
packet[11] = 0xC5;

// length of data bytes
packet[12] = 0x03;
packet[13] = 0xE8;

for (i = 0; i < 3825000; i++)
{
    packet[14 + i % DATA_SIZE] = br.ReadByte();
    if (i % DATA_SIZE == 999)
    {
        rawether.DoWrite(packet);
        count++;
        for (j = 0; j < 150000; j++) ;
    }
}
Console.WriteLine(count.ToString());
br.Close();
fs.Close();
count = 0;

// read binary file
fs = File.OpenRead("b_proj_int");
br = new BinaryReader(fs);

// destination mac
packet[0] = 0x01;
packet[1] = 0x06;
packet[2] = 0x07;
packet[3] = 0x08;
packet[4] = 0x09;
packet[5] = 0x04;

// source mac
packet[6] = 0x00;
packet[7] = 0x56;
packet[8] = 0x00;
packet[9] = 0xFF;
packet[10] = 0x02;
packet[11] = 0xC5;

// length of data bytes
packet[12] = 0x03;
packet[13] = 0xE8;

for (i = 0; i < 10404; i++)
{
    packet[14 + i % DATA_SIZE] = br.ReadByte();
    if (i == 10403)
    {
        rawether.DoWrite(packet);
        count++;
        break;
    }
    if (i % DATA_SIZE == 999)
    {
        rawether.DoWrite(packet);
        count++;
        for (j = 0; j < 150000; j++) ;
    }
}
Console.WriteLine(count.ToString());
br.Close();
fs.Close();
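
Each frame built above carries a 14-byte header followed by up to 1000 data bytes (the length field 0x03E8), so the number of frames per file follows from the file size. The sketch below reproduces that arithmetic in C. The function names frames_for and total_frames are hypothetical, and the payload size of 1000 is inferred from the `i % DATA_SIZE == 999` test, since the definition of DATA_SIZE is not shown.

```c
#include <assert.h>

/* Hypothetical helper: number of frames needed to send `bytes` bytes
   when each frame carries at most `payload` data bytes. A final,
   partially filled frame counts as one frame. */
long frames_for(long bytes, long payload)
{
    assert(bytes >= 0 && payload > 0);
    return (bytes + payload - 1) / payload;
}

/* Total over the four files sent by the C# listing, using the loop
   bounds that appear there. */
long total_frames(void)
{
    const long payload = 1000;              /* assumed value of DATA_SIZE */

    return frames_for(75000, payload)       /* b_testface_int */
         + frames_for(75000, payload)       /* b_avg_int */
         + frames_for(3825000, payload)     /* b_eigen_int */
         + frames_for(10404, payload);      /* b_proj_int */
}
```

Under these assumptions the total is 75 + 75 + 3825 + 11 = 3986 frames, which agrees with TOTAL_FRAMES in the FPGA code. The last frame holds 404 bytes, matching the `i<418` bound (14 + 404) used there for the final frame.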

10.4 FPGA Code

/***************************** Include Files *********************************/
#include "xparameters.h"
#include "xbasic_types.h"
#include "xemac_l.h"
#include "xio.h"
#include "xpacket_fifo_l_v2_00_a.h"
#include "xddr.h"
#include "xddr_l.h"
#include "time.h"
#include "xtmrctr.h"

/************************** Constant Definitions *****************************/
#define EMAC_HDR_SIZE 14 /* size of Ethernet header */
#define MAC_ADDR_SIZE 6 /* size of MAC address */
#define MAX_FRAME_SIZE 1500
#define MAX_FRAME_SIZE_IN_WORDS ((MAX_FRAME_SIZE / sizeof(Xuint32)) + 1)
#define EMAC_BASEADDR 0x40c00000
#define MEM_BASEADDR 0x22000000
#define TESTFACE_BASEADDR 0
#define TESTFACE_SIZE 75000
#define AVGFACE_BASEADDR 75000
#define AVGFACE_SIZE 75000
#define EIG_BASEADDR 150000
#define EIG_SIZE 3825000
#define PROJ_BASEADDR 3975000
#define PROJ_SIZE 10404
#define TOTAL_FRAMES 3986
#define NUMFACES 51

// PROTOTYPES
int XEmac_RecvFrameSS(Xuint32 BaseAddress, Xuint8 *FramePtr);
void wait(Xuint32 time);

// mac address of the FPGA
static Xuint8 LocalAddress[MAC_ADDR_SIZE] =
{
    0x01, 0x06, 0x07, 0x08, 0x09, 0x04
};

static Xuint8 RxFrameBuf[MAX_FRAME_SIZE];
Xuint32 FrameCount = 0;
Xuint32 GoodFrameCount = 0;

int main ()
{
    printf("Inside MAIN\r\n");

    int FrameSize;
    int Length;
    int imark;
    //Integer that should be written to control register
    Xuint32 setting_control;
    Xuint32 rec1,rec2,rec3,rec4,word,memcount,r1,r2,product,index;
    Xuint32 projections[NUMFACES];
    Xfloat32 fdist,ftemp, min;
    Xint32 int1;
    Xfloat32 ftest;
    Xfloat32 xflt1,xflt2;
    XTmrCtr timer;
    Xuint32 cycles;

    XTmrCtr_Initialize(&timer,XPAR_OPB_TIMER_0_DEVICE_ID);
    XTmrCtr_SetResetValue(&timer,0,0x00000000);

    memcount = 0;
    setting_control=2409652224;
    XEmac_mWriteReg(EMAC_BASEADDR, XEM_ECR_OFFSET,setting_control);

    //set MAC address
    XEmac_mSetMacAddress(EMAC_BASEADDR, LocalAddress);

    printf("Ready...\r\n");

    int i,j;

    // receive and store all frames
    while (GoodFrameCount < TOTAL_FRAMES)
    {
        Length = XEmac_RecvFrameSS(EMAC_BASEADDR, (Xuint8 *)RxFrameBuf);
        if (Length == -1)
        {
            continue;
        }
        //printf("Back from RECEIVE Function with Good Packet, Length = %d\r\n",Length);
        GoodFrameCount++;
        //printf("Good Frame Count : %d\r\n", GoodFrameCount);

        // for last frame
        if (GoodFrameCount == 3986)
        {
            for (i=14; i<418; i+=4)
            {

                rec2 <<= 8;
                rec3 <<= 16;
                rec4 <<= 24;
                word = 0;

                memcount++;
            }
            break;
        }

        for (i=14; i<1014; i+=4)
        {

            rec2 <<= 8;
            rec3 <<= 16;
            rec4 <<= 24;
            word = 0;

            memcount++;
        }
    }

    // RECOGNITION STAGE

    // normalization stage

    for (i=0; i<TESTFACE_SIZE; i+=4)
    {
        r1 = XDdr_mReadReg (MEM_BASEADDR, TESTFACE_BASEADDR+i);
        r2 = XDdr_mReadReg (MEM_BASEADDR, AVGFACE_BASEADDR+i);
        r1 = r1 - r2;
        XDdr_mWriteReg (MEM_BASEADDR, TESTFACE_BASEADDR+i, r1);
    }

    xil_printf("Normalization Cycles: %d\r\n",cycles);

    // projection stage

    for (i=0; i<NUMFACES; i++)
    {
        product = 0;
        for (j=0; j<TESTFACE_SIZE; j+=4)
        {
            r1 = XDdr_mReadReg (MEM_BASEADDR, EIG_BASEADDR + i*4+j*NUMFACES);
            r2 = XDdr_mReadReg (MEM_BASEADDR, TESTFACE_BASEADDR+j);
            product += r1 * r2;
        }
        projections[i] = product;
    }

    xil_printf("Projection Cycles: %d\r\n",cycles);

    // distances

    for (i=0; i<NUMFACES;i++)
    {
        ftemp = 0;
        for (j=0; j<NUMFACES; j++)
        {
            r1 = XDdr_mReadReg (MEM_BASEADDR, PROJ_BASEADDR+(i*NUMFACES+j)*4);
            int1 = r1 - projections[j];
            ftest = (Xfloat32) int1;
            ftemp += ftest*ftest;
        }
        //fdist = sqrt(ftemp);
        fdist = ftemp;
        if (i==0)
        {
            min = fdist;
            imark = i;
        }
        else
        {
            if (fdist < min)
            {
                min = fdist;
                imark = i;
            }
        }
    }

    xil_printf("Distance Cycles: %d\r\n",cycles);
    xil_printf("The face is %d and has a value of %d\n",imark+1, (int) min);
    return 0;
}

// RECEIVE FRAME FUNCTION
int XEmac_RecvFrameSS(Xuint32 BaseAddress, Xuint8 *FramePtr)
{
    //printf("Received a frame\r\n");
    XStatus check;
    check=XFALSE;
    int Length;

    //Wait for a frame to arrive

    while (check==XTRUE)
    {

    }

    FrameCount++;
    if (FrameCount % 100 == 0)
    {
        printf("FrameCount : %d\r\n", FrameCount);
    }

    //Get the length of the frame that arrived
    Length = XIo_In32(BaseAddress + XEM_RPLR_OFFSET);

    /*
     * Use the packet fifo driver to read the FIFO. We assume the Length is
     * valid and there is enough data in the FIFO - so we ignore the return
     * code.
     */
    (void)XPacketFifoV200a_L0Read(BaseAddress + XEM_PFIFO_RXREG_OFFSET,
                                  BaseAddress + XEM_PFIFO_RXDATA_OFFSET,
                                  FramePtr, Length);

    /*
     * Clear the status now that the length is read so we're ready again
     * next time
     */
    XIo_Out32(BaseAddress + XEM_ISR_OFFSET, XEM_EIR_RECV_DONE_MASK);

    if (FramePtr[0] == 1 &&
        FramePtr[1] == 6 &&
        FramePtr[2] == 7 &&
        FramePtr[3] == 8 &&
        FramePtr[4] == 9 &&
        FramePtr[5] == 4)
    {
        //printf ("Received a GOOD Packet\r\n");
        return Length;
    }
    else
    {
        return -1;
    }
}

void wait(Xuint32 time)
{
    Xuint32 cnt = 0;
    while(cnt<time)
    {
        cnt++;
    }
}
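
The receive loop in main shifts rec2, rec3 and rec4 left by 8, 16 and 24 bits before forming word, which suggests that four consecutive payload bytes are combined into one 32-bit value with the first byte least significant. The following is a minimal sketch of that packing under this assumed byte order. The names pack_le32 and unpack_frame are hypothetical, and the <stdint.h> types stand in for Xuint8 and Xuint32 so that the sketch compiles without the Xilinx headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combine four bytes, least significant first,
   into one 32-bit word. */
uint32_t pack_le32(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Unpack `len` payload bytes that follow a header of `hdr` bytes into
   32-bit words. Returns the number of words written to `out`; trailing
   bytes that do not fill a whole word are ignored. */
size_t unpack_frame(const uint8_t *frame, size_t hdr, size_t len,
                    uint32_t *out)
{
    size_t i, n = 0;

    for (i = 0; i + 4 <= len; i += 4)
        out[n++] = pack_le32(frame + hdr + i);
    return n;
}
```

With a 14-byte header, a full 1000-byte payload yields 250 words and the final 404-byte payload yields 101, the same counts as the loops `for (i=14; i<1014; i+=4)` and `for (i=14; i<418; i+=4)`.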

10.5 Recognition Results

Reading database...

Calculating average face...

Normalizing...

Computing L matrix...

Computing Eigenvectors of L...

Deriving original Eigenvectors...

Computing Projections...

Done...

Floating Point Implementation : Projection Distance Calculations

284757230.717693

279660242.475770

0.000000

15906094.951599

43353201.762848

196593658.015564

208707188.981688

217342335.956240

91545197.913032

290078638.103123

287208901.165014

286391165.244354

134212683.905218

114121723.805576

95377111.228415

64110028.836780

70614899.348352

73224709.697560

108391813.933073

108619514.482631

126522187.291307


285006939.446492

287533372.651606

289339502.859201

289201806.462897

271901321.489990

270497789.759402

264169974.413589

255327852.287155

258682538.304327

251213559.123769

259579475.746799

263522708.902784

243460224.373596

256767062.230726

249083248.619941

78699745.431568

257229726.685719

264791717.983083

303273149.042238

307864942.616162

300581205.147086

293359560.382787

298626547.239392

291808943.826539

302661000.071662

301379863.871029

308336091.791721

332560286.526925

345542685.107667

346003889.433077

The minimum distance belongs to face 3 and has a value of 0.000000

***

Integer Implementation : Projection Distance Calculations

284225955.814634

279126687.096212

620494.066290

15590958.052778

42948163.259742

196103692.379704

208225158.720017

216861641.784979

91093926.417311

289572508.175456

286698981.540504

285883839.142958

133857089.836859

113775500.038695

95048959.642352

63801517.932157

70310480.676750

72818549.508288

108058501.418556

108302177.848746


126181027.633667

284607217.509605

287127633.331032

288925692.704150

288760773.338336

271457671.284751

270053651.838463

263708795.309670

254867805.414352

258215407.944628

250731364.175974

259102093.073329

263054021.925713

242983661.512971

256275114.290106

248603954.983011

78496933.796400

256721561.955371

264280157.572359

302767434.835988

307361433.381023

300074824.514051

292915766.134634

298183088.693402

291376106.729249

302256564.576121

300979732.587588

307934681.917593

332104472.690896

345087533.640911

345549694.756852

The minimum distance belongs to face 3 and has a value of 620494.066290

Press any key to continue
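
Each run above lists one distance per database face and then reports the smallest. Because the square root is monotonic, comparing squared distances selects the same face, so the commented-out sqrt in the FPGA code does not change which face is chosen. The following is a minimal sketch of that search. The name nearest_face is hypothetical, and the projections are assumed to be stored as rows of equal length.

```c
#include <assert.h>

/* Hypothetical helper: return the 0-based index of the row of `proj`
   closest to `test` in squared Euclidean distance. If `best` is not a
   null pointer, the smallest squared distance is stored in *best. */
int nearest_face(float **proj, int faces, int dim, const float *test,
                 double *best)
{
    int i, j, mark = 0;
    double min = 0.0;

    assert(faces > 0 && dim > 0);
    for (i = 0; i < faces; i++)
    {
        double dist = 0.0;
        for (j = 0; j < dim; j++)
        {
            double diff = (double)proj[i][j] - test[j];
            dist += diff * diff;
        }
        if (i == 0 || dist < min)   /* first face initialises the minimum */
        {
            min = dist;
            mark = i;
        }
    }
    if (best)
        *best = min;
    return mark;
}
```

The listings print the face number as the index plus one, so a return value of 2 corresponds to "face 3" in the output above.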
