DESIGN AND ANALYSIS OF A CUSTOM COMPUTING ARCHITECTURE FOR
THE UPGMA BIOINFORMATICS ALGORITHM
by
Sreesa Akella
Bachelor of Science, Andhra University, 1998
__________________________________________________
Submitted in Partial Fulfillment of the
Requirements for the Degree of Master of Science in the
Department of Computer Science and Engineering
College of Engineering and Information Technology
University of South Carolina
2003
______________________________
Director of Thesis
Department of Computer Science and Engineering

______________________________
2nd Reader
Department of Computer Science and Engineering

______________________________
3rd Reader
Department of Computer Science and Engineering

______________________________
Dean of the Graduate School
ACKNOWLEDGEMENTS
I would like to express my deepest gratitude to my thesis advisor, Dr. James P.
Davis for the relentless motivation and support he provided, aiding me to complete my
thesis on time. His unflinching optimism, undying enthusiasm and focus towards this
project inspired me to a great degree. His constant advice pushed me to look at a
problem from a different perspective and helped me visualize concepts in a broader manner.
I would like to extend my appreciation to Dr. Duncan Buell and Dr. John Rose for
their continuous guidance and inspiration. Their valuable advice from time to time
steered this project in the right direction.
I would also like to thank my parents and friends, who have been a constant force
of motivation and support that sustained me through tough times and helped me achieve
this goal.
ABSTRACT
In recent years, reconfigurable custom computing has become an increasingly
viable option for implementing high-performance computing applications.
Reconfigurable VLSI logic, on which custom computing systems are built, provides
several orders of magnitude speed-up in execution performance of algorithms over the
execution of these on conventional microprocessor-based systems. In addition, such
systems have the flexibility to program--and reprogram via reconfiguration--the actual
logic functions of the VLSI circuit with different applications in time and space. Custom
computing systems are implemented using FPGA custom-logic devices that are easily
and quickly programmed by an end-user. This research presents the design and analysis
of a custom computing application architecture for the UPGMA Bioinformatics algorithm
implemented on an FPGA-based custom-computing platform. We present the
Bioinformatics problem domain and architectures that were implemented and assessed.
We also discuss the final architecture created and present results of the system
performance, as measured and compared against that of the UPGMA algorithm written in
C, running on a single-processor Pentium® PC.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS.............................................................................................II
ABSTRACT.....................................................................................................................III
TABLE OF CONTENTS................................................................................................IV
LIST OF TABLES.........................................................................................................VII
LIST OF FIGURES.....................................................................................................VIII
INTRODUCTION.............................................................................................................1
1.1 VON NEUMANN VERSUS RECONFIGURABLE CUSTOM COMPUTING................................1
1.2 CUSTOM LOGIC DESIGN VERSUS CUSTOM COMPUTING........................................2
1.3 FIELD PROGRAMMABLE GATE ARRAYS.....................................................3
1.4 APPLICATION PROGRAMMING AND DESIGN STYLES..........................................5
1.5 THESIS PROPOSAL....................................................................7
1.5.1 Thesis Research Objective and Tasks.................................................................8
BACKGROUND..............................................................................................................11
2.1 PHYLOGENETICS AND TREE-RECONSTRUCTION METHODS.....................................11
2.1.1 Background on trees.............................................................12
2.1.2 Phylogenetic Algorithms.........................................................13
2.2 THE UPGMA.........................................................................14
2.2.1 Algorithm.......................................................................14
2.2.2 Complexity and Bottlenecks on UPGMA.............................................16
2.3 FIELD PROGRAMMABLE GATE ARRAYS....................................................17
2.3.1 Input Output Blocks (IOBs)......................................................18
2.3.2 Configurable Logic Blocks (CLBs)................................................19
2.3.3 Programmable Routing Matrix.....................................................20
2.3.4 Resources on a Virtex-E chip....................................................21
2.4 RECONFIGURABLE COMPUTING...............................................................................22
DISCUSSION OF THE WILDCARD CUSTOM COMPUTING PLATFORM......25
3.1 THE ANNAPOLIS WILDCARD™ SYSTEM....................................................25
3.2 THE WILDCARD™ SYSTEM VHDL MODEL...................................................27
3.3 WILDCARD™ HOST PROGRAMMING........................................................29
3.3.1 Opening and Closing the WILDCARD™ board.........................................30
3.3.2 Clock Control...................................................................31
3.3.3 Processing Element and Interrupt Control........................................31
3.3.4 Memory Control..................................................................32
3.4 PE EMBEDDED APPLICATION INITIALIZATION.........................................................33
CUSTOM COMPUTING DESIGN OF UPGMA........................................................34
4.1 VLSI DESIGN FLOW..................................................................34
4.2 UPGMA PROJECT DESIGN FLOW.........................................................36
4.3 UPGMA DESIGN......................................................................37
4.3.1 Design Parameters...............................................................38
4.3.2 Design Datapath.................................................................39
4.3.3 Design Architecture.............................................................41
4.3.1 Adder...........................................................................41
4.3.2 Add Register....................................................................42
4.3.3 Height Adder....................................................................43
4.3.4 Height Register.................................................................43
4.3.5 Multiplier......................................................................43
4.3.6 Multiplier Register.............................................................44
4.3.7 Divider Unit....................................................................44
4.3.8 Divider Register................................................................44
4.3.9 Comparator......................................................................45
4.3.10 Least Distance Register........................................................45
4.3.11 Row and Column Registers.......................................................45
4.3.12 Controller.....................................................................45
4.3.13 Counter Units..................................................................47
4.3.14 Multiplexers...................................................................47
4.3.15 Address Generator..............................................................47
4.3.16 Output Generator...............................................................51
4.3.17 Height memory..................................................................51
4.3.18 Off-chip Memory Banks..........................................................51
4.3.19 Addressing Schemes.............................................................53
4.3.20 Top-level Block................................................................56
4.4 DESIGN VERIFICATION.............................................................................................56
EXPERIMENTAL DATA SET AND PERFORMANCE MEASUREMENT...........58
5.1 EXPERIMENTAL APPARATUS FOR UPGMA..................................................58
5.2 Generating Random Taxa Test Data Sets.............................................60
5.3 Measuring Time....................................................................62
EXPERIMENTAL METHOD AND RESULTS...........................................................65
6.1 RUNNING THE EXPERIMENTS...........................................................65
6.2 EXPERIMENTAL RESULTS FOR LATENCY..................................................66
6.3 BOUNDING TIME COMPLEXITY..........................................................72
6.4 BENCHMARKING AGAINST PHYLIP.......................................................75
SUMMARY AND CONCLUSIONS..............................................................................83
7.1 SUMMARY OF RESEARCH CONTRIBUTIONS.................................................83
7.2 CONCLUSIONS.......................................................................84
7.3 FUTURE WORK.......................................................................86
7.3.1 Memory size and Memory address schemes..........................................86
7.3.2 Latency for a Memory Read.......................................................87
7.3.3 Device size.....................................................................88
BIBLIOGRAPHY............................................................................................................90
APPENDIX A...................................................................................................................93
VHDL SOURCE CODE................................................................................................93
APPENDIX B.................................................................................................................162
CUSTOM COMPUTING MACHINE HOST PROGRAM SOURCE CODE...........162
LIST OF TABLES
TABLE 1 Virtex – E Chip Resources………………...……………….…………….21
TABLE 2 Address Mapping…………….…………………………….…………….55
TABLE 3 Timing Results for permuted data for 32 taxa dataset….….…………….67
TABLE 4 Latency Values for Datasets at Generated Number of Taxa.…………….71
TABLE 5 PHYLIP Run-time Raw Dataset………………………..….…………….78
TABLE 6 Data Comparison Between Hardware and Software UPGMA
Implementations………………………………………………………….79
LIST OF FIGURES
FIGURE 1 Architecture of an FPGA Device….………………………………………4
FIGURE 2 A Phylogenetic Tree showing a Relationship between Four Species...….12
FIGURE 3 Distance Matrix…………………………………………….…………….15
FIGURE 4 Structure of Xilinx XCV300E Device...………………….…………….18
FIGURE 5 Virtex – E Input Output Block Architecture……………….…………….19
FIGURE 6 A Two-Slice Virtex – E CLB...…………………………….…………….20
FIGURE 7 The WILDCARD™ Platform Block Diagram…………….…………….26
FIGURE 8 The WILDCARD™ Software Design Hierarchy...…….….…………….30
FIGURE 9 An HDL-based Design Process Model…………………….…………….35
FIGURE 10 Design Datapath...………………………………………….…………….40
FIGURE 11 Block Diagram of UPGMA Architecture………………….…………….42
FIGURE 12 The Controller Algorithm………………………………….…………….46
FIGURE 13 Typical Read Cycle from Memory..……………………….…………….52
FIGURE 14 Typical Write Cycle from Memory.……………………….…………….52
FIGURE 15 Distance Matrix…………………………………………….…………….53
FIGURE 16 Test Data Generator Input Dialog Box………………………………......61
FIGURE 17 Frequency Distribution for Latency versus Taxa Data Set Permutation...67
FIGURE 18 Mean Latency versus Number of Taxa (Normal Scale)..….…………….69
FIGURE 19 Latency versus Number of Taxa..………………………….…………….70
FIGURE 20 Bounding of Latency by time Complexity Functions..…….…………….73
FIGURE 21 Bounding Latency by Time Complexity Functions Computed in
Excel……………………………………………………………….…….74
FIGURE 22 PHYLIP C run-time performance...……………………….………….….76
FIGURE 23 PHYLIP C run-time performance with Time-Complexity Bounding.…..77
FIGURE 24 Plotting the performance improvement over PHYLIP as Taxa Count
grows.…………………………………………………………………….80
FIGURE 25 Plotting the performance difference as Taxa count grows...……….…….81
FIGURE 26 Plotting the performance difference as Taxa count grows (Log Plot).…..82
CHAPTER 1
INTRODUCTION
1.1 Von Neumann versus Reconfigurable Custom Computing
In recent years, reconfigurable custom computing has become an increasingly
viable option for implementing applications requiring high-performance or complex
computations. It is an area that is not as mature as the use of conventional computing
architectures. Traditionally, general-purpose computing involves a serial thread of
executing code running on one or more microprocessors. This microprocessor-based
computing paradigm is considered "general-purpose" in that the processor can be
programmed to run any task--that is, any application program running on an
operating system or monitor program. Once a processor has been designed and
fabricated, the single processor’s IC can solve multiple problems at different points of
time, by fetching program instructions and data from memory, decoding them to
determine an execution plan, then executing each such instruction, in turn.
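To make the fetch-decode-execute cycle concrete, it can be sketched as a toy instruction interpreter in C; the three-opcode instruction set below is invented purely for illustration and is not part of this thesis's design.

```c
/* Toy von Neumann-style interpreter. Each cycle fetches the instruction
   at the program counter, decodes its opcode, then executes it; the
   three-opcode ISA here is invented for illustration only. */
enum { OP_LOAD, OP_ADD, OP_HALT };

typedef struct { int op; int arg; } Instr;

int run(const Instr *prog) {
    int acc = 0;                    /* accumulator register */
    for (int pc = 0; ; pc++) {      /* fetch instruction at pc */
        Instr i = prog[pc];
        switch (i.op) {             /* decode, then execute */
        case OP_LOAD: acc = i.arg;  break;
        case OP_ADD:  acc += i.arg; break;
        case OP_HALT: return acc;
        }
    }
}
```

For example, running the program {LOAD 5, ADD 7, ADD 8, HALT} through run() returns 20, one instruction per simulated cycle--the serialization that spatial, reconfigurable architectures avoid.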
Reconfigurable computing can also be called "general-purpose", although it uses a
different architecture and supporting application development paradigm for computation.
Unlike a microprocessor, which has its computation as a set of sequential instructions
fetched from system memory, reconfigurable architectures generally compute a function
by configuring functional units and wiring them up in space. This allows a parallel
computation of operators and direct dataflow from the producers of an intermediate result
to the consumers [1, 2].
1.2 Custom Logic Design versus Custom Computing
Application Specific Integrated Circuits (ASICs) could also be used to implement
a design and optimize it to achieve high performance employing spatial architectures.
ASICs, however, are designed using custom logic techniques, creating design artifacts
tailored for a specific application, and thus cannot be reconfigured to perform different
applications. Therefore, although these systems provide high performance through
application-specific optimization, they are not “general purpose”. Another drawback of
ASIC systems is the huge manufacturing cost associated with them.
Reconfigurable systems, on which custom computing systems are built, provide
very good performance and the flexibility to program--and reprogram via reconfiguration
of the logic functionality--the actual device logic, with different applications in time and
space. Additionally, these systems are implemented using FPGA devices that are easily
and quickly programmable by an end-user, are available at affordable prices, and thus
deliver user-defined functionality at a low cost. The performance and logic density of a
single FPGA device have been improving in recent years, leading to more powerful
reconfigurable architectures targetable for a wider range of applications. This has opened
up the use of FPGAs, typically employed in the creation of logic controllers, as
processing elements (PEs) in reconfigurable arrays in applications for high-performance
computing.
1.3 Field Programmable Gate Arrays
In the past few years, the reconfigurable device market has grown considerably
with the availability of a wide range of devices for VLSI systems--one such device being
a Field Programmable Gate Array (FPGA). FPGAs have evolved considerably in the
recent past, with the primary development being the ability to download a bitstream
representing the digital logic functions onto an array of pre-defined arithmetic, logical
and steering resources, so they have become the primary device for building
reconfigurable and adaptive machines. They were originally designed as prototype
devices used for pre-fabrication design emulation. This design activity was employed to
verify the design before fabrication, to avoid the fallout of post-fabrication design error.
The Xilinx FPGA devices we consider in this thesis have a standard architecture,
which is shown in Figure 1 [3].
FPGAs consist of an array of resource types: configurable logic blocks (CLBs),
input/output blocks (IOBs), and programmable interconnect resources. This standard
architecture can be configured, and reconfigured if necessary, by an end user to
implement a particular functionality. The logic blocks are used to implement the required
logic gate and storage elements of the design. The interconnect can be programmed to
appropriately connect the logic blocks to realize a larger functional unit specified for use
by the application.
Figure 1. Architecture of an FPGA Device.
For purposes of consideration in this thesis, the design process of the FPGA device
has the following steps:
1. Model the design using a hardware description language such as VHDL, or through
schematic capture.
2. Synthesize the design to generate a netlist.
3. Map the design to the FPGA logic blocks.
4. Place and route the design, choosing specific logic blocks to use on the FPGA
and allocating the wire segments that interconnect these logic blocks.
5. Download the design as a bitstream onto the target FPGA chip.
Steps 2 through 5 are automated and are performed by an assortment of design tools
generally provided by the FPGA device vendor. Some of the major FPGA device
manufacturers and vendors in the market are Xilinx, Actel, and Altera.
In order to use these devices for reconfigurable computing applications, one has
to deal with a number of FPGA issues so as to effectively implement the design. The
computational requirements of the application must be identified and its mapping to the
FPGA device must be evaluated via estimation. This is no easy task, and there is no
standard method to assemble designs. The FPGA tools, which play a major part in this
process, are being continuously improved by the vendors to be more efficient in their
mapping of design architecture to design resources. The trend is that, over time, the
construction of reconfigurable computing systems on FPGAs will be more like software
programming than the hardware design process for custom VLSI that exists today.
1.4 Application Programming and Design Styles
The process of converting a specification into an implementation on FPGA
devices can be addressed in different ways. Different design styles lead to different
interpretations of the specification—a formal or informal description of the application’s
algorithm. An algorithm can be thought of as a set of processing steps for transforming
data by executing a series of computations [4]. The algorithm needs to be interpreted by
a machine to perform the work. Choosing the elements that make up the machine defines
its architecture, and this necessitates looking at different architectural, or design, styles.
Traditionally there have been two generic architectural styles: the software paradigm and
the hardware paradigm [4]. The software paradigm looks at implementing an algorithm
through use of an instruction code sequence that is interpreted by a microprocessor. In
contrast, using the hardware paradigm, an algorithm is mapped onto storage and
functional units that perform the computation without the use of an intermediate
instruction set.
Under the software paradigm, a program for the algorithm is written in a high
level language such as C/C++, which is compiled into a low-level instruction set for the
processor to execute on an underlying hardware with a fixed architecture. A hardware
implementation would look at implementing the design directly onto a hardware device
through mapping to storage and functional units, avoiding the compile-time and operating
system overhead present in the software paradigm. This can provide considerable speed-
up--on the order of two orders of magnitude--and thus provide a much higher-performance
solution; however, a VLSI hardware application solution generally comes at a higher
cost, since fabricating the implementation on application specific devices is expensive.
At the same time, such application formulations using application-specific VLSI custom
logic are not general purpose, thus necessitating different implementations for different
algorithms. In contrast, the software model would yield a generally lower-performance
solution through the overhead associated with instruction fetch-decode and execute;
however, the solution would generally be cheaper, since microprocessors are mass-
produced, reusable commodity off-the-shelf products, and programming them is not a
difficult task. Furthermore, there are more software-trained professionals who can write
programs on general-purpose processors than there are design engineers who can design
custom-logic VLSI.
FPGAs provide a means to build general-purpose, reconfigurable machines at a
lower cost1. This leads to a new design style that can be referred to as the reconfigurable-
computing paradigm, also referred to as the configurable hardware paradigm [4]. This
paradigm supports the implementation of algorithms by providing the performance
benefits from mapping directly onto a hardware platform at a relatively low cost. Thus, it
would be interesting to look at implementing various applications on reconfigurable
platforms and evaluating their performance as compared to implementations using the
software paradigm.

1 Such fixed cost is referred to as NRE, or non-recurring engineering costs, which are
associated with the specification, design, implementation, mapping, and test of the logic
functions implementing an application on a VLSI device substrate. This is in contrast to
the variable costs associated with the fabrication and production of finished devices,
which is based on the volume of production--itself based on the demand.
Such performance would include conventional notions of latency associated with
carrying out computation, comparing between an application-specific software solution
running on a conventional processor architecture (or even among a collection of
processors, thus distributing the algorithm’s execution across multiple, communicating
processing elements), and also throughput of the architecture to run streaming
computation, if appropriate for the application. However, evaluating performance could
also include comparing the design time of the application—comparing the time to
architect, design, implement and test the application according to the requisite
engineering processes of each paradigm.
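One conventional way to quantify the latency side of such a comparison is to wrap the computation under test in CPU-time measurements; the C sketch below uses the standard clock() facility, with a placeholder workload standing in for the algorithm under test (it is not the UPGMA computation itself).

```c
#include <time.h>

/* Placeholder workload standing in for the algorithm under test:
   sums the integers 0 .. n-1. */
long busy_sum(long n) {
    long s = 0;
    for (long i = 0; i < n; i++)
        s += i;
    return s;
}

/* Return the CPU time, in seconds, spent in one call to busy_sum(),
   passing the computed result back through *result. */
double measure_latency(long n, long *result) {
    clock_t start = clock();
    *result = busy_sum(n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

The same wrapper pattern applies whether the measured region is a software routine or a call into a hardware platform's host API, which is how end-to-end latency comparisons between the two paradigms are typically made.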
1.5 Thesis Proposal
The reconfigurable computing paradigm--and the predominant FPGA device
architecture on which such applications can be built--offers us a good medium for
implementing complex computational tasks having high throughput, low latency
requirements. Many computational tasks spread over a range of application domains
have been implemented and evaluated on reconfigurable computing systems [5, 6, 7, 8,
9]. However, different aspects of application architecture and performance must continue
to be explored, while many new and novel computational problems must be implemented
using reconfigurable custom computing machines before a general understanding of the
characteristics of the reconfigurable computing paradigm can be obtained. This would
provide a wider set of configurable computing solutions, as well as patterns for mapping
between high-level problem-solving architecture and lower-level device architectures,
which can be used to assess the cost/benefit ratio for effective and optimal
implementation of more general programming problems on reconfigurable platforms.
Our research thesis involves examining one such data point in the space of
possible application solutions where high-performance computing using reconfigurable
hardware is required for operating on ever-growing data sets. Namely, we are looking at
the Phylogenetics domain that provides us with a rich set of algorithms that can be
studied to see if they can be implemented efficiently on reconfigurable computing
machines to provide orders of magnitude speedup in the algorithm execution over that
available on standard Von Neumann processor architectures programmed using
conventional programming techniques. In this Bioinformatics domain, the Unweighted
Pair-Group Method with Arithmetic means (UPGMA) algorithm used for phylogenetic
tree-reconstruction purposes has certain computational complexity that makes it an
application of specific interest. Furthermore, its existing software implementation is
understood to be highly optimized; that is, it cannot be tuned much further to achieve
significant additional speedup in performance.
1.5.1 Thesis Research Objective and Tasks
It is therefore our objective to explore the space of possible architectures in
custom reconfigurable logic, using FPGA devices as an implementation medium, and
also using conventional custom-logic design processes, to implement a different
“rendition” of the UPGMA algorithm and measure the performance difference. Although
the time complexity of the algorithm is unlikely to change as a result of its
implementation in FPGA custom-logic hardware, we believe that the use of custom-logic
VLSI hardware design techniques should yield up to a two-order-of-magnitude
improvement in the execution speed of the UPGMA algorithm over that employed in the
PHYLIP program written by Felsenstein et al. [10].
The tasks involved in exploring this thesis research work are defined as follows:
1) Select the UPGMA algorithm [11], which performs phylogenetic analysis by building
an evolutionary tree, as our problem domain.
2) Identify and analyze the various complex computational tasks and bottlenecks.
3) Evaluate the issues that we need to address in implementing this algorithm on a
reconfigurable custom-logic architecture.
4) Address various FPGA issues while developing a hardware architecture for the
particular problem algorithm at hand.
5) Implement this design on a FPGA-based reconfigurable architecture and device
platform.
6) Evaluate its performance by measuring its throughput with an increase in the
number of taxa, and benchmark these results against those obtained from a
software program (Felsenstein’s PHYLIP) executing on a conventional CPU-
based system.
The Annapolis WILDCARD™ system has been chosen as the target reconfigurable
platform. The WILDCARD™ FPGA board has a Xilinx Virtex® XCV300E2 as its
processing element, along with two 256K-byte memory units and external I/O
connections. This reconfigurable computing platform was chosen primarily based on
cost and the availability of a reasonable set of platform development tools.
2 Xilinx and Virtex are registered trademarks of Xilinx Inc.
Thus, this thesis will attempt to modify the upper bound of the time complexity,
corresponding to a modification of the time constant associated with the complexity
function for the UPGMA algorithm to achieve orders-of-magnitude speedup, while also
contending with the space complexity associated with the limited amount of device
resources available on the Wildcard platform. In addition, given that we will be moving
data to and from the main computer in which the WILDCARD™ sits, and the
WILDCARD™ PCI/PCMCIA board itself, we will be required to assess the penalties
associated with the communication overhead—with the objective of minimizing this as
much as possible.
CHAPTER 2
BACKGROUND
In this chapter, we provide background on the application domain associated with
the UPGMA algorithm and its context in the space of Bioinformatics computational
problem solving. We also discuss the FPGA device technology, which constitutes the
platform on which we will create a reconfigurable computing solution for the UPGMA
problem.
2.1 Phylogenetics and Tree-reconstruction Methods
The study of the relationships between groups of organisms is called taxonomy,
an ancient and venerable branch of classical biology. The branch of taxonomy that deals
with numerical data such as DNA sequences is known as phylogenetics. Biological
systematists who wanted to reconstruct evolutionary genealogies of species based on
morphological similarities originally developed phylogenetic analysis. The results of
phylogenetic analysis may be depicted as a hierarchical branching diagram, a
"cladogram" or "phylogenetic tree" as shown in Figure 2 [12].
Figure 2: A phylogenetic tree showing a relationship between four species.
2.1.1 Background on trees
The tree represents the genealogical evolution of the different species, linking
them through a certain set of similarities and differences. Similarities and differences
between organisms can be coded as a set of characters, each with two or more alternative
character states. In an alignment of DNA sequences, for example, each aligned site is a
separate character, each with four character states, the four nucleotides being adenine,
thymine, cytosine, and guanine.
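Because each site has only four possible character states, one nucleotide fits in two bits, a representation that is often convenient when mapping sequence data into hardware memories. The C sketch below illustrates one such packing; the particular A=0, C=1, G=2, T=3 assignment is an arbitrary convention for illustration, not necessarily the encoding used in this design.

```c
/* Map one nucleotide character to a 2-bit code; -1 flags an invalid
   input. The A=0, C=1, G=2, T=3 assignment is an arbitrary convention. */
int encode_base(char b) {
    switch (b) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Pack a short DNA string (at most 16 bases, assumed valid) into a
   single 32-bit word, two bits per site, lowest-order bits first. */
unsigned pack_sites(const char *seq) {
    unsigned word = 0;
    for (int i = 0; seq[i] != '\0'; i++)
        word |= (unsigned)encode_base(seq[i]) << (2 * i);
    return word;
}
```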
All the trees are assumed to be binary, meaning that each node branches into two
daughter edges as shown in Figure 2. The edges meet at a branch node, a node being an
endpoint of an edge. Each edge has a certain amount of evolutionary divergence
associated with it, quantified by some distance between sequences. These distances are
referred to as ‘edge lengths’ or ‘branch lengths’. Terminal nodes or leaves correspond to
the observed sequences that might connect up to an ultimate ancestor or ‘root’ of the tree.
A true biological phylogeny has a ‘root’ but only some phylogenetic algorithms provide
information about the location of the root.
For a specific set of n leaves, the nodes and edges of a tree can be counted as
follows: there are (n-1) internal nodes in addition to the n leaves, giving a total of
(2n-1) nodes and one fewer edge, that is (2n-2), discounting the edge above the
root node.
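This counting rule can be checked with a short sketch (illustrative Python, not part of the design flow described in this thesis):

```python
def tree_counts(n):
    """Node and edge counts for a rooted binary tree with n leaves.

    There are (n - 1) internal nodes in addition to the n leaves,
    giving 2n - 1 nodes in total; every node except the root has
    exactly one edge above it, giving 2n - 2 edges."""
    total_nodes = n + (n - 1)       # leaves plus internal nodes
    edges = total_nodes - 1         # discounting the edge above the root
    return total_nodes, edges
```

For the four-species tree of Figure 2, this gives 7 nodes and 6 edges.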
2.1.2 Phylogenetic Algorithms
Phylogenetic algorithms cover three main classes of problems [13]: (1)
parsimony, which is like a vertex coloring problem of graph theory; (2) distance
methods, which aim to find a tree whose path distance matches closely to observed
distances; and, (3) likelihood methods, where the likelihood of the data is
calculated using Markov transition matrices. Each approach possesses certain problems
in terms of the computational bottlenecks that occur.
The advantages of putting a phylogenetic algorithm onto a reconfigurable custom
computing platform include the following: (1) eliminating intervening levels of
software, such as operating systems, that slow down the execution of the code;
and, (2) parallelizing or pipelining the algorithm functions by exploiting the natural
capabilities of custom-logic architecture and design. The latter provides far more work
per cycle than code written in a native instruction set on a general-purpose
microprocessor. As discussed earlier, we believe a speedup of up to two orders of
magnitude should be possible with this approach. Furthermore, bottlenecks within
the algorithms could be avoided by exploiting the underlying hardware resources of
reconfigurable machines to optimize specific parts of the algorithm's execution in ways
that general-purpose machines cannot.
We select a particular phylogenetic distance-method algorithm for this
research, namely the UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
algorithm, whose computational complexities are described below.
2.2 The UPGMA
UPGMA has relevance beyond phylogenetics, since it is a hierarchical clustering
method that is both fast and useful with gene-expression or micro-array data. The
algorithm’s running time complexity is evaluated and compared with that of the
hardware implementation, and the results are presented in Chapter 6. The value of N,
the number of items to be clustered, is typically around 10,000 to 50,000 in micro-array
applications. Thus, even though software-based phylogenetics applications run this
method in about 1 second for N = 100 [15], the running time increases by a factor of
perhaps 10,000 in micro-array applications even before we consider memory bottlenecks.
This last factor causes considerable problems, since memory usage also scales as O(N2).
Thus, this problem might take days
to complete with larger taxa data sets. This algorithm is well understood [11, 14, 15], and
the software solutions have reached a level of optimization beyond which minimal
performance improvement can be obtained. Thus UPGMA is an appropriate candidate
for exploring an implementation on a reconfigurable platform using custom-logic
architecture and design techniques.
2.2.1 Algorithm
We define the distance between two clusters Ci and Cj to be the average distance
between pairs of sequences from each cluster:
dij = (1/(|Ci||Cj|)) Σp∈Ci Σq∈Cj dpq (1)
where |Ci| and |Cj| denote the number of sequences in clusters Ci and Cj, respectively,
and p and q range over the sequences in Ci and Cj, respectively. If Ck is the
union of clusters Ci and Cj, and if Cl is another cluster, then
dkl = (dil|Ci| + djl|Cj|) / (|Ci| + |Cj|) (2)
This is the average-distance calculation for obtaining the distance of the new
cluster Ck to any other cluster Cl.
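The update amounts to taking the size-weighted mean of the two old distances from Ci and Cj to Cl. An illustrative Python sketch (function and parameter names are ours, written size_i and size_j for |Ci| and |Cj|):

```python
def merged_distance(d_il, d_jl, size_i, size_j):
    """UPGMA update: distance from the merged cluster Ck = Ci U Cj
    to another cluster Cl, i.e. the size-weighted mean of d_il and d_jl."""
    return (d_il * size_i + d_jl * size_j) / (size_i + size_j)
```

For example, merging a cluster of three sequences at distance 2.0 from Cl with a singleton at distance 4.0 gives a merged distance of 2.5.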
The distances are represented in the form of a matrix, given below in Figure 3, with
each row or column corresponding to one node. The distance between nodes i and j
is held in position [i, j] of the matrix, so D[i, j] denotes the distance
between nodes i and j.
Figure 3. Distance Matrix
D[i, i] is not a valid distance, since the distance from a node to itself is not
meaningful here. These entries are therefore marked as “x” in the matrix.
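Since the matrix is symmetric and the diagonal is unused, an implementation need only store the strict upper triangle. One possible index mapping is sketched below (illustrative Python; this is not necessarily the memory layout used in our design):

```python
def tri_index(i, j, n):
    """Map a pair (i, j), i != j, into a linear index over the strict
    upper triangle of an n x n symmetric matrix (row-major order,
    diagonal skipped)."""
    if i > j:
        i, j = j, i                      # D is symmetric
    assert i != j, "D[i, i] is not a valid distance"
    # elements in rows above row i: (n-1) + (n-2) + ... + (n-i)
    return i * n - i * (i + 1) // 2 + (j - i - 1)
```

For n = 4 this enumerates the six valid pairs (0,1), (0,2), (0,3), (1,2), (1,3), (2,3) as indices 0 through 5.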
The steps of the UPGMA algorithm are given below [14]:
1. Initialization:
a. Assign each sequence i to its own cluster Ci,
b. Define one leaf of the tree T for each sequence, and place it at height zero
2. Iteration:
a. Determine the two clusters i, j for which dij is minimal. (If there are
several equidistant pairs, pick one at random.)
b. Define a new cluster k by Ck = Ci U Cj, and define dkl for all l by (2).
c. Define a node k with daughter nodes i and j, and place it at height
dij/2.
d. Add k to the current clusters and remove i and j.
3. Termination:
a. When only two clusters i, j remain, place the root at height dij/2.
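The steps above can be sketched in software as follows (an illustrative Python sketch, not the VHDL design; the merge update is the size-weighted mean of equation (2), and cluster bookkeeping is simplified):

```python
import itertools

def upgma(d, n):
    """Minimal UPGMA sketch following steps 1-3 above.

    `d[(i, j)]` with i < j gives the distance between initial
    sequences 0..n-1.  Returns (children, height): children[k] is the
    pair of daughter nodes of internal node k, and height[k] is the
    node's placement height d_ij / 2."""
    d = dict(d)                          # local working copy
    clusters = list(range(n))            # active cluster ids
    size = {i: 1 for i in range(n)}
    height = {i: 0.0 for i in range(n)}
    children = {}
    nxt = n                              # id of the next internal node

    def key(a, b):
        return (a, b) if a < b else (b, a)

    while len(clusters) > 1:
        # Step 2a: pick the pair (i, j) with minimal distance
        i, j = min(itertools.combinations(clusters, 2),
                   key=lambda p: d[key(*p)])
        dij = d[key(i, j)]
        # Steps 2b-2d: form cluster k and update distances by equation (2)
        k = nxt
        nxt += 1
        for l in clusters:
            if l not in (i, j):
                d[key(k, l)] = (d[key(i, l)] * size[i]
                                + d[key(j, l)] * size[j]) / (size[i] + size[j])
        size[k] = size[i] + size[j]
        height[k] = dij / 2.0            # place node k at height d_ij / 2
        children[k] = (i, j)
        clusters = [c for c in clusters if c not in (i, j)] + [k]
    return children, height
```

The final iteration joins the last two clusters, which places the root at height dij/2 exactly as step 3 requires.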
2.2.2 Complexity and Bottlenecks on UPGMA
We believe the UPGMA algorithm has two bottlenecks. The first is in deciding
which of the N(N-1)/2 pairwise distances is minimal at each step of the star-
decomposition clustering. Following this, the data matrix is reduced by dimension 1, due
to clustering of two objects. This introduces the second bottleneck, the need to calculate
an average distance between the two objects (i and j) as a single cluster (k) and all other
objects. This involves complex computational units that are costly on general-purpose
microprocessors, but which we believe can be implemented efficiently on a reconfigurable
custom-logic FPGA device, giving better performance results.
This research examines the function of the UPGMA algorithm, implementing it as
a custom logic architecture. The standard HDL-based design methodology is employed
in that we model the algorithm using the VHDL hardware description language, we
functionally verify the algorithm’s correctness in the custom logic architecture, and then
we synthesize the architecture onto a set of resources to produce a circuit mapped to a
target FPGA device’s component library. The resulting circuit is implemented on a
Xilinx Virtex E® FPGA device and is subjected to functional and performance analysis.
However, before we can present the research method undertaken in this effort
(including the analysis, architecture and design of the circuit implementing the UPGMA
algorithm), we must discuss the characteristics of FPGA devices and their use in
reconfigurable computing that give this research a high chance of success.
2.3 Field Programmable Gate Arrays
The evolution of FPGA devices is evidenced by great strides in the
underlying technology: effective logic gate counts in the millions, and the
ability to download and alter the logic via a programmable bitstream while the FPGA
device is in operation, to name a few. Several companies have been developing high-
performance, high-capacity FPGA devices, targeting larger applications such as those
associated with scientific computing. FPGA vendors such as Xilinx, Actel and Altera,
the largest producers of these devices, hold a leadership position in the market.
Our reconfigurable platform, the Annapolis Wildcard® system, uses a Xilinx® XCV300E
device. The Xilinx FPGA devices have a standard set of device architecture features,
similar to the one shown in Figure 1 in the previous chapter. We describe the
architecture for the Xilinx XCV300E device below.
Figure 4. Structure of the Xilinx® XCV300E device.[16]
Figure 4 provides an architectural overview of the XCV300E device. There are three
main components in the device: (1) the Input/Output Blocks (IOBs); (2) the
Configurable Logic Blocks (CLBs), including the block RAM (BRAM)
memory structures; and, (3) the Programmable Routing Matrix.
2.3.1 Input Output Blocks (IOBs)
The input and output blocks on the device provide an interface between the input
and output pins and the Configurable Logic Blocks (CLBs). The architecture for these
blocks is given in Figure 5. These blocks provide three storage elements that can be used
either as edge-triggered D flip-flops or as level sensitive latches.
Figure 5. Virtex E Input Output Block architecture [16]
2.3.2 Configurable Logic Blocks (CLBs)
The Configurable Logic Blocks provide the functional elements for implementing
logic. The basic building block of the CLB is the Logic Cell (LC). Each Virtex-E CLB
consists of four LCs. The LC consists of a 4-input function generator, carry logic, and a
storage element. The output of the function generator in each LC drives both the CLB
output and the D input of the flip-flop. The architecture for a Virtex-E CLB is given in
Figure 6. The four LCs are organized as two identical slices as shown in the figure.
Figure 6. A Two-Slice Virtex E CLB [16]
The function generators are implemented using Look-Up Tables (LUTs) that can
also be configured as 16x1-bit synchronous RAM. The two LUTs in a slice can be
combined to create a 16x2-bit or 32x1-bit synchronous RAM, or a 16x1 dual-port
synchronous RAM element.
2.3.3 Programmable Routing Matrix
The Virtex-E consists of a General Routing Matrix (GRM) that connects the
CLBs together to implement the logic chains. The GRM comprises an array of routing
switches located at the intersection of the horizontal and vertical routing channels. Each
CLB also has local routing resources through which it connects to the GRM. These local
and global routing resources can be programmed to generate the best routing for the
design being configured onto the device. The Xilinx configuration tools take care of
placing and routing the design onto the device’s resources, subject to user-specified
constraints.
2.3.4 Resources on a Virtex-E chip
The Xilinx Virtex-E resources and their numbers are given below in Table 1:
Resource                        Number
CLBs                            1536
Slices                          3072
LUTs                            6144
Flip-flops                      6144
Block RAMs (256x16-bit)         32
Block RAMs (256x32-bit)         16
Block RAM bits                  131072
Table 1: Virtex-E chip resources
Each CLB has two slices, and there are two LUTs and two flip-flops per slice.
The Block RAM allocations are based on how the LUTs are configured. If they are
configured as two 16x1-bit RAMs, then 32 of the 256x16-bit block RAMs can be
implemented on the device. If two LUTs are configured to form a 32x1-bit RAM,
then 16 of the 256x32-bit RAMs can be implemented on the device.
The Xilinx Virtex-E data manual [16] provides a detailed description of the device
architecture along with pin definitions and electrical characteristics. The Virtex-E device,
with its full complement of resources, provides the designer with a total of 411,955
CMOS transistor gates. This device can thus be used to implement reasonably sizable
designs running at moderately high clock speeds.
2.4 Reconfigurable Computing
The constant improvement in FPGA device density and performance has
prompted many to look at using these devices for implementing high-performance
computing applications. The traditional advantages these devices provide are that they
can be configured and reconfigured easily, and at little extra cost (except when
reconfiguring during application runtime), under direct host program control. The increase in gate
count and speed of these devices has also made them an appropriate target for building
high-performance, custom computing machines. These machines, also referred to as
reconfigurable computing machines, provide flexibility to program and reprogram
systems and at the same time provide high performance computing at a relatively low
cost when compared to price-performance models of other high-performance platforms,
such as supercomputers [2, 5, 6, 7, 8, 9]. Several computing platforms consisting of
arrays of FPGA devices have been developed through research and experimentation and
are currently commercially available in the market.
The DEC Paris Research Laboratory’s Programmable Active Memories (PAM)
project was one of the pioneers of reconfigurable computing [1]. The PRL team
implemented the RSA encryption algorithm at speeds never before achieved,
outperforming supercomputers and even custom discrete-IC implementations of that time.
SPLASH and Splash 2 are two other reconfigurable architectures developed in the
early nineties—Splash 2 being an upgrade of the original SPLASH architecture [17]. The
Splash 2 consisted of 16 printed circuit boards, each carrying 17 Xilinx XC4000-series
FPGA chips. Each XC4000 had its own memory banks, to which it could
independently read and write. A number of high-performance scientific applications
were implemented on Splash 2, in domains such as gene sequence matching,
fingerprint matching and image processing, at speeds two orders of magnitude greater
than those of the fastest supercomputers at that time [17].
Several companies have brought commercial platforms to market over the past
few years, attempting to exploit this new computing model. Annapolis Microsystems
[18], SRC Computers Inc [19], and Star Bridge Systems [20] are three of the most
prominent players in this market. These new reconfigurable architectures are being
marketed as platforms that can be used for implementing a wide range of applications
from different domains. The Annapolis WILDCARDTM reconfigurable platform that we
are using in our research is one of these, albeit a low-end version.
The research described in this thesis examines the architecture, design and
implementation of the computationally intensive UPGMA algorithm on a low-end
reconfigurable platform and evaluates the performance as contrasted with that obtained
by an implementation of the same algorithm using conventional software program
execution on a standard Intel® CPU-based personal computer.
As discussed, we have chosen the UPGMA phylogenetics algorithm as the
application domain in which we will explore the architecture and design space, and
subsequent performance differences, of applying the reconfigurable computing paradigm
to this scientific computing problem. Phylogenetics provides us with a rich variety of
problems with complex computational tasks that can be studied to see if they can be
implemented on reconfigurable machines. Furthermore, the software domain has already
been thoroughly explored, and few performance gains can be realized from further
software optimization of the UPGMA algorithm in particular.
With this rationale clearly in mind, we progress to our discussion of the problem-
solving and analysis of the UPGMA domain to derive a suitable high-level architecture
with which to implement the algorithm. In addition, we will need to weigh our
architecture against the resource and timing constraints of the underlying Xilinx device
and the WILDCARDTM platform (including its mechanisms for interfacing with the
PC-based host system in which it resides).
CHAPTER 3
DISCUSSION OF THE WILDCARD CUSTOM COMPUTING PLATFORM
In this chapter, we discuss the reconfigurable computing platform available to us
for purposes of this research. We had to analyze this platform thoroughly in order to
understand its operating environment, its programming model, and its key features and
constraints. All of this was required before devising an architecture for our UPGMA
solution, because any such architecture would be constrained both by the resources of
the Xilinx device resident on the WILDCARD and by the programming model and
execution environment provided by the vendor for realizing our solution.
3.1 The Annapolis WILDCARDTM System
The WILDCARDTM system comes as a PC card that plugs into a PCI/PCMCIA
card slot adapter, making it a very portable low-end reconfigurable platform. It has a
very compact architecture, with a single Xilinx Virtex XCV300E processing element
(PE) and two independent memory modules, one on either side, forming the core of
the system. The architectural block diagram is given in Figure 7 below.
Figure 7.The WILDCARDTM Platform Block Diagram [18]
Each of the two memory blocks, referred to as the Right and Left memory banks,
is a 64K x 32-bit RAM module, with a 19-bit address bus and a 32-bit data word. The PE
can write and read from the right and left memories independently. The host interface is
through a 32-bit CardBus (PCMCIA) controller that operates at a 33 MHz clock
frequency. The CardBus controller interfaces with the PC host through the PCI Bus
interface, and with the PE through the LAD Bus interface.
Data transfers to and from the PC host are performed through a set of C
driver calls that interface with the CardBus controller which, in turn, interfaces with
the LAD Bus to send data to, and retrieve data from, the PE. Data can be written from
the host to the memory through these interfaces by making the C calls
provided by the vendor's Host Application Programming Interface (API).
The PE also has certain input and output pin connections that enable it to connect
to external devices. These pins are useful when the application program must
communicate with an external device.
The WILDCARDTM board has a frequency synthesizer that generates one main
global clock signal, F_Clk, which is used to derive three other global clocks, namely
P_Clk, M_Clk, and K_Clk. The user can set the frequency of F_Clk using a C
routine call from the host. P_Clk is the PE clock pad signal, and runs at half the
frequency set for F_Clk by the user. M_Clk is the memory clock pad signal and
operates at the same frequency that is set by the user for F_Clk. Finally, K_Clk is the
CardBus/LAD Bus clock pad signal, which always operates at 33 MHz.
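The clock relationships described above can be summarized as follows (an illustrative Python sketch based solely on the relationships stated here; frequencies are in MHz, and the synthesizer's actual frequency limits are given in the vendor documentation):

```python
def derived_clocks(f_clk_mhz):
    """Derive the WILDCARD global clocks from the user-set F_Clk."""
    return {
        "P_Clk": f_clk_mhz / 2.0,   # PE clock: half of F_Clk
        "M_Clk": f_clk_mhz,         # memory clock: same as F_Clk
        "K_Clk": 33.0,              # CardBus/LAD Bus clock: fixed 33 MHz
    }
```

For example, setting F_Clk to 66 MHz yields a 33 MHz PE clock and a 66 MHz memory clock.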
3.2 The WILDCARDTM System VHDL model
The WILDCARDTM system software package provides VHDL models for the
whole board that can be used to create a VHDL-based program model and to implement
and debug the whole reconfigurable application design. The VHDL model of the system
also contains a simulation model of the host system that is used for testing the application
from the perspective of custom computing hardware-software co-design.
The VHDL model provides interface components that are used to access all the
components on the WILDCARDTM system. There are two types of interface components,
namely, the Standard Interfaces, and the Mux Interfaces [18].
Standard interfaces are simple interfaces to the devices (PE, memories) on the
system and can be used for low-level, specifically tuned applications. The Mux (or
multiplexing) interfaces can be used for programming at a higher level between
the LAD Bus, memory and the PE components. Both of these interfaces allow multiple
user application components to share a single resource (such as the LAD Bus or the PE’s
memory banks).
The development environment provides VHDL models for the following platform
components, which are used for early model integration, hardware-software partitioning
analysis, and functional verification and clock cycle-level timing analysis.
Processing element (PE)
Right Memory Bank
Left Memory Bank
Host
Clock Generation
Input and Output Connectors
PCI controllers
The PE VHDL model is a standard VHDL entity-architecture pair. The entity defines
the input and output pads of the PE device. The pad numbers are logical and do not
match the physical pin numbers of the FPGA die. This entity definition is fixed and
is used as a template for the physical PE while creating the application design. In our
preparation of these components for exploring the space of possible architectures for the
UPGMA algorithm, we do the following: the PE architecture template is modified to
embed the application design within it. Furthermore, the Standard and
Mux interface models are used to connect the application design
to the LAD Bus and the memory banks. Finally, this allows us to take the
resultant composite PE model and synthesize the PE image for actually configuring the
WILDCARD device.
The VHDL models provided for the memory banks, host, clock generation, I/O
connectors and the PCI controllers are purely for simulation purposes. These models are
used within the WILDCARDTM simulation model and encapsulate the system’s
functionality for use in VHDL simulation, enabling us to functionally verify the PE
designs, as well as validate the correctness of the UPGMA algorithm, before synthesizing
the actual design units and placing and routing them onto the Virtex device resident on
the WILDCARD.
3.3 WILDCARDTM Host Programming
The WILDCARDTM system is composed of three main components, listed as
follows: (1) the WILDCARDTM board; (2) the WILDCARDTM device driver; and, (3) the
Host Application Programming Interface (API). The WILDCARDTM Software Design
Hierarchy is given in Figure 8. Host programming is done in the C language using the
standard Host API routines to communicate with the WILDCARDTM board through the
Windows®-based device driver.
The device driver provides a low-level hardware interface to the WILDCARDTM
board. When the driver is invoked with the appropriate function codes, it
initializes the WILDCARDTM in a sequence of steps, reading its configuration and
establishing handler interfaces for memory, interrupts and DMA operations. The
WILDCARDTM API presents a generalized view of the hardware resources and control
operations. The following operations are performed by calling the API routines:
Opening and Closing the WILDCARDTM board
Clock control (frequency)
Processing Element control (program, reset, register space)
Memory Interfaces (read/write)
Interrupt control (PE/FIFO enable/disable)
The C function routines for each of the above operations are discussed below.
Figure 8. The WILDCARDTM Software Design Hierarchy [18].
3.3.1 Opening and Closing the WILDCARDTM board
The host program first makes an “open” call to the WILDCARDTM board before
performing any other operations. This initializes the device driver, which, in turn,
initiates the interface handlers for access to the board components. The C routine for this
is WC_Open( ). The counterpart for the WC_Open( ) function is the WC_Close( ). For
every WC_Open( ) there should be a corresponding WC_Close( ) function call to ensure
a clear disconnect and proper de-allocation of resources.
3.3.2 Clock Control
The only clock control operation a host program can perform is setting the
frequency of F_Clk. The function call for this is WC_SetClkFrequency( ). The
programmable clock module allows user programs to change the clock frequency
anytime by calling this routine.
3.3.3 Processing Element and Interrupt Control
The four main operations that are executed against the Processing Element (PE)
are: (1) the PE Reset; (2) the PE Program; (3) the PE Register space read and write;
and, (4) the PE Interrupt control.
The PE Reset operation is used to reset the PE and the embedded application
residing on it. The PE program function calls are used to program the PE device with the
user-designed application. There are two function calls: (a) PE_ProgramFromBuffer( ),
which is used to program the PE from a user buffer space; and, (b) PE_Program( ), which
is used to program the PE from a file.
The PE has a certain register space to which we can read and write. The
register space has an address range of 0x04000 to 0x0FFFF. The two function calls to
read and write to this register space are: (a) PE_RegRead( ) - reads from the PE register
space locations; and (b) PE_RegWrite( ) – writes to the PE register space locations.
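A simple bounds check on this address range can be sketched as follows (illustrative Python; the constant names are ours, not part of the vendor API):

```python
PE_REG_BASE = 0x04000   # first PE register address, per the text above
PE_REG_TOP = 0x0FFFF    # last PE register address

def in_pe_register_space(addr):
    """True if `addr` falls inside the PE register address range."""
    return PE_REG_BASE <= addr <= PE_REG_TOP
```

The range spans 0xC000 (49,152) addresses in total.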
When a PE interrupt occurs, the device driver immediately masks the PE interrupt line
and informs the calling program suspended on the API call that an interrupt has
occurred. The interrupt control is done using the following functions [18]:
WC_IntQueryStatus( ) – checks the status of the PE interrupt line via polling.
This is useful when the host program is written to do other operations while
waiting for the interrupt.
WC_IntWait( ) – waits for the PE interrupts; useful when the host only needs to
wait for the PE interrupt before proceeding to perform anything else. The calling
program is suspended.
WC_IntReset( ) – after the host program has processed the interrupt, it can reset it
and clear the API’s indication of a pending interrupt.
3.3.4 Memory Control
There are two main memory control API calls that are made by the host C application
program. They are: (1) WC_MemRead( ), which reads from the right or the left memory
SRAM banks; and, (2) WC_MemWrite( ), which writes to the right or the left memory
SRAM banks. The calling arguments include the memory bank identifier, the base
address and size of the block of data (that is, the number of DWORDS) to be written or
read [17]. The function calls invoke the device driver that, in turn, manages handlers to
transfer data over the PCI bus and the CardBus/LAD Bus interface to the
WILDCARDTM.
3.4 PE Embedded Application Initialization
The host program must proceed through a set of steps for initializing the application
in the PE device before the reconfigurable application can be started. The steps are as
follows:
Open the WILDCARDTM board;
Initialize the clock by setting it to a particular frequency;
Enable the PE Reset line – ensures the PE is reset at least once when the clock
starts;
Disable and clear any pending interrupts left hanging by any previous
applications;
Load the PE image by calling the PE_program API routine;
Execute any additional initialization tasks as necessary, such as for enabling PE
interrupts; and,
Disable the reset lines to allow normal operation of the PE.
Now the downloaded UPGMA application is running on the WILDCARDTM
processing element, and the host program can start its portion of the application
processing activity, which consists of transferring taxa and phylogenetic tree data to and
from the offloaded UPGMA algorithm application.
CHAPTER 4
CUSTOM COMPUTING DESIGN OF UPGMA
In this chapter, we present the design methodology employed to create the custom
computing application that offloads the UPGMA algorithm from the Host onto the
WILDCARD board for accelerated processing of taxa data.
4.1 VLSI Design Flow
New VLSI design methodologies have emerged every few years. Hardware
description languages (HDLs) and EDA tools have made it possible to design VLSI
systems at higher levels of abstraction, giving chip designers the capability of describing
the functionality of a design at a more abstract level of representation than the gate level.
VHDL and Verilog are the two HDLs most widely used in industry for hardware
modeling and implementation.
Figure 9 shows a HDL-based design process model representing the various design
activities. A brief description of each of the activities in the design process model is given
below:
System specification is an activity to abstract design information from a problem
statement and to define the interface and timing waveforms of the system.
System partitioning is an activity to hierarchically decompose a system to handle
complexity based on system specification, design resource, and feasibility of
implementation. Components at the final hierarchical level should facilitate
behavioral modeling using HDLs or allow for HDL component reuse. The output of
this activity is a valid system partition.
Figure 9. An HDL-based design process model[21]
Modeling or adaptation involves capturing a design component in HDL with
high-level timing information and data dependencies or adapting reusable HDL
components from a design library.
Component simulation verifies functional behavior and high-level timing of each
component using HDL test benches or cycle-based simulation techniques.
System binding is structural integration of simulated components based on the
system partition. This activity produces a system model for verifying system
specification.
System simulation verifies system behavior and timing using HDL test benches or
cycle based simulation techniques. This activity produces simulation results that
can be verified with system specification.
Logic synthesis is the activity of obtaining a gate-level netlist using an automated
synthesis tool. It involves removal of timing information and non-synthesizable
constructs, technology mapping, and definition of area and timing constraints.
The target ASIC library and constraints are chosen to comply with the system
specification.
4.2 UPGMA Project Design Flow
The design methodology described above forms the basis for building the
application design. The UPGMA algorithm undertaken for this project forms the
problem statement. It was analyzed and a system specification generated. The design was
partitioned into easily manageable blocks and modeling of each of these modules was
done in VHDL. The top-level hierarchical model is structural and merely connects the
different sub-modules to form the final top-level design. We have employed a certain
amount of adaptation by reusing Xilinx cores to implement certain modules in the design.
This was done mainly to ensure that the design used no more resources than were available
in the Virtex XCV300E chip of the WILDCARDTM board.
Component simulation was conducted on each of the sub-modules to verify
their functionality. This step eases the final system-level simulation process, as errors
within the sub-modules have been removed by then. The final top-level structural model
was then written, and system simulation was conducted to verify the functionality of the
design as a whole. The ModelSim simulation environment was used for debugging and
testing purposes.
Logic synthesis was conducted mainly for identifying the critical paths and
finding the resource usage of the design. The Synplify Pro® 7.3 FPGA synthesis tool was
used for this purpose. The synthesized gate level netlist does not entirely form the final
design being implemented on the Processing Element (PE) of the WILDCARDTM system.
The functionally verified design was embedded within the PE VHDL architecture model.
The PE model was then placed within the WILDCARDTM system simulation model and
the final functional testing was conducted. The verified PE model was then synthesized
and the EDIF netlist generated. The EDIF netlist was then placed and routed using a
make file provided by the WILDCARDTM system. The make file invokes the
Xilinx place-and-route tools to generate the final PE image that is used to configure the
device. This image is then used to proceed with the WILDCARDTM host programming
process described in Chapter 3.
4.3 UPGMA Design
The design architecture was formulated keeping in mind the parameters that
govern the data sizes upon which the design operates. The data bit-width
constrains the bit widths of registers, the bit widths of datapath elements, and the memory
requirements.
We first look at defining these parameters and then move towards describing the
design architecture.
4.3.1 Design Parameters
For a taxa size of n:
The number of nodes that form the final tree is n + (n-1) = 2n-1.
The number of distance values that need to be stored is
o (n(n-1) + (n-1)(n-2))/2
o For n nodes there are n(n-1)/2 pairings, thus making the number
of initial nodal distances n(n-1)/2.
o When an internal node is formed, the number of nodes still left to be
connected reduces by 2 for the very first internal node and then by 1
thereafter. Initially there are n external nodes. When the first internal node
is formed by joining two external nodes, the number of nodes left is n-2.
So we need to compute the distance of the new internal node to n-2
different nodes. Thereafter, for each internal node formed we have a
reduction of 1 node, making the number of nodes to be connected n-3,
n-4 and so on. The total number of distances for internal nodes is thus
(n-2) + (n-3) + (n-4) + ... + 1, which is equal to (n-1)(n-2)/2.
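These counts can be verified by replaying the clustering (an illustrative Python sketch): the initial n(n-1)/2 pairwise distances, plus the arithmetic sum (n-2) + (n-3) + ... + 1 = (n-1)(n-2)/2 distances for the internal nodes, give (n-1)^2 values in total.

```python
def distance_counts(n):
    """Count the distance values UPGMA produces for n taxa:
    the initial pairwise distances, plus the distances computed
    for each newly formed internal node."""
    initial = n * (n - 1) // 2
    new = 0
    active = n                 # number of clusters still to be joined
    while active > 2:          # the final merge (the root) adds no distances
        new += active - 2      # distances from the new cluster to the rest
        active -= 1            # two clusters removed, one cluster added
    return initial, new
```

For n = 4 this gives 6 initial distances plus 2 + 1 = 3 internal-node distances, or 9 = (4-1)^2 values in total.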
We define below the data structures used to represent the input, intermediate and
output data. There are four types of data upon which the algorithm operates:
Distance data – A 32-bit data word is used to represent this value. The
distance data is stored in the left memory bank, which has a 32-bit data width;
hence the choice of a 32-bit word for representing the input distance data.
Nodes – The nodes that form the tree are represented as 10 bits each. The choice
of 10 bits was made because the design was initially targeted at a 512-taxa
dataset, which yields at most 1023 nodes.
Heights – Each node has a height associated with it, indicating the number of
leaves beneath it. This value is represented in 16-bit format. A 512-taxa
tree would have a root with the largest height of 512. A 9-bit size was
selected in earlier versions of the model but was changed to 16 bits when the
16-bit Block RAMs were chosen to implement the Height memory that is used to
store the heights of the nodes.
Tree Output Data – This data represents each node in a tree together with its
parent node and the branch length to its parent. The format used is given below:
Node ID – 10 bits | Parent ID – 10 bits | Branch length – 12 bits
This format is used to connect up the nodes while generating the final tree. The
total bit length is 32 bits. The tree output data is stored in the right memory
bank, which has a data word length of 32 bits.
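The packing of the 32-bit tree-output word can be sketched in C as below. The field ordering (node ID in the high bits) is an assumption made for illustration, since the text specifies only the field widths:

```c
#include <stdint.h>

/* Pack a tree-output word: Node ID (10 bits), Parent ID (10 bits),
   Branch length (12 bits) = 32 bits total.
   Field placement is assumed, not taken from the VHDL model. */
uint32_t pack_tree_word(uint32_t node_id, uint32_t parent_id,
                        uint32_t branch_len) {
    return ((node_id & 0x3FFu) << 22) |
           ((parent_id & 0x3FFu) << 12) |
           (branch_len & 0xFFFu);
}

uint32_t node_id_of(uint32_t w)    { return (w >> 22) & 0x3FFu; }
uint32_t parent_id_of(uint32_t w)  { return (w >> 12) & 0x3FFu; }
uint32_t branch_len_of(uint32_t w) { return w & 0xFFFu; }
```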
4.3.2 Design Datapath
The basic datapath of the design, used for calculating the average distances
and for obtaining the minima, is given below in Figure 10. The datapath is broken into
two parts: one used for finding the least distance, or minima, and the other for
calculating the average distances. The datapath on the left of Figure 10, with the
less-than operator, is used to find the minima. It takes the distance value and the
current minima as inputs. The current minima is stored in a register and is fed back
into the comparator. The second datapath, on the right, is used to calculate the
average distance. It has the distance value d_ik and the height of node i as inputs.
The multiplier obtains the product of this height and the distance and sends it to the
adder, which adds the value to an accumulator. The multiplier and adder together
compute the numerator of the average distance equation (2) given in Chapter 2. While
the multiplier and accumulator are computing the numerator, the second adder, which
takes the heights h_i and h_j as inputs, computes the denominator of equation (2).
When these two are computed, the resulting values given below are sent to the divider
to obtain the average distance.
Numerator = d_ik·h_i + d_jk·h_j
Denominator = h_i + h_j
Average distance = (d_ik·h_i + d_jk·h_j) / (h_i + h_j)
Figure 10. Design Datapath
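The datapath's arithmetic can be mirrored in software. The sketch below is an illustrative model rather than the VHDL itself; it uses truncating integer division, as a hardware divider would:

```c
#include <stdint.h>

/* Average distance of a new cluster {i, j} to node k:
   (d_ik*h_i + d_jk*h_j) / (h_i + h_j), with truncating division
   as in the hardware divider. */
uint32_t average_distance(uint32_t d_ik, uint32_t h_i,
                          uint32_t d_jk, uint32_t h_j) {
    uint32_t numerator   = d_ik * h_i + d_jk * h_j; /* multiplier + accumulator */
    uint32_t denominator = h_i + h_j;               /* height adder */
    return numerator / denominator;                 /* divider */
}
```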
4.3.3 Design Architecture
The architecture of the design is given in the block diagram shown in Figure 11. The
architecture shown is the final one created after several design passes. The two main
components that form the backbone of the design are:
The Controller
The Address Generator
Most of the design effort went into modeling these two components, since the
controller executes the UPGMA algorithm and the address generation
forms the core of the algorithm's process. The description of these two components,
along with that of the other sub-components, is given below. We start with the simpler
components and proceed to the more complex ones: the datapath components are
dealt with first, then the control path, and finally the memory modules.
4.3.1 Adder
The Adder is modeled using the simple VHDL “+” operator. It has two data
inputs and one output, each 32 bits wide.
The basic architecture of the 32-bit Adder would be realized using the Ripple-
Carry design. This style of Adder architecture has its carry chain as its critical path;
however, for this bit-width, the tradeoff in area versus speed was not significant enough
to warrant exploring other, more sophisticated Adder architectures.
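A ripple-carry adder can be modeled bit by bit in software to see why the carry chain is the critical path: each bit's sum waits on the carry out of the bit below it. This C sketch is a behavioral illustration, not the VHDL model:

```c
#include <stdint.h>

/* Behavioral model of a 32-bit ripple-carry adder: each full adder
   consumes the carry produced by the previous bit position. */
uint32_t ripple_carry_add(uint32_t a, uint32_t b) {
    uint32_t sum = 0, carry = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t ai = (a >> i) & 1u;
        uint32_t bi = (b >> i) & 1u;
        sum |= (ai ^ bi ^ carry) << i;                   /* sum bit */
        carry = (ai & bi) | (ai & carry) | (bi & carry); /* carry out */
    }
    return sum; /* carry out of bit 31 is dropped, as in a 32-bit result */
}
```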
[Figure 11 block diagram: the Controller and the Address Generator drive the
datapath units (Multiplier, Adder, Height Adder, Comparator, Divider and their
registers), the Height Memory, and the address, read and write signals to the left
and right memory banks.]
Figure 11. Block Diagram of UPGMA Architecture.
4.3.2 Add Register
The Add register stores the output value of the Adder module. The register value
is fed back into the Adder so that these two components work together as an accumulator.
The Adder input is added with the old value stored in the Add register and the cumulative
value is stored into the register. The final value of the Add register forms the numerator
of the Average distance equation given in Chapter 2. The Controller module sets the
enable and clear signals for the register.
4.3.3 Height Adder
The Height Adder is similar in architecture to the basic Adder and has two inputs
and one output, each of which is 16 bits wide. This module is used to add the heights of
the nodes of the tree.
4.3.4 Height Register
The architecture of the Height Register is similar to the Add Register, except that
it is 16-bits wide. It stores the output value of the Height Adder and its value is fed back
into the Height Adder so that they work together as an accumulator. The Height
Register’s final cumulative value forms the denominator in the average distance equation
given in Chapter 2.
4.3.5 Multiplier
The multiplier is a simple 32-bit multiplier modeled using the standard VHDL “*”
operator. The multiplier has two inputs and one output, each 32 bits wide.
4.3.6 Multiplier Register
The Multiplier register stores the Multiplier module’s output. The register
component’s architecture and pin configuration is the same as that for the Add Register.
From the standpoint of the VHDL model, a single generic entity-architecture description
is employed for all the 32-bit register units.
4.3.7 Divider Unit
The Divider was modeled using the shift-subtract division algorithm. The divider
forms the critical path of the design, and the controller coordinates its operation to ensure
that the design runs at the requisite clock rate. The Divider has two 32-bit inputs and has
as outputs a 32-bit quotient and 1-bit “valid” flag.
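The shift-subtract (restoring) division scheme can be sketched in C as below. This mirrors the algorithm's structure, shifting in one dividend bit per step and subtracting when the partial remainder covers the divisor, rather than the exact VHDL:

```c
#include <stdint.h>

/* Restoring shift-subtract division: one quotient bit per iteration.
   Returns the 32-bit quotient; a zero divisor yields all-ones. */
uint32_t shift_subtract_divide(uint32_t dividend, uint32_t divisor) {
    if (divisor == 0)
        return 0xFFFFFFFFu;      /* guard; the hardware would flag this */
    uint64_t remainder = 0;
    uint32_t quotient = 0;
    for (int i = 31; i >= 0; i--) {
        remainder = (remainder << 1) | ((dividend >> i) & 1u); /* shift in bit */
        if (remainder >= divisor) {                            /* trial subtract */
            remainder -= divisor;
            quotient |= 1u << i;                               /* set quotient bit */
        }
    }
    return quotient;
}
```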
4.3.8 Divider Register
The Divider register simply stores the output value of the Divider unit. Its
architecture and pin configuration are similar to that of the Add and Multiplier registers.
The controller handles the enable and clear signals.
4.3.9 Comparator
The Comparator is used to compare the distance values and find the minima. The
comparator is modeled using a simple VHDL relational operator. It compares the
new distance value with the previous minima stored in the Least Distance Register;
when it finds a new minima, it enables the Least Distance Register, the Row Register
and the Column Register.
4.3.10 Least Distance Register
The Least Distance Register stores the current minima while the algorithm
continues to search through all the distances. The register architecture is
similar to that of the generic 32-bit register, with the clear signal being set by the
controller and the enable signal being set by the comparator.
4.3.11 Row and Column Registers
The Row and the Column Registers store the Row and the Column values of the
current minima in the distance matrix. Each distance matrix is accessed through the row
and column values. These two values are used to obtain other distance values while
calculating the average distance value. The comparator sets the enable signal and the
controller sets the clear signal controlling these registers.
4.3.12 Controller
The Controller forms the core of the design, making its behavior one of the most
complex to model. Its operation is based on the processing steps of the UPGMA
algorithm. The steps for the controller are given in Figure 12, below.
Figure 12. The Controller Algorithm.
The three main operations being performed by the controller for every single pass
through the matrix are as follows:
Find the new minima
Compute the Average distance
Reduce the matrix size
The controller accomplishes this by stepping through a set of states and repeating the
process until all the nodes in the tree have been handled.
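One pass of the controller can be sketched behaviorally in C as below. This is an illustrative software model under simplifying assumptions (a full in-memory matrix, an `active` flag per node, and the merged cluster reusing the row node's slot), not a transcription of the controller's state machine:

```c
#include <stdint.h>

#define MAXN 16 /* illustrative bound, not the hardware's taxa limit */

/* One controller pass: find the minimum distance, compute the weighted
   average distances to the new cluster, and reduce the matrix. Uses the
   upper triangle of D; the new cluster reuses slot `row`. */
void upgma_pass(uint32_t D[MAXN][MAXN], uint32_t H[MAXN], int active[MAXN],
                int n, int *row_out, int *col_out) {
    uint32_t least = UINT32_MAX;
    int row = 0, col = 1;
    /* 1. Find the new minima */
    for (int i = 0; i < n; i++) {
        if (!active[i]) continue;
        for (int j = i + 1; j < n; j++)
            if (active[j] && D[i][j] < least) {
                least = D[i][j]; row = i; col = j;
            }
    }
    /* 2. Compute the average distance to every remaining node */
    for (int k = 0; k < n; k++) {
        if (!active[k] || k == row || k == col) continue;
        uint32_t dxk = (row < k) ? D[row][k] : D[k][row];
        uint32_t dyk = (col < k) ? D[col][k] : D[k][col];
        uint32_t avg = (H[row] * dxk + H[col] * dyk) / (H[row] + H[col]);
        if (row < k) D[row][k] = avg; else D[k][row] = avg;
    }
    /* 3. Reduce the matrix size: retire the column node, merge heights */
    H[row] += H[col];
    active[col] = 0;
    *row_out = row; *col_out = col;
}
```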
4.3.13 Counter Units
The counters form some of the sub components of the address generator block.
The counters are used to select the next address to be generated. The generic block
diagram for one of the counters is given below. Each counter counts up until the count
value becomes equal to the compare value (CV), at which point it sets a "great" flag,
indicating that the count has been exceeded, and is reset back to zero. The
controller sets the Increment and Clear signals.
4.3.14 Multiplexers
The multiplexers enable the address generator in selecting the next address. We
have 2:1 and 3:1 multiplexer architectures for this purpose.
4.3.15 Address Generator
The address generator is a module that underwent several design cycles. The
address generation algorithm is not simple when considered from a hardware
perspective. Address generation is performed for two operations in the design
process:
Finding distance minima in an instance of the matrix; and,
Calculating average distance
For finding the minima, the address is generated for fetching the next distance value
from the memory. The address is generated in the format of “row&column”, with the
row and column values concatenated together to represent the actual memory address.
The row and column values represent the row and column of the node-to-node distance
matrix, with each row or column representing a node.
For example, node 1 to node 2 distance can be fetched by concatenating
“0000000001” with “0000000010” to generate the 20-bit address
“00000000010000000010” before actually accessing that memory location. These two
values are obtained by reading a memory that stores the currently active node values.
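The concatenated address generation can be expressed compactly in C; the sketch below reproduces the node 1/node 2 example from the text:

```c
#include <stdint.h>

/* Form the 20-bit "row & column" address by concatenating two
   10-bit node IDs. */
uint32_t concat_address(uint32_t node1, uint32_t node2) {
    return ((node1 & 0x3FFu) << 10) | (node2 & 0x3FFu);
}
```

concat_address(1, 2) yields binary 00000000010000000010, i.e. 1026.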
The earlier version of this memory structure, referred to as "node memory," was
modeled behaviorally and later modified to reduce the size of the module. The earlier
model took more resources than were available in one full Virtex XCV300E chip.
The model was later modified by implementing the node memory using a 256 x 32-bit
Xilinx Block RAM, which uses a single Block RAM resource on the Virtex
XCV300E chip. This reduced the resource consumption significantly, making the
module work much more efficiently.
The Xilinx Block RAM is a dual-port memory structure [16], so that two
different addresses can be written or read, or read and written, in combination at the
same time. This enables us to read two different node values at the same time in order
to generate the concatenated address within one clock cycle. The block diagram of the
dual-port Block RAM is given above.
The counters, discussed earlier, are used to generate the address values, ‘addra’ and
‘addrb’, for selecting the nodes to generate the next address. After all the distances are
read and compared, the minima is stored in the Least Distance Register.
The average distance calculation needs the address to be generated by selecting nodes
that have not yet been joined into a cluster. The nodes that have currently been selected
to form the new cluster are stored in the Row and Column registers. The average distance
equation is given below.
Avg. Distance D(x,y),i = (H_x·D_xi + H_y·D_yi) / (H_x + H_y)
The x, y are the new nodes selected to form the new cluster, i is the node to which
the distance from the new cluster is being calculated, Hx, Hy are the Heights of nodes x
and y respectively, and Dxi, Dyi are the distance of nodes x and y to the node i,
respectively. The address generation for calculating the average distance is done in steps
outlined as follows:
The address for obtaining Dxi is generated first by selecting node i’s value and
concatenating with node x’s value stored in the Row Register;
The address for obtaining Dyi is generated second by selecting node i’s value
again, and concatenating with node y’s value stored in the Column Register;
These two distances are combined, using the equation above, to calculate the
average distance D(x,y),i; and,
This new distance is then stored into the memory for future reference. The new
cluster forms a new node, let us say j, and the average distance calculated
represents the distance of this cluster to node i. Thus, the address for storing this
distance is cluster j's value concatenated with node i's value.
The above four steps are performed for all of the nodes that have not yet been
connected to the tree. After the average distances of all the nodes have been calculated,
the node memory is updated by removing the nodes that have been selected to form the
new cluster. The complexity of the process lies in maintaining the node memory and
in stepping through the selection of its addresses to obtain the next address.
4.3.16 Output Generator
This module selects the output values that form the output tree data written into
the memory. The outputs that this module produces are listed as follows:
Node type – internal or external leaf node;
Node ID- the value of the node;
Parent ID- the value of the parent of the node; and,
Branch distance- the distance of the node to its parent.
4.3.17 Height memory
The Height memory holds the heights of the nodes in the tree. The external, or
leaf, nodes have a height of one, while internal nodes, those with child nodes,
have heights of two or more. The memory architecture in the earlier version of
this design was implemented behaviorally and the post synthesis results yielded high
resource usage and slow performance. Thus, after a design review, this module was
implemented using four 256x16-bit Xilinx block RAMs and additional write and read
logic associated to control the Block RAM accesses. The module now uses only four
Block RAM resources on the Virtex XCV300E chip.
4.3.18 Off-chip Memory Banks
The right and left memory banks on the WILDCARDTM are used for storing the
distance and tree data respectively. The banks are 64Kx32-bit SRAM modules, where
access to them is managed through interface modules provided by the WILDCARDTM
system. These components are available as VHDL models that can be used depending on
the needs of the application. To write to the left and right memories, we use the
interface components provided for these two banks. The application on the PE sends a
read or write request through these interface components. The components allow
multiprocessing, such that multiple applications can read and write to the memory at
the same time. The read and write requests are prioritized, and the designer can choose
the kind of prioritization used. This feature has not been used, as we have only one
application design running within the PE that reads and writes to the memories.
The read and write operations take a certain number of clock cycles to complete.
Figures 13 and 14 give the timing diagrams for reads from and writes to the memory.
The read cycle is such that the first data word arrives 4 clock cycles after the read
signal is set.
Figure 13. Typical read cycle from memory [18]
Figure 14. Typical write cycle to memory [18]
As we can see from Figure 14, a write takes only one clock cycle to complete.
The 4-clock-cycle read latency requires the controller to wait until the first data
arrives before performing the datapath operations. The memory read thus introduces
latency into the design and certainly affects the performance. For larger taxa sets, this
latency grows and drastically affects the speed of the design.
4.3.19 Addressing Schemes
The WILDCARDTM memory banks each have a maximum capacity of 65536
words of 32 bits. This small memory capacity not only limits the number of taxa
that can be handled on this system but also affects the way the memory is
addressed. The addressing scheme discussed in Section 4.3.15 is not effective when the
number of taxa increases to, say, 256. The scheme uses the following methodology. Let
us assume we have the distance matrix given in Figure 15 below.
Figure 15. Distance Matrix
The distances 6, 8, and 3 are addressed by D[0, 1] D[0, 2], D[0, 3]. Thus, while
writing data from the host, we place these three values at indexes 1, 2, 3 of the array and
transfer the data to WILDCARDTM memory. These values would be written in addresses
1, 2, and 3 of the memory. Thus value 6 in address 1 of the memory can be referenced
using the address “00000000000000000001” obtained by concatenating the nodes
“0000000000” and “0000000001” together, and similarly for locations 2 and 3.
Now, distance 7 is D[1, 2] and thus is written in index 1026 of the C array and
transferred to the memory. It can thus be referenced by generating the address
“00000000010000000010,” obtained by concatenating nodes “0000000001” and
“0000000010” together. These nodes represent the indexes of the matrix D. Thus we use
an addressing scheme that is similar to the way we reference the matrix values.
For datasets of 256 taxa or higher, this scheme fails, since for obtaining distances
between say, nodes 254 and 255, we have an address “00111111100011111111” which
represents a value much larger than 65536. Also, using this scheme, we are wasting
memory locations. For example, the consecutive distance values 6, 8, 3 are stored one
after another in locations 1, 2, 3, respectively, but the distance value 7 suddenly jumps to
the memory location 1026. This waste of memory locations would reduce as the taxa size
increases, but it is still unacceptable.
To avoid this problem and to be able to implement larger taxa datasets we have to
employ a linear addressing scheme. The catch in this scheme is that our design needs to
maintain a record of the node information while fetching every distance value, so that we
can know which two nodes have the minimal distance between them. Thus the address
generation using concatenation of nodes is important for the design. We therefore resort
to an address modification scheme in which we generate the address in the original
scheme and then modify it into a linear 16-bit address that does not go beyond our limit
of 65536.
The address modification is a complex process, takes additional clock cycles and
requires additional states in the control structure. This causes the design to slow down
and the performance is affected quite a bit. We will discuss the impact of the address
modification on performance in later chapters.
Let us assume in the address “node1&node2” node1 refers to the row of the
matrix and node2 to the column. Thus, for the matrix given in Figure 15, we have the
following mapping for each address value as given in Table 2. The number of taxa is n =
4.
Matrix format   Linear format   Values per row
0-1             0               n-1
0-2             1
0-3             2
1-2             3               n-2
1-3             4
2-3             5               n-3
Table 2: Address mapping
From the above table we deduce that each row maps to a particular base address.
For example, row 0 maps to 0; for value 0-1 we have address 0, and for 0-2 we have 1.
We can see that for column value 2 the address 0, to which the row maps, is
incremented by 1. Similarly, for 0-3, the address 0 is incremented twice. Thus, by
mapping a row to a base address and adding (column - row - 1) to it, we obtain the
linear address used to fetch the required distance value. Row 0 has (n - 1) values
and a base address of 0, so the base address for row 1 is (0 + n - 1), which
equals 3 for n = 4. Thus, for 1-2 we have an address of 3, and for 1-3 we again add
(column - row - 1) to obtain the correct linear address. The steps used to
obtain the linear address are:
Obtain the base address to which the row maps to from a map memory
Add the value (column – row - 1 ) to obtain the final address
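The two-step modification can be checked with the small C sketch below; row_base recomputes the per-row base addresses that the design stores in the row-map Block RAM (the loop stands in for the memory lookup):

```c
/* Base address for `row` in an n-taxa upper-triangular distance matrix:
   row 0 starts at 0; each row r contributes (n - 1 - r) values. */
int row_base(int n, int row) {
    int base = 0;
    for (int r = 0; r < row; r++)
        base += n - 1 - r;
    return base;
}

/* Linear address: base of the row plus the offset within the row. */
int linear_address(int n, int row, int col) {
    return row_base(n, row) + (col - row - 1);
}
```

For n = 4 this reproduces Table 2: 0-1 maps to 0, 1-2 to 3, and 2-3 to 5, and every address stays below the 65536-word limit.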
The above methodology is employed to perform the address modification. The base
addresses to which each row maps are written initially into Block RAM. This
initialization of the node memory and row-map memory requires further additions to
the controller states and thus adds considerable delay to the design. This delay grows for larger taxa
datasets. The address modification component is placed within the PE outside of the
UPGMA application component. The address generated from the UPGMA component is
fed to the address modification component, and the modified address is fed to the
memory interface components.
4.3.20 Top-level Block
The top-level block in the design provides integration and routing of all the sub
modules described above. The final top level design is then placed within the VHDL
model for the PE, as a sub component to that model, and is interfaced with the memory
and LAD Bus interfaces for handling the transfer of data.
The VHDL models for all the blocks in the design are listed in Appendix A.
4.4 Design Verification
The design verification of the UPGMA design was performed using the
ModelSim simulation environment. The VHDL models of the WILDCARDTM system
provide a simulation model that could be used to run a host-based simulation. The
simulation was done for various taxa datasets. The benchmark dataset used was a 57-taxa
dataset for which we had the resultant output tree data generated from the software
implementation. The output tree generated by the software simulation of the hardware
design was compared with the benchmark data and found to match.
To verify the working of the hardware implementation on the WILDCARDTM
system, the data generated from the hardware implementation was compared with the
same benchmark data. The two results matched perfectly.
Data generated through the test data generating software was fed to both software
and hardware designs and the resulting output was compared. We found that the two
outputs matched, indicating that the hardware design was working correctly.
CHAPTER 5
EXPERIMENTAL DATA SET AND PERFORMANCE MEASUREMENT
5.1 Experimental Apparatus for UPGMA
The WILDCARDTM host-programming environment provides us the capability to
program the WILDCARDTM system and also allows us to create templates that are used
to write the host program. Looking back at the software design hierarchy explained in
Chapter 3, we see that the host “driver” program is written in C. The WILDCARDTM
provides the API routines that are used in the host program to perform the following
functions: (1) read and write to the on-board SRAM memories; (2) wait for the Virtex®
PE to interrupt (or, alternately, poll the status register for completion of a WILDCARDTM
controller operation), and (3) process the results of the API-initiated operation.
The UPGMA host program was written based on the example templates provided by
Annapolis Microsystems® for setting up a custom computing application for reading and
writing the SRAM memory banks, reading and writing data to the Virtex® Processing
Element (PE) register space, and for processing PE interrupts. Using these examples as
guides, we created a complete host-based, experiment “driver” application, employing
the above three components, to perform the following host-to-computing server protocol
steps:
Initialize the WILDCARDTM system;
Program the PE from the image file;
Set the Clock frequency;
Enable PE interrupt line;
De-assert PE Reset line;
Read distance data from the file into a distance array;
Transfer distance data from the distance array to WILDCARDTM Left memory;
Write the value of the number of taxa being operated upon into a PE register;
this triggers the design to start running and to assert a "done" signal after it finishes;
The C program waits until the done signal is set;
Reads data from the Right memory; and,
After reading, it outputs the data into a destination file in the PC host file system.
The PE initialization includes "opening" the board by calling the WC_Open( ) routine,
applying power to the board, asserting reset lines, and clearing any pending interrupt
requests left unprocessed by previous application programs. Once the design has been
synthesized, the EDIF file from the synthesis run is transformed into a placed and
routed design for the Virtex® FPGA, and the image file is generated by running the
Xilinx® M1 Alliance Series place and route tools.
This image is placed within the C project directory and is used to configure the PE by
calling the WC_PeProgramFromFile( ) API routine. After the PE image is loaded onto
the device, the PE clock frequency is set by calling the routine WC_SetClkFrequency( ),
the interrupt lines are enabled, and the Reset line de-asserted. The WILDCARDTM board
is then ready for transfer of data to the on-board SRAM memories.
The distance data is written into the left memory, while the right SRAM memory is
used for storing the output tree data. The number of taxa on which we are operating is
written into a single 32-bit register on the PE. The host C program then goes into “sleep”
mode, waiting for the PE interrupt to be set. Meanwhile, the UPGMA logic starts
executing, and operates on the distance data in order to generate the output tree data.
After the design finishes processing, it generates a “done” signal that is tied to the PE
interrupt line. Once the PE interrupt is set, the host C program comes out of its wait state
and starts processing the PE interrupt. The host program clears the interrupt and starts
reading the Phylogenetic tree data from the Right SRAM memory. Once all the output
tree data is read from the right memory and written into an output file, the “driver”
program clears all the memory buffers allocated during the execution, and proceeds to
“close” the device by calling the WC_Close( ) API routine. The C code for the host-
based experiment “driver” program is provided in the Appendix D.
5.2 Generating Random Taxa Test Data Sets
A program written in C++, using the MFC programming environment, was used
to generate the test data for testing the implementation of the UPGMA algorithm. The
program takes as input the following parameters: (1) the number of taxa; (2) the
maximum value of inter-node distance; and, (3) the number of repetitions of a single
distance value in the data set.
The test data are generated for taxa sizes of 10, 16, 32, 50, 64, 75, 100, 128, 150,
175, 200, 225 and 256. For each taxa size, ten different data sets are randomly generated
for that number of taxa. Furthermore, each created data set has its data values subjected
to permutations, creating up to 10 permutations per data set per number of taxa. The C++
code for the test data generation is given in Appendix E.
Figure 16. Test Data Generator Input Dialog Box.
Figure 16 presents a screenshot of the dialog box used by the program for
generating test data. The taxa size, maximum nodal distance, and the number of
repetitions of a particular distance value are given as inputs. When the data has been
generated the program pops up a confirmation dialog box.
The data values are generated randomly, making sure that each value is within the
maximum nodal distance limit set by the user. Also, the number of repetitions of
each value in the data set is constrained to be at most the repetition limit
specified by the user. For each taxa size, ten different datasets are generated, and for
each of these ten datasets ten different permutations are generated by changing the
positions of the distance values within the distance matrix.
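A stripped-down version of the generator's core logic is sketched below in C. The original tool is a C++/MFC dialog application; the repetition-count constraint and the permutation step are omitted here for brevity:

```c
#include <stdlib.h>

/* Fill the upper triangle of an n-taxa distance matrix with random
   values in [1, max_dist]; out must hold n*(n-1)/2 entries. */
void generate_distances(unsigned seed, int n, int max_dist, int *out) {
    srand(seed);
    int idx = 0;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            out[idx++] = 1 + rand() % max_dist;
}
```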
5.3 Measuring Time
The time taken for the UPGMA implementation to execute on the WILDCARDTM
system is measured using standard C time function calls. Time measurements are
collected for the time taken for the program to transfer the distance data to the memory,
generate the tree, and read back the output tree data from the memory. The time is
measured in terms of CPU clock ticks using the standard C language clock( ) function
call.
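The measurement idiom is the standard clock() pattern; a minimal sketch follows, in which the workload is a stand-in for the actual transfer, compute, and readback phases:

```c
#include <time.h>

/* Measure the CPU clock ticks consumed by an operation, as the host
   driver does around the transfer, compute, and readback phases. */
clock_t ticks_for(void (*op)(void)) {
    clock_t start = clock();
    op();
    return clock() - start;
}

/* Stand-in workload for demonstration. */
static volatile long sink;
void demo_op(void) {
    for (long i = 0; i < 100000; i++)
        sink += i;
}
```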
The time taken for memory transfer is measured separately in order to analyze the
cost of transferring data to and from the WILDCARDTM memory banks. This is done to
give us an idea of how the cost affects the performance of the implementation for high
values of N, the number of taxa. The current maximum of 256 taxa limits the number of
distance values to be written to the memory. Also, the WILDCARDTM memory banks are
65536 (64K) words, with each word being 32-bits in width. This constrains the number
of taxa that can be operated on for a given UPGMA run.
Through independent tests written for the WILDCARDTM system, the time taken
for writing to each of the memories has been collected and tabulated. Although not
shown here, the numbers collected indicate that the cost for writing the entire memory is
about 20 CPU clock ticks, while reading back the entire memory is about 100 clock ticks
on the 800 MHz Pentium III processor serving as the experimental workstation. This
indicates that reading data from the SRAM memories from the host is a more expensive
operation than writing to them.
For the purposes of our experiment, we write distance data and read back tree data
from the left and right on-board SRAM memories, respectively. As we are constrained to
operate upon at most a 256-taxa data set, the number of distance values needed to write a
complete matrix to the memory is 32,640, while the maximum number of memory locations
to be read back for the resulting tree is 511. These two values are obtained from
the following formulae.
Number of distance values = N(N-1)/2
Number of tree nodes = 2N - 1
Thus, since even our largest data set contains fewer values than the maximum
capacity of the memory banks, the memory writes take less than the 20 CPU ticks
needed for writing the entire memory array. Similarly, the number of output words to
be read back from the memory is small compared to the full memory size, so the time
taken is much less than the 100 ticks for reading an entire memory. The cost of
writing to and reading from the on-board SRAM memories is therefore not a large
factor in the performance overhead of our implementation.
However, if we look at realistic values of N, which can go as high as
10,000, the cost of reading and writing the memories becomes important. To get an
idea of what the cost might be, we can extrapolate the timing data, assuming that we
have unlimited memory capacity on our hardware board. We could write to the
WILDCARDTM memory banks multiple times to find the cost of writing more than
65,536 words and use this value to obtain a gross estimate of the cost to write to the
memory for large datasets. Similarly, we could read back data from the memory multiple
times and obtain an estimate of the cost to read the memory for large datasets.
This cost data could then be used to obtain a gross estimate of the overall
performance of the algorithm being implemented on the hardware. This provides a
theoretical extrapolation, and not the exact performance cost; however, it provides
valuable information on how the algorithm performance might scale to larger number of
taxa data sets, and whether the memory access costs would have a significant or minor
impact (assuming we had reasonably unlimited memory available). This data is
presented in Chapter 7 as part of the discussion of conclusions of this research.
CHAPTER 6
EXPERIMENTAL METHOD AND RESULTS
In this chapter, we present the results of running Phylogenetics data sets against
the UPGMA implementation on the WILDCARDTM-based reconfigurable custom
computing machine. We present the resultant data sets in terms of a bounded clock cycle
count using the clocking frequency of the host PC’s CPU clock, which gives us a count
of the total number of host clock cycles for a given computation run. We use this, as
opposed to using the on-board FPGA clock, as the former takes into account the
communication overhead of getting data to and from the WILDCARDTM board.
6.1 Running the Experiments
We take randomly generated data sets, permute them, and execute them on the
WILDCARDTM. We then increase the number of taxa considered in the input distance
matrix, generate new data sets and permute them, and execute them on the UPGMA
processor. Test data for taxa sizes of 10, 16, 32, 50, 64, 75, 100, 128, 150, 175, 200,
225 and 256 were executed. For each taxa size, ten different data sets, along with ten
different permutations of certain datasets, were run and timing results collected. The
results are described in the following sections.
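The test-input generation just described can be sketched as follows. The function names and the value range are illustrative assumptions, not the actual generator used in the experiments.

```python
import random

def random_distance_matrix(n, seed=None):
    """Symmetric n x n matrix with zero diagonal, as a UPGMA input."""
    rng = random.Random(seed)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = rng.uniform(1.0, 100.0)  # assumed range
    return d

def permute_matrix(d, order):
    """Relabel taxa: row/column i of the result is taxon order[i]."""
    n = len(d)
    return [[d[order[i]][order[j]] for j in range(n)] for i in range(n)]

# Generate one dataset and one random permutation of it.
n = 10
d = random_distance_matrix(n, seed=1)
order = list(range(n))
random.Random(2).shuffle(order)
p = permute_matrix(d, order)
```

A permuted matrix contains exactly the same distances under a different taxon labeling, which is why permutation is not expected to change the computation's latency.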
6.2 Experimental Results for Latency
The time taken for each taxon-size dataset, measured in CPU clock ticks, was
collected for the different datasets and permutations. The average time taken over ten
different permutations of each of the ten datasets for each taxon size is given in Tables 4
through 6. Ttotal is the total latency of the platform--the number of clock ticks taken for a
complete design run, including the data transfer between the host and the WILDCARDTM.
One aspect of defining the data set for purposes of running experiments is
permuting the data to assess whether permutation impacts the execution latency. In some
implementations of UPGMA in software, permutation might affect the execution of a
given data set at some number of taxa. The permutations were randomly generated along
with the data sets. However, we wanted to determine whether this aspect of the
organization of the data would affect the design in some meaningful way before taking
the time to run the full set of experiments.
Our expectation was that permuting the data would not be a significant factor in
the variation of latency values, because the time to perform computations on fixed-width
operators is largely independent of the actual data values passed as operands.
From the data collected from the sample permutation runs, this seems to be the case.
This is shown in Table 3 and Figure 17 below.
Table 3. Timing Results for Permuted Datasets.
Figure 17. Frequency Distribution for Latency versus Taxa Data Set Permutation.
From this analysis of the permutation, we conclude that we don’t need to consider
permutation of the data set values for a particular execution run. Therefore, we focus our
presentation of the data on the different UPGMA execution runs using randomized data
sets for each of the selected number of Phylogenetic taxa.
We next examine the response of the custom computing machine in terms of the
Mean Latency (averaging the data set samples) versus the number of taxa.
This is shown several different ways, so as to highlight the statistical convergence of the
latency values around the mean values computed across the ten randomized data sets.
The first plot in Figure 18 shows the basic Latency response curve as the number
of taxa increases to the maximum value of 256—the maximum number that can be stored
in the available memory on the WILDCARDTM, given the architecture.
We wanted to evaluate the deviation from the mean over the data sets for each
number of taxa, and observe what happens to this deviation as the number of taxa
increases to the maximum targeted for this research. What we see in the Latency data for
the different data sets--for a given number of taxa--is that the data tends to tightly cluster
around the mean, indicating minimal deviation. There is some wider variance as the
number of taxa grows, as evidenced from the curve in Figure 19, which gives the Latency
in log scale. The variance is slightly greater for the 200 and 256 datasets, as seen in the
curve. The rest of the datasets converge well toward their means.
We are not able to grow the number of taxa on the current
reconfigurable computing platform based on the WILDCARDTM to determine whether
there is a real trend in the deviation data. However, we believe that the results would
not be affected as the number of taxa grows. This is because hardware
computation speed is relatively fixed for fixed data bit-widths. The combinational circuit
would have a fixed latency, thus the computations would have a fixed latency. This leads
us to believe that the data results would not differ significantly with increase in the
number of taxa.
Figure 18. Mean Latency versus Number of Taxa (Normal Scale).
Figure 19. Latency versus Number of Taxa (Log Scale).
The deviation we see in our current results, we believe, can be attributed to
“noise” on the host PC side, as the host is not dedicated to running the WILDCARDTM
program exclusively, but at the same time has other processes running that can skew the
count of the clock ticks.
Table 4, given below, gives the timing results for the WILDCARDTM UPGMA
program running datasets of different taxa sizes. It lists the latency in clock ticks for
each of the ten different datasets for every taxon size, along with the mean, standard
deviation and variance of the ten datasets for a given number of taxa.
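The per-taxa statistics reported in Table 4 (mean, standard deviation and variance over the ten datasets) can be computed with the standard library; the sample values below are placeholders, not the measured results.

```python
import statistics

# Hypothetical latency samples (clock ticks) for ten datasets at one taxa size.
samples = [14.1, 13.9, 14.0, 14.2, 13.8, 14.0, 14.1, 13.9, 14.0, 14.0]

mean = statistics.mean(samples)
stdev = statistics.stdev(samples)        # sample standard deviation
variance = statistics.variance(samples)  # sample variance = stdev squared
```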
We examine the performance of the UPGMA processor and compare it with standard
complexity functions to obtain an upper bound in terms of Big-Oh.
Table 4. Latency Values for Data Sets at Generated Number of Taxa.
6.3 Bounding Time Complexity
Given the Latency curve as our number of taxa grows, we want to understand the
results in terms of the time complexity. Stanat and McAllister [25] provide an
appropriate taxonomy against which we can attempt to qualitatively “fit” our resultant
performance curve against those of standard time complexity functions. Given that we
have selected a means of measuring Latency that incorporates communication overhead,
and that we randomize and permute our data sets, we assume we are working with worst-
case behavior. We want to understand the behavior in terms of the standard forms of
Big-Oh.
Our first attempt is to compare our Latency plot against the base function plots for
the O(N), O(N log N), and O(N²) time complexity patterns. We show the Excel® plots
for the data shown in Table 4 in the plots of Figure 20 for both normal scale and for
logarithmic scale. We attempt to carry out a qualitative assessment of time complexity
bounds without resorting to deriving more precise recurrence expressions—although we
are able to generate curve-fitting equations directly onto the Excel plots.
What we see from the plots is that—given the limited range of N (number of taxa)
covered under the scope of this research—we appear bounded by O(N log N) time
complexity. However, the other conclusion we draw from this data is that we are too
constrained by lack of a sizable memory space (space complexity) in which to store a
larger number of matrix distance values for processing a greater number of taxa, N.
Therefore, we cannot draw a definitive conclusion about the performance of the
computing system for large values of N. However, we will explore what we might need
to do to grow to considerably larger values of N, into the thousands of taxa, in the
conclusion of this work. Also, to assess the benefits, we’ll use comparative data.
Figure 20. Bounding of Latency by Time Complexity Functions.
Figure 21. Bounding Latency by Time Complexity Functions Computed in Excel.
Finally, before leaving this aspect of the analysis, we show a different plot in
Figure 21, showing how difficult it is to qualitatively assess the time complexity, by
using the Excel® plot of trend analysis of the Latency curve, showing both square and
cube polynomial trend curves. The Excel software uses regression analysis to come up
with the trendlines. The trendlines help us in predicting the behavior of the Latency
curve with increase in number of taxa beyond our current 256 max size. We just don’t
have enough experimental data ourselves to see what happens to the Latency for larger
values of N. For this, we’d need to move the design to a larger platform—such as the
Star Bridge HC-36m or the SRC 6e, which would be the subject of future research.
However, the trendlines give us a way of characterizing our upper-bound performance
for values of N greater than 256, using the trend curves as a guide to how the design
might scale. From the trendline's forward prediction we obtain the R² value. The R²
value, also known as the coefficient of determination, ranges from 0 to 1 and indicates
whether the estimated predictive values of the trendline accurately match the
actual data. A trendline is most reliable when the R² value is at or near 1. We see
that the cube polynomial trendline provides the best bounding for the Latency curve, as its
R² value is better than that of the square polynomial trendline. We therefore conclude
that the algorithm's complexity is bounded by O(N³) for the hardware implementation.
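The regression-plus-R² procedure that Excel performs on the trendlines can be sketched directly. The latency values below are synthetic cubic-like data for illustration, not the measured results of Table 4.

```python
def polyfit(xs, ys, deg):
    """Least-squares polynomial fit via normal equations and Gaussian
    elimination with partial pivoting. Returns coeffs[i] for x**i."""
    m = deg + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        coeffs[i] = (b[i] - sum(a[i][j] * coeffs[j]
                                for j in range(i + 1, m))) / a[i][i]
    return coeffs

def r_squared(xs, ys, coeffs):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    pred = [sum(c * x ** i for i, c in enumerate(coeffs)) for x in xs]
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, pred))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Illustrative data that grows roughly cubically (not the measured latencies).
taxa = [10, 16, 32, 50, 64, 75, 100, 128, 150, 175, 200, 225, 256]
lat = [0.00005 * n ** 3 + 5 for n in taxa]

r2_square = r_squared(taxa, lat, polyfit(taxa, lat, 2))
r2_cube = r_squared(taxa, lat, polyfit(taxa, lat, 3))
# For cubic-like data the degree-3 fit yields the R2 value closer to 1.
```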
We now try to measure the quality of the solution by comparing the performance
of the reconfigurable custom computing solution against that of the baseline execution of
PHYLIP, the version of UPGMA software written in C by Felsenstein et al. [10].
6.4 Benchmarking Against PHYLIP
The software timing data is collected by running the PHYLIP UPGMA C code on
the same PC on which the WILDCARDTM host program is run. As before, we execute
the experiment and collect run-time data across a range of values of N, with different
randomized data sets that have been permuted (selecting half the number of permutations
as for the hardware version, for sake of brevity). For this, we use the same data sets that
were used to execute the UPGMA algorithm running on the WILDCARDTM. The
average time taken for the program to run under five different permutations of each of the
ten different datasets for each taxon size is given in Table 5 that follows. The run-time
plots corresponding to those for Latency of the software version are given in the figures
that follow.
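For reference, the computation being timed on both platforms is standard UPGMA clustering: repeatedly merge the closest pair of clusters, averaging distances weighted by cluster size, with each internal node's height set to half the merge distance. The following is a minimal textbook sketch, an illustration only and not PHYLIP's actual C implementation.

```python
def upgma(d, names):
    """Textbook UPGMA over a symmetric distance matrix d.
    Returns a nested tuple (left, height, right) for each merge."""
    clusters = {i: (names[i], 1) for i in range(len(d))}  # id -> (subtree, size)
    dist = {(i, j): d[i][j]
            for i in range(len(d)) for j in range(i + 1, len(d))}
    nxt = len(d)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)      # closest pair, i < j
        dij = dist.pop((i, j))
        ti, ni = clusters.pop(i)
        tj, nj = clusters.pop(j)
        for k in list(clusters):
            # Size-weighted average distance to the new cluster.
            dik = dist.pop((min(i, k), max(i, k)))
            djk = dist.pop((min(j, k), max(j, k)))
            dist[(k, nxt)] = (ni * dik + nj * djk) / (ni + nj)
        clusters[nxt] = ((ti, dij / 2, tj), ni + nj)
        nxt += 1
    (tree, _), = clusters.values()
    return tree

tree = upgma([[0, 2, 4], [2, 0, 4], [4, 4, 0]], ["A", "B", "C"])
# A and B merge first (distance 2, height 1), then C joins at height 2.
```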
Figure 22. PHYLIP C run time performance
Figure 22 shows the comparison of the PHYLIP C run-time performance with the
N log N and N² curves. These curves were added using trendlines without predictive
analysis. From this plot, we observe that the PHYLIP C run-time performance curve is
bounded above by the N² curve yet closely follows the N log N plot as its lower bound.
We believe that our limited number of taxa does not show the true nature of the curve,
and thus we would like to get a better bounding to obtain a closer match for the algorithm
complexity.
The first plot in Figure 23 provides the plots of Figure 22 on a log scale. On
this scale we find N log N matching the C run-time performance quite closely, but we
still cannot accurately predict the complexity of the algorithm for larger values of N. In
the second plot of Figure 23 we show two polynomial trendlines around the C run-time
performance curve. We have used forward prediction and obtained the R² values to assess
the accuracy of the trendlines. We find that the cube polynomial trendline matches
much better, with an R² value very close to 1. This suggests that the C algorithm provided
by Felsenstein [10] has a complexity of O(N³).
Figure 23. PHYLIP C run time performance with Time-complexity bounding
Table 5. PHYLIP Run-time Raw Data Set.
We now look at the performance comparison of hardware and software
implementations. The Average number of clock ticks taken for each of the taxa sizes, for
both the hardware and software implementations, is given in Table 6. The results show a
significant improvement for taxa up to 64, but then the rate of improvement starts to
decline as the taxa size increases to the 256 maximum for the experiments.
Taxa    Hardware    Software    Improvement
10      8.4         121         14.4
16      8.5         170.1       20
32      9.4         315.3       33.5
50      12          541.7       45.1
64      14          713         50.9
75      20.3        816         40.2
100     39.6        942.1       23.8
128     71.6        1107.5      15.5
150     110.2       1278        11.6
175     162.5       1479        9.1
200     242.6       1788.8      7.4
225     342.1       2250.7      6.6
256     504.4       2659.9      5.3
Table 6. Data Comparison Between Hardware and Software UPGMA Implementations.
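The Improvement column in Table 6 is simply the ratio of software to hardware clock ticks; a minimal sketch of that calculation, using the first three rows of the table:

```python
# Speedup = software ticks / hardware ticks, rounded to one decimal place,
# for the first three taxa sizes (10, 16, 32) in Table 6.
hardware = [8.4, 8.5, 9.4]
software = [121.0, 170.1, 315.3]
improvement = [round(s / h, 1) for h, s in zip(hardware, software)]
```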
This behavior in the hardware implementation is accounted for by the fact that the
FPGA-based designs used to implement taxa counts of 75 and above were adversely
affected by the memory addressing scheme (discussed in Chapter 4) adopted in the final
architecture modifications of the design resident on the WILDCARDTM. The negative
impact on the performance of the design is also attributed to the four-cycle latency of an
SRAM memory read. This latency induces wait states in the control structure, causing the
design to run slower. The address modification would not have been necessary had the
WILDCARDTM system had larger memory banks, so that the original addressing scheme
of concatenating nodal values could still be used.
The improvement over the PHYLIP software implementation goes as high as 50
times for a 64-taxa data set. Beyond this size, the on-board memory address
modification becomes necessary, causing the design's performance to deteriorate: the
speedup for the 75-taxa data set falls to 40 times, and far less for larger data set sizes,
up to the maximum. This deterioration is attributed to the limited memory capacity of
the WILDCARDTM board; larger memory banks, if available, should help scale the
design more effectively. The size of the Virtex XCV300E chip also prevents us from
implementing parallel or pipelined design architectures that might help reduce the
latency of the memory addressing to a certain extent.
Figure 24. Plotting the Performance Improvement over PHYLIP as Taxa Count Grows.
Figure 24 provides a plotted view of how the algorithm scales, as N grows large.
Given the maximum taxa data set at 256, we see that the performance deteriorates once
we encounter the increased overhead of memory address computation on data sets for
more than 64 taxa.
Thus, due to inherent limitations of the WILDCARDTM hardware board on which
our design executes, we could not obtain performance improvements that might
otherwise be obtained by applying custom logic/custom computing methodologies. A
larger board with a larger memory size would allow scaling for larger taxa counts and
certainly provide better performance over the software implementation in PHYLIP.
Figure 25. Plotting the Performance Difference as Taxa Count Grows.
However, if we look at the plot of the performance data itself, and compare the
two curves, we see that the performance improvement does indeed seem to scale, as the
performance curve for the PHYLIP implementation of UPGMA grows at a faster rate
than that of the implementation of UPGMA as a custom computing machine architecture.
Figure 26. Plotting the Performance Difference as Taxa Count Grows (Log Plot).
If we observe the trend on a logarithmic plot, we see that, at the peak
performance point for the custom computing implementation on the WILDCARDTM
(between 64 and 75 taxa), we are operating close to two orders of magnitude faster than
the software PHYLIP implementation. Furthermore, we see that this improvement
decreases to a single order of magnitude--with an apparently decreasing trend in order-
of-magnitude performance difference as we grow to the limit of 256 taxa. This
corroborates the earlier plot showing a final 5X difference in performance between the
two implementations.
CHAPTER 7
SUMMARY AND CONCLUSIONS
7.1 Summary of Research Contributions
Custom computing systems built using reconfigurable logic devices provide
several orders of magnitude speed-up in the execution of algorithms over their
execution on conventional microprocessor-based systems. In addition, such
systems have the flexibility to program--and reprogram via reconfiguration--the actual
logic functions of the VLSI circuit with different applications in time and space. Custom
computing systems are implemented using FPGA custom-logic devices that are easily
and quickly programmed by an end-user. This research conducted design and analysis of
a custom computing application architecture for the UPGMA Bioinformatics algorithm
implemented on an FPGA-based custom-computing platform. We examined different
architectures of the design for the purpose of achieving better resource usage and of
conforming to the constraints of the hardware resources--most notably memory--on the
WILDCARDTM. We discussed the final architecture created and presented results of the
system performance, as measured and compared against that of the UPGMA algorithm
written in C, running on a single-processor Pentium® PC.
7.2 Conclusions
The results presented in Chapter 6 provided us with an insight into the
performance of both the hardware and software implementations. The hardware results
showed little variance for different permutations of a dataset for a given number of taxa.
The timing results also converge towards a mean value showing very little variance over
different datasets for a given number of taxa. The hardware results showed significant
improvement over the software implementation, with performance peaking at the 64-taxa
datasets. For datasets of 75 taxa and above, the performance began to degrade
considerably compared to that of the PHYLIP software implementation—although the
custom computing implementation was still between a half- and a full order of magnitude
faster. The hardware implementation was 50 times faster than the software
implementation for the 64-taxa datasets, indicating a substantial performance
improvement, given the architectural limitations of memory addressing cited earlier.
We have also shown, using predictive analysis in Excel, that both
implementations are bounded by functions with time complexity O(N³). The
polynomial equations generated for both the hardware and software performance curves
were of order N³, with a large difference in the coefficients and
constants of each time function. The predictive polynomial equation generated for the
software performance curve shown in Chapter 6 had large constant and coefficient values
compared to those of the polynomial equation for the hardware performance curve. These
predictive polynomial equations do not represent actual values, but they do give us a
reasonable estimate of the behavior of both implementations had we been able to
scale beyond the limit of 256 taxa.
The large values of the coefficients indicate that the software
implementation has underlying compiler-related and operating-system overhead that
affects its performance. The hardware implementation of the UPGMA algorithm avoids these
sources of overhead by its implementation of the computational units directly onto
FPGA-based hardware storage and functional units. This provides a considerable speed-
up, facilitating a higher-performance solution as is evidenced through the results
obtained.
However, we see that the hardware performance degrades rapidly for datasets of
75 taxa and above. The performance degradation is attributed to the linear addressing
scheme used for the final architecture and to the latency of a single read from the
WILDCARDTM on-board SRAM memory banks. A read from the memory banks takes
4 clock cycles, which adds wait states to the control structure and negatively
impacts the performance of the design. The linear addressing scheme was
employed to facilitate the implementation of taxa counts of 75 and above. The WILDCARDTM
memory provides a maximum of 65,536 words on each bank, and this limitation forced
us to convert the addresses generated by the original addressing scheme into a linear
addressing scheme. The original addressing scheme would generate address values
greater than 65,536, limiting the number of taxa that could be implemented, even though
datasets of 256 taxa could be stored within the 65,536 memory locations. The linear
addressing scheme enables the design to handle larger datasets up to the 256-taxa
limit, the maximum taxa size defined as a goal of this research. However, the address
modifications necessary for this purpose induce additional states in the control structure,
adversely impacting the performance of the design.
The original addressing scheme would require larger memory capabilities on the
hardware that the WILDCARDTM platform lacks. We discuss in the sections below how
larger memory banks--as well as certain architecture modifications--might improve the
performance.
7.3 Future Work
In the earlier sections we looked at certain issues that hampered the
performance of the UPGMA design implemented on the WILDCARDTM system. We list
these issues below:
Memory size limitation and memory addressing schemes
Latency of the memory read
Device size (FPGA resources)
We discuss how alleviating each of these bottlenecks might increase performance.
7.3.1 Memory size and Memory address schemes
The WILDCARDTM system provides two memory banks with 65,536 words
each. The left memory bank was used for storing the distance matrix data and the right
memory bank for storing the tree output data. The limit of 65,536 words on the left memory
necessitates address modification, which degrades the design's performance. We could
overcome this problem in two ways:
Generating a better addressing scheme
Moving to platforms with larger memory banks
The first option would give us a solution that could be implemented on the
currently available WILDCARDTM board, but it would be very difficult because the
address modification is complex, as described in Chapter 4. The second option is easier
and would require us to explore other custom computing platforms that offer larger
memory capacities. Larger memory would eliminate the need for address
modifications and alternative addressing schemes.
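To illustrate why the addressing scheme matters, the following sketch contrasts a concatenated nodal address with a linear row-major address. The 10-bit index fields and the 16-bit (65,536-word) bank width are assumptions for illustration, not necessarily the exact widths used in the design.

```python
# Illustrative contrast between the two addressing schemes discussed above.

def concat_address(i, j, bits=10):
    """Original-style scheme: concatenate two nodal indices into one address.
    Needs 2*bits address bits regardless of how full the matrix is."""
    return (i << bits) | j

def linear_address(i, j, n):
    """Linear scheme: row-major index into an n x n distance matrix."""
    return i * n + j

n = 256
i, j = 200, 180
# Concatenating two 10-bit indices needs 20 address bits, which exceeds a
# 16-bit (65,536-word) bank, even though the 256 x 256 matrix itself fits
# exactly within 65,536 locations under linear addressing.
assert concat_address(i, j) >= 2 ** 16
assert linear_address(i, j, n) < 2 ** 16
```

The cost of the linear scheme is the multiply-and-add per access, which is where the extra control states come from in the hardware design.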
The time taken to write to and read from the memories on the WILDCARDTM board
from the host is also a significant factor in the performance of the design. We
have seen that writing an entire memory bank takes 20 clock ticks on an 800 MHz Intel
Pentium host system, while reading an entire memory bank takes 100 clock ticks, and
that the time taken increases linearly with the number of writes or reads.
Therefore, to read the memory banks three times the host would take 300 clock ticks, and to
write three times it would take 60 clock ticks. This linear increase would certainly affect
the performance, and we would need to look at other architectures that might provide
better performance in terms of reading from and writing to the host.
7.3.2 Latency for a Memory Read
We have seen in the earlier sections that a memory read on the WILDCARDTM
system takes 4 cycles, which hurts performance by slowing the operation of the design.
To overcome this, we would have to look at other custom computing platforms that offer
better read- and write-cycle latencies. This would remove the additional wait states
induced in the control structure and speed up the design.
7.3.3 Device size
The WILDCARDTM carries a Xilinx Virtex XCV300E chip. The Virtex-E chip
has a total of 3,072 slices. This is small compared to the Virtex-II device, which has a
total of 33,732 slices, offering much more space to implement larger designs; it would
also enable us to examine different architectures for the algorithm under consideration,
namely parallel or pipelined architectures. We have seen in the literature that, in general,
parallel architectures offer very good performance improvements [22, 23].
The current implementation of the design takes up 60 percent of the
WILDCARDTM Virtex-E chip. A parallel implementation would likely have multiple
copies of the design components, such as the datapath and control path, running in
parallel. These multiple units would work on sub-parts of the distance matrix. This
parallel operation would speed up the design to a large extent, but the multiple parallel
units would increase the design size, and there would be some penalty in the
communication overhead between the interacting subparts of the problem.
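The row-partitioning idea behind such a parallel decomposition might be sketched as follows; the chunking policy and unit count are illustrative assumptions, not a committed design.

```python
# Sketch of the parallel decomposition suggested above: split the rows of
# the distance matrix across P processing units as near-equal chunks.

def partition_rows(n, p):
    """Assign n matrix rows to p units as contiguous, near-equal chunks."""
    base, extra = divmod(n, p)
    chunks, start = [], 0
    for u in range(p):
        size = base + (1 if u < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

chunks = partition_rows(256, 4)
# Each unit would scan only its own rows for the local minimum distance;
# a small reduction step combines the P local minima, which is where the
# communication overhead mentioned above appears.
```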
Thus, to implement a parallel architecture we would need a larger device or
multiple devices to ensure that we do not run out of resources. However, the speed-up
that can be obtained is attractive enough that it warrants an exploration of the trade-off
between increased resources, parallelism versus communication overhead, and the
impact on computation speed. Therefore, future work should investigate custom
computing architectures offering the requisite resources to implement a parallel
architecture for the UPGMA algorithm on a custom computing fabric.
We have examined the issues that caused problems in implementing the
UPGMA algorithm on the WILDCARDTM system and have discussed how these
problems might be resolved. The options suggested are presented as future work that
might enable a performance improvement for the UPGMA algorithm--one that could
conceivably alter the upper bound of the time complexity of the algorithm itself.
BIBLIOGRAPHY
[1] Andre DeHon, John Wawrzynek, The case for reconfigurable processors. Berkeley
Reconfigurable Architectures, Systems, and Software, University of California,
Berkeley. http://citeseer.nj.nec.com/dehon97case.html.
[2] Nick Tredennick, The case for reconfigurable computing. Micro Design
Resources, Microprocessor Report, Vol. 10, No. 10, Aug 1996.
[3] Stephen Brown and Jonathan Rose, Architecture of FPGAs and CPLDs: A
Tutorial, IEEE Design and Test of Computers, Vol. 13, No. 2, pp. 42-57, 1996.
[4] John V. Oldfield, Richard C. Dorf, System Implementation Strategies, Chapter 1,
Field Programmable Gate Arrays, Reconfigurable logic for Rapid Prototyping and
Implementation of Digital Systems, pg 1-26, Wiley-Interscience Publishing, 1995.
[5] Paul Graham, Brent Nelson, FPGA based Sonar processing. ACM/SIGDA
International Symposium for Field Programmable Gate Arrays. Pg 201-208.
February 1998. http://www.dynamicsilicon.com/Articles/Reconfigurable.pdf
[6] Jeffrey Arnold, Kenneth L. Pocek, Genetic Algorithms In Software and In
Hardware - A Performance Analysis of Workstation and Custom Computing
Machine Implementations, Proceedings of IEEE symposium of Field
Programmable Custom Computing Machines, pg 216-225, April 1996. IEEE
Computer Society.
[7] Jason R. Hess, David C. Lee, Scott J. Harper, Mark T. Jones, and Peter M.
Athanas, Implementation and Evaluation of a Prototype Reconfigurable Router,
Proceedings of IEEE symposium of Field Programmable Custom Computing
Machines, pg 44-50, April 1999. IEEE Computer Society.
[8] R. Petersen, B. L. Hutchings, An Assessment of the Suitability of FPGA-Based
Systems for Use in Digital Signal Processing, In 5th International Workshop on
Field Programmable Logic and Applications, pp 293-302, August 1995, Oxford,
England.
[9] P.W. Dowd, J.T. McHenry, F.A. Pellegrino, T.M. Carrozzi and W.B. Cocks, An
FPGA-Based Coprocessor for ATM Firewalls, Proceedings of the IEEE
Symposium on FPGA's for Custom Computing Machines (FCCM97), pg 30-39,
April 1997.
[10] Joe Felsenstein, PHYLIP source code, Department of Genome Sciences,
University of Washington, http://evolution.genetics.washington.edu/phylip.html
[11] R. Shamir, UPGMA, Tel Aviv University,
http://www.math.tau.ac.il/~rshamir/algmb/00/scribe00/html/lec08/node21.html
[12] Peter H. Weston, Michael D. Crisp, Introduction to Phylogenetic Systematics,
Invited Contributions of the Society of Australian Systematic Biologists, SASB,
http://www.science.uts.edu.au/sasb/WestonCrsip.html.
[13] James P. Davis, Peter J. Waddell, Sreesa Akella, Methods and Architectures for
Realizing Fast Phylogenetic Computation Engines Using VLSI Array Based
Logic, Submitted to IEEE Bioinformatics Conference, Aug, 2002.
[14] R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Building phylogenetic trees, Chapter
7, Biological Sequence Analysis, pg 160-190. Cambridge University Press, 1998.
[15] D.L. Swofford, G.J. Olsen, P.J. Waddell, and D.M. Hillis, Phylogenetic Inference,
Chapter 11, Molecular Systematics, pg 45-572, second edition, (ed. D.M. Hillis,
and C. Mortiz), Sinauer Association, Sunderland, MA, 1996.
[16] Xilinx Inc, Virtex-E 1.8V FPGA Complete Datasheet, March 2003
[17] Duncan A. Buell, Jeffrey M. Arnold, Walter J. Kleinfelder, SPLASH2 FPGAs in a
Custom Computing Machine, IEEE Computer Society Press, 1996.
[18] Annapolis Microsystems Inc, Annapolis WILDCARDTM System Reference Manual,
Revision 2.6, 2003. www.annapmicro.com
[19] SRC Computers Inc., www.srccomputers.com.
[20] StarBridge Systems, www.starbridgesystems.com
[21] Yutana Jawchinda, Hideaki Kobayashi, Quantifying Design Reuse: An HDL-
Based Design Experiment, International HDL Conference, April, 1999.
[22] H. J. Whitehouse, J. M. Speiser, K. Bromley, Signal Processing Applications of
Concurrent Array Processor Technology, Chapter 2, VLSI and Modern Signal
Processing, Prentice-Hall, Inc., 1985.
[23] Axelrod, R., The Complexity of Cooperation: Agent-Based Models of Competition
and Cooperation, Princeton University Press, 1997.
[24] Billsus, D., C. A. Brunk, C. Evans, B. Gladish and M. Pazzani, “Adaptive
Interfaces for Ubiquitous Web Access”, Communications of the ACM, Vol. 45,
No. 5, May 2002, pp. 34-38.
[25] Stanat, D. F. and D. F. McAllister, Discrete Mathematics in Computer Science,
Prentice Hall, Inc., 1977.
APPENDIX A
VHDL SOURCE CODE
--------------------------------------------------------
-- Add, Subtract, decrement modules needed for
-- Address modification
--------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : add1.vhd
-- Entity       : add_1, sub_1, dec_1
-- Architecture : add_1_beh, sub_1_beh, dec_1_beh
--------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;
use ieee.std_logic_arith.all;

entity add_1 is
  port(
    in1 : in  std_logic_vector(9 downto 0);
    in2 : in  std_logic_vector(9 downto 0);
    opt : out std_logic_vector(9 downto 0)
  );
end add_1;

architecture add_1_beh of add_1 is
begin
  opt <= in1 + in2;
end add_1_beh;
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;
use ieee.std_logic_arith.all;

entity sub_1 is
  port(
    in1 : in  std_logic_vector(9 downto 0);
    in2 : in  std_logic_vector(9 downto 0);
    opt : out std_logic_vector(15 downto 0)
  );
end sub_1;

architecture sub_1_beh of sub_1 is
begin
  -- Zero-extend the 10-bit inputs to the 16-bit output width.
  opt <= ("000000" & in1) - ("000000" & in2);
end sub_1_beh;
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;
use ieee.std_logic_arith.all;

entity dec_1 is
  port(
    inp : in  std_logic_vector(15 downto 0);
    opt : out std_logic_vector(15 downto 0)
  );
end dec_1;

architecture dec_1_beh of dec_1 is
begin
  opt <= inp - '1';
end dec_1_beh;
--------------------------------------------------------
-- Height Adder
--------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : adder_h.vhd
-- Entity       : adder_h
-- Architecture : adderh_beh
--------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;
use ieee.std_logic_arith.all;

entity adder_h is
  port(
    Datainp1 : in  std_logic_vector(15 downto 0);
    Datainp2 : in  std_logic_vector(15 downto 0);
    Data_out : out std_logic_vector(15 downto 0)
  );
end adder_h;

architecture adderh_beh of adder_h is
begin
  Data_out <= Datainp1 + Datainp2;
end adderh_beh;
--------------------------------------------------------
-- Adder module
--------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : addernew.vhd
-- Entity       : adder
-- Architecture : adder_beh
--------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;
use ieee.std_logic_arith.all;

entity adder is
  port(
    Datainp1 : in  std_logic_vector(31 downto 0);
    Datainp2 : in  std_logic_vector(31 downto 0);
    Data_out : out std_logic_vector(31 downto 0)
  );
end adder;

architecture adder_beh of adder is
begin
  Data_out <= Datainp1 + Datainp2;
end adder_beh;
--------------------------------------------------------
-- Address register entity architecture
--------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : addr_dd.vhd
-- Entity       : addr_dd
-- Architecture : addr_dd_beh
--------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;

entity addr_dd is
  port(
    addr      : in  std_logic_vector(19 downto 0);
    reset     : in  std_logic;
    clk       : in  std_logic;
    addr_dd_s : out std_logic_vector(19 downto 0)
  );
end addr_dd;

architecture addr_dd_beh of addr_dd is
  signal temp, temp1 : std_logic_vector(19 downto 0);
begin
  -- Two-stage delay register: addr_dd_s lags addr by two clock cycles.
  process(clk, reset)
  begin
    if reset = '1' then
      temp      <= (others => '0');
      temp1     <= (others => '0');
      addr_dd_s <= (others => '0');
    elsif clk = '1' and clk'event then
      temp      <= addr;
      temp1     <= temp;
      addr_dd_s <= temp1;
    end if;
  end process;
end addr_dd_beh;
------------------------------------------------------
-- Height Adder Register entity architecture pair
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : haddregisterh.vhd
-- Entity       : adderhreg
-- Architecture : addhreg_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity adderhreg is
  port(
    addout : in  std_logic_vector(15 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(15 downto 0)
  );
end adderhreg;

architecture addhreg_beh of adderhreg is
begin
  process(clk, reset)
  begin
    if reset = '1' then
      regval <= (others => '0');
    elsif clk = '1' and clk'event then
      if regclr = '1' then
        regval <= (others => '0');
      elsif regen = '1' then
        regval <= addout;
      end if;
    end if;
  end process;
end addhreg_beh;
------------------------------------------------------
-- Distance comparison unit entity architecture pair
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : comparedst.vhd
-- Entity       : comparedst
-- Architecture : comparedst_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity comparedst is
  port(
    Datainp1     : in  std_logic_vector(31 downto 0);
    valid_dst    : in  std_logic;
    distreg_val  : in  std_logic_vector(31 downto 0);
    first_val    : in  std_logic;
    addr         : in  std_logic_vector(19 downto 0);
    distreginp   : out std_logic_vector(31 downto 0);
    distreg_en   : out std_logic;
    rowreginp    : out std_logic_vector(9 downto 0);
    rowreg_en    : out std_logic;
    colreginp    : out std_logic_vector(9 downto 0);
    colreg_en    : out std_logic;
    addr1_reg_en : out std_logic;
    addr2_reg_en : out std_logic
  );
end comparedst;

architecture comparedst_beh of comparedst is
begin
  -- Latch the incoming distance (and the row/column fields of its address)
  -- when it is the first valid value, or when it is smaller than the
  -- currently stored minimum.
  process(addr, first_val, Datainp1, distreg_val, valid_dst)
  begin
    if first_val = '1' and valid_dst = '1' then
      rowreginp  <= addr(19 downto 10);
      colreginp  <= addr(9 downto 0);
      distreginp <= Datainp1;
      rowreg_en <= '1'; colreg_en <= '1'; distreg_en <= '1';
      addr1_reg_en <= '1'; addr2_reg_en <= '1';
    elsif Datainp1 < distreg_val and valid_dst = '1' then
      rowreginp  <= addr(19 downto 10);
      colreginp  <= addr(9 downto 0);
      distreginp <= Datainp1;
      rowreg_en <= '1'; colreg_en <= '1'; distreg_en <= '1';
      addr1_reg_en <= '1'; addr2_reg_en <= '1';
    else
      rowreginp  <= (others => '0');
      colreginp  <= (others => '0');
      distreginp <= (others => '0');
      rowreg_en <= '0'; colreg_en <= '0'; distreg_en <= '0';
      addr1_reg_en <= '0'; addr2_reg_en <= '0';
    end if;
  end process;
end comparedst_beh;
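For reference, the selection rule implemented by comparedst_beh can be modeled in software. The sketch below (Python, not part of the thesis sources; names are illustrative) keeps the smallest valid distance seen so far, together with the row and column fields packed into the 20-bit address, mirroring the first_val and less-than cases above.

```python
# Software model (illustrative) of the comparedst minimum tracker:
# keep the smallest valid distance with its (row, col) address fields.
def compare_step(dist, addr, valid, first, best):
    """best is (min_dist, row, col) or None; returns the updated best."""
    if not valid:
        return best  # invalid sample: hold the current minimum
    row = (addr >> 10) & 0x3FF  # addr(19 downto 10)
    col = addr & 0x3FF          # addr(9 downto 0)
    if first or best is None or dist < best[0]:
        return (dist, row, col)
    return best
```

Scanning the distance matrix with this step yields the globally closest pair, which is the pair UPGMA merges next.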
------------------------------------------------------
-- Controller entity architecture
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : controllerverfull9.vhd
-- Entity       : ctrl_blk
-- Architecture : ctrl_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity ctrl_blk is
  port(
    clk, reset                          : in  std_logic;
    valid_numsp, addr_grt, child_cnt_gr : in  std_logic;
    count_gr, all_nodes_done            : in  std_logic;
    initialized, div_valid, a_grt       : in  std_logic;
    ext_node                            : in  std_logic;
    r_clr, r_inc, a_clr, a_inc          : out std_logic;
    c2_read, mem_update                 : out std_logic;
    R_dec, Rp_dec                       : out std_logic;
    c1_incr, c2_incr, c1p_incr, ch_incr : out std_logic;
    c1_load1, c1_load2                  : out std_logic;
    c2_load1, c2_load2                  : out std_logic;
    c1p_load, ch_load                   : out std_logic;
    c1_clr, c2_clr, c1p_clr, ch_clr     : out std_logic;
    row_col_sel                         : out std_logic_vector(1 downto 0);
    addr2_reg_dec, node_write           : out std_logic;
    read_mem, write_mem                 : out std_logic;
    read_wmem, write_wmem               : out std_logic;
    rowreg_clr, colreg_clr, distreg_clr : out std_logic;
    mulreg_en, mulreg_clr               : out std_logic;
    addregwclr, addregwen               : out std_logic;
    addregclr, addregen                 : out std_logic;
    divreg1clr, divreg1en               : out std_logic;
    initial_run, store_cur_addr         : out std_logic;
    node_mem_initialize, mem_initialize : out std_logic;
    addr_gen1_en, addr_gen2_en          : out std_logic;
    rmem_read, rmem_write               : out std_logic;
    ad_reg_en, ad_reg_clr               : out std_logic;
    row_zero, numsp_val, valid_td       : out std_logic;
    nodeid_sel                          : out std_logic_vector(1 downto 0);
    n_type_sel, incnt_inc, done         : out std_logic
  );
end ctrl_blk;
architecture ctrl_beh of ctrl_blk is

  type state is (idle, wait_init, rmem_init, rmem_init1, node_mem_init,
                 mem_init, wait_st, rmem_read_st, addr_mod1, addr_mod2,
                 fetch_dst2, compare_dst, c2_inc_ld_st, wait_st1,
                 addr2_gen_st, fetch_dst, wait_st2, add_dst, mul_dst,
                 div_dst, wait_rmem, write_dist_to_mem, write_mem_wait,
                 c2_incr_st, c2_read_st1, c2_read_st2, c2_read_st22,
                 br_update1, br_update2, rmem_read_st2, addr_mod1_st,
                 addr_mod2_st, tree_map_init1, tree_map_init2, tree_map_int,
                 tree_map1, tree_map2, done_st);

  signal cur_st      : state;
  signal count_cycle : integer range 0 to 3;
begin

  -- Single-process Moore machine: asynchronous reset; state and all
  -- control outputs are registered on the rising clock edge.
  process(clk, reset)
  begin
    if reset = '1' then
      store_cur_addr <= '0'; r_clr <= '0'; r_inc <= '0'; a_clr <= '0'; a_inc <= '0';
      rmem_read <= '0'; rmem_write <= '0'; ad_reg_en <= '0'; ad_reg_clr <= '0';
      row_zero <= '0'; mem_initialize <= '0'; node_mem_initialize <= '0';
      count_cycle <= 0;
      read_mem <= '0'; write_mem <= '0'; read_wmem <= '0'; write_wmem <= '0';
      addr_gen1_en <= '0'; addr_gen2_en <= '0'; c2_read <= '0'; mem_update <= '0';
      R_dec <= '0'; Rp_dec <= '0';
      c1_incr <= '0'; c2_incr <= '0'; c1p_incr <= '0'; ch_incr <= '0';
      c1_load1 <= '0'; c1_load2 <= '0'; c2_load1 <= '0'; c2_load2 <= '0';
      c1p_load <= '0'; ch_load <= '0';
      c1_clr <= '0'; c2_clr <= '0'; c1p_clr <= '0'; ch_clr <= '0';
      row_col_sel <= "00"; addr2_reg_dec <= '0'; node_write <= '0';
      mulreg_en <= '0'; mulreg_clr <= '0';
      addregwen <= '0'; addregwclr <= '0'; addregen <= '0'; addregclr <= '0';
      distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0';
      divreg1clr <= '0'; divreg1en <= '0';
      numsp_val <= '0'; valid_td <= '0'; incnt_inc <= '0'; initial_run <= '0';
      nodeid_sel <= "00"; n_type_sel <= '0'; done <= '0';
      cur_st <= idle;
    elsif clk = '1' and clk'event then
      case cur_st is

        when idle =>
          done <= '0';
          node_mem_initialize <= '0';
          if valid_numsp = '1' then
            cur_st <= wait_init;
            c1p_incr <= '1'; R_dec <= '1'; Rp_dec <= '1';
            row_zero <= '1'; rmem_write <= '1';
          else
            cur_st <= idle;
            c1p_incr <= '0'; row_zero <= '0'; rmem_write <= '0';
          end if;

        when wait_init =>
          R_dec <= '0'; c1p_incr <= '0'; row_zero <= '0'; rmem_write <= '0';
          a_inc <= '1';
          cur_st <= rmem_init;

        when rmem_init =>
          Rp_dec <= '0'; a_inc <= '1'; rmem_write <= '1';
          if ext_node = '1' then
            cur_st <= rmem_init1;
            row_zero <= '1'; ad_reg_en <= '0'; r_inc <= '1';
          else
            row_zero <= '0'; ad_reg_en <= '1'; r_inc <= '0';
            cur_st <= rmem_init;
          end if;

        when rmem_init1 =>
          row_zero <= '0';
          if a_grt = '1' then
            rmem_write <= '0'; ad_reg_en <= '0'; ad_reg_clr <= '1';
            a_inc <= '0'; r_inc <= '0'; a_clr <= '1'; r_clr <= '1';
            write_wmem <= '1';
            cur_st <= node_mem_init;
          else
            rmem_write <= '1'; ad_reg_en <= '1'; ad_reg_clr <= '0';
            a_inc <= '1'; a_clr <= '0'; r_clr <= '0'; r_inc <= '1';
            write_wmem <= '0';
            cur_st <= rmem_init1;
          end if;

        when node_mem_init =>
          Rp_dec <= '0'; c2_incr <= '0'; c1p_incr <= '0';
          if initialized = '1' then
            node_mem_initialize <= '0'; write_wmem <= '0';
            mem_initialize <= '1'; c1_incr <= '0';
            cur_st <= mem_init;
          else
            node_mem_initialize <= '1'; write_wmem <= '1';
            mem_initialize <= '0'; c1_incr <= '1';
            cur_st <= node_mem_init;
          end if;

        when mem_init =>
          Rp_dec <= '0'; mem_initialize <= '0'; read_mem <= '1'; node_write <= '0';
          c1_incr <= '0'; c1p_incr <= '0'; c2_incr <= '1'; c2_clr <= '0';
          R_dec <= '0'; Rp_dec <= '0';
          cur_st <= wait_st;

        when wait_st =>
          read_mem <= '1'; store_cur_addr <= '1'; initial_run <= '1';
          addr_gen1_en <= '1'; c2_incr <= '0';
          cur_st <= rmem_read_st;

        when rmem_read_st =>
          rmem_read <= '1'; store_cur_addr <= '0';
          c2_incr <= '0'; c2_load1 <= '0'; c2_load2 <= '0'; addr_gen1_en <= '0';
          cur_st <= addr_mod1;

        when addr_mod1 =>
          rmem_read <= '0'; c2_incr <= '0'; c2_load1 <= '0'; c2_load2 <= '0';
          cur_st <= addr_mod2;

        when addr_mod2 =>
          cur_st <= fetch_dst2;

        when fetch_dst2 =>
          read_mem <= '1'; node_write <= '0';
          if count_gr = '1' then
            c2_load1 <= '1'; c2_load2 <= '1'; c2_incr <= '0';
          else
            c2_load1 <= '0'; c2_load2 <= '0'; c2_incr <= '1';
          end if;
          cur_st <= compare_dst;

        when compare_dst =>
          read_mem <= '1';
          c1_load1 <= '0'; c1_load2 <= '0'; c2_load1 <= '0'; c2_load2 <= '0';
          c2_incr <= '0'; initial_run <= '0';
          if addr_grt = '1' then
            cur_st <= wait_st1;
            addr_gen1_en <= '0'; addr_gen2_en <= '0'; store_cur_addr <= '0';
          else
            cur_st <= rmem_read_st;
            addr_gen1_en <= '1'; addr_gen2_en <= '0'; store_cur_addr <= '1';
          end if;

        when wait_st1 =>
          -- wait for four clock cycles so that all data is operated upon
          if count_cycle < 2 then
            cur_st <= wait_st1;
            count_cycle <= count_cycle + 1;
            c1_load1 <= '0'; c2_load1 <= '0';
            c1_clr <= '1'; c2_clr <= '1'; c1p_clr <= '1';
          else
            cur_st <= tree_map_init1;
            count_cycle <= 0;
            c1_load1 <= '1'; c2_load1 <= '1';
            c1_clr <= '0'; c2_clr <= '0';
          end if;

        when br_update1 =>
          mulreg_clr <= '0'; mulreg_en <= '0';
          addregwen <= '0'; addregwclr <= '0'; addregen <= '0'; addregclr <= '0';
          distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0'; divreg1clr <= '0';
          incnt_inc <= '0'; valid_td <= '0'; c2_read <= '0'; c1_incr <= '0';
          if count_gr = '1' then
            cur_st <= c2_incr_st;
            c1_load2 <= '1'; c2_load2 <= '1'; mem_update <= '1'; c2_incr <= '0';
          else
            cur_st <= c2_read_st1;
            c1_load2 <= '0'; c2_load2 <= '0'; mem_update <= '1'; c2_incr <= '1';
          end if;

        when c2_read_st1 =>
          c2_incr <= '0'; c1_incr <= '1'; c2_read <= '1'; mem_update <= '0';
          cur_st <= br_update1;

        when c2_incr_st =>
          mem_update <= '0'; c1_load2 <= '0'; c2_load2 <= '0'; c2_incr <= '1';
          cur_st <= c2_read_st2;

        when c2_read_st2 =>
          c2_incr <= '0'; c1_incr <= '0'; c2_read <= '1'; mem_update <= '0';
          cur_st <= br_update2;

        when br_update2 =>
          c2_read <= '0'; c1_incr <= '0';
          if count_gr = '1' then
            cur_st <= addr2_gen_st;
            addr_gen2_en <= '0'; mem_update <= '0'; c2_incr <= '0';
            c2_clr <= '1'; c1_clr <= '1'; row_col_sel <= "01";
          else
            cur_st <= c2_read_st22;
            addr_gen2_en <= '0'; c2_incr <= '1'; mem_update <= '1';
            row_col_sel <= "00";
          end if;

        when c2_read_st22 =>
          c2_incr <= '0'; c1_incr <= '1'; c2_read <= '1'; mem_update <= '0';
          cur_st <= br_update2;

        when addr2_gen_st =>
          addr_gen2_en <= '1'; c2_clr <= '0'; c1_clr <= '0';
          c2_incr <= '0'; mem_update <= '0';
          cur_st <= rmem_read_st2;

        when rmem_read_st2 =>
          rmem_read <= '1';
          addregwen <= '0'; addregwclr <= '0'; addregen <= '0'; addregclr <= '0';
          distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0'; divreg1clr <= '0';
          ch_incr <= '0'; c2_incr <= '0';
          write_mem <= '0'; write_wmem <= '0'; mem_update <= '0';
          cur_st <= addr_mod1_st;

        when addr_mod1_st =>
          rmem_read <= '0';
          cur_st <= addr_mod2_st;

        when addr_mod2_st =>
          cur_st <= fetch_dst;

        when fetch_dst =>
          addregwen <= '0'; addregwclr <= '0'; addregen <= '0'; addregclr <= '0';
          distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0'; divreg1clr <= '0';
          read_mem <= '1'; c2_incr <= '0'; write_mem <= '0'; write_wmem <= '0';
          mem_update <= '0'; read_wmem <= '1'; addr_gen2_en <= '1'; rmem_read <= '0';
          ch_incr <= '0';
          cur_st <= wait_st2;

        when wait_st2 =>
          -- wait for four clock cycles for the data to arrive
          read_mem <= '0'; write_mem <= '0'; read_wmem <= '0'; write_wmem <= '0';
          addr_gen1_en <= '0';
          if count_cycle < 2 then
            cur_st <= wait_st2;
            count_cycle <= count_cycle + 1;
          else
            cur_st <= mul_dst;
            count_cycle <= 0;
          end if;
          ch_incr <= '0';

        when mul_dst =>
          mulreg_en <= '1'; valid_td <= '0'; incnt_inc <= '0'; done <= '0';
          addr_gen2_en <= '0'; ch_incr <= '0';
          cur_st <= add_dst;  -- transition absent in the source listing;
                              -- restored so add_dst is reachable and the
                              -- machine does not stall here

        when add_dst =>
          read_mem <= '0'; write_mem <= '0'; read_wmem <= '0'; write_wmem <= '0';
          mulreg_en <= '0';
          addregwen <= '1'; addregwclr <= '0'; addregen <= '1'; addregclr <= '0';
          distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0';
          divreg1clr <= '0'; divreg1en <= '0';
          numsp_val <= '0'; valid_td <= '0'; done <= '0'; incnt_inc <= '0';
          if child_cnt_gr = '0' then
            cur_st <= rmem_read_st2;
            row_col_sel <= "10"; ch_incr <= '1';
          else
            cur_st <= div_dst;
            row_col_sel <= "11"; ch_incr <= '1';
          end if;
          addr_gen1_en <= '0'; addr_gen2_en <= '1';

        when div_dst =>
          ch_incr <= '0';
          read_mem <= '0'; write_mem <= '0'; read_wmem <= '0'; write_wmem <= '0';
          addregwen <= '0'; addregwclr <= '0'; addregen <= '0'; addregclr <= '0';
          distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0'; divreg1clr <= '0';
          numsp_val <= '0'; valid_td <= '0'; done <= '0'; incnt_inc <= '0';
          if div_valid = '1' then
            cur_st <= wait_rmem;
            divreg1en <= '1'; rmem_read <= '1';
          else
            cur_st <= div_dst;
            divreg1en <= '0'; rmem_read <= '0';
          end if;

        when wait_rmem =>
          rmem_read <= '0'; read_mem <= '0'; read_wmem <= '0'; divreg1en <= '0';
          c2_incr <= '1';
          cur_st <= write_dist_to_mem;

        when write_dist_to_mem =>
          rmem_read <= '0'; read_mem <= '0'; write_mem <= '1';
          read_wmem <= '0'; write_wmem <= '1';
          addregwen <= '0'; addregwclr <= '1'; addregen <= '0'; addregclr <= '1';
          mulreg_en <= '0'; mulreg_clr <= '1';
          distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0';
          divreg1en <= '0'; divreg1clr <= '0';
          numsp_val <= '0'; valid_td <= '0'; done <= '0'; incnt_inc <= '0';
          c2_incr <= '0';
          cur_st <= write_mem_wait;

        when write_mem_wait =>
          write_wmem <= '0'; write_mem <= '0';
          addregwclr <= '0'; addregclr <= '0'; mulreg_clr <= '0'; c2_incr <= '0';
          if count_gr = '1' then
            cur_st <= mem_init;
            addr_gen2_en <= '0'; node_write <= '1'; c1p_incr <= '1';
            c1p_clr <= '0'; c2_clr <= '1'; R_dec <= '1'; Rp_dec <= '1';
            row_col_sel <= "00";
          else
            addr_gen2_en <= '1';
            cur_st <= rmem_read_st2;
            node_write <= '0'; c2_clr <= '0'; c1p_incr <= '0';
            R_dec <= '0'; Rp_dec <= '0'; row_col_sel <= "01";
          end if;

        when tree_map_init1 =>
          read_mem <= '0'; initial_run <= '0'; addr2_reg_dec <= '1';
          write_mem <= '0'; read_wmem <= '0'; write_wmem <= '0';
          mulreg_en <= '0'; mulreg_clr <= '1';
          addregwen <= '0'; addregwclr <= '1'; addregen <= '0'; addregclr <= '1';
          distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0';
          divreg1en <= '0'; divreg1clr <= '1';
          numsp_val <= '0'; nodeid_sel <= "00"; incnt_inc <= '0'; n_type_sel <= '0';
          valid_td <= '1'; c1_load1 <= '0'; c2_load1 <= '0'; c2_incr <= '1';
          done <= '0';
          cur_st <= tree_map_init2;

        when tree_map_init2 =>
          c2_incr <= '0'; addr2_reg_dec <= '0';
          read_mem <= '0'; write_mem <= '0'; initial_run <= '0';
          read_wmem <= '0'; write_wmem <= '0';
          addregwen <= '0'; addregwclr <= '1'; addregen <= '0'; addregclr <= '1';
          distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0';
          divreg1en <= '0'; divreg1clr <= '1';
          numsp_val <= '0'; nodeid_sel <= "01"; n_type_sel <= '0';
          valid_td <= '1'; done <= '0';
          if all_nodes_done = '1' then
            cur_st <= tree_map2;
            incnt_inc <= '0'; c2_read <= '0';
          else
            incnt_inc <= '1'; c2_read <= '1';
            cur_st <= br_update1;
          end if;

        when tree_map2 =>
          read_mem <= '0'; initial_run <= '0'; write_mem <= '0';
          read_wmem <= '0'; write_wmem <= '0';
          addregwen <= '0'; addregwclr <= '1'; addregen <= '0'; addregclr <= '1';
          divreg1clr <= '1'; divreg1en <= '0';
          distreg_clr <= '1'; rowreg_clr <= '1'; colreg_clr <= '1';
          numsp_val <= '0'; nodeid_sel <= "10"; n_type_sel <= '1';
          valid_td <= '1'; done <= '0'; incnt_inc <= '0';
          cur_st <= done_st;

        when others =>
          read_mem <= '0'; initial_run <= '0'; write_mem <= '0';
          read_wmem <= '0'; write_wmem <= '0';
          addregwen <= '0'; addregwclr <= '0'; addregclr <= '0';
          divreg1clr <= '0'; divreg1en <= '0';
          numsp_val <= '0'; valid_td <= '0'; incnt_inc <= '0';
          done <= '1';
          cur_st <= idle;

      end case;
    end if;
  end process;

end ctrl_beh;
------------------------------------------------------
-- Counter Entity Architecture pairs
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : counter.vhd
-- Entity       : counter
-- Architecture : counter_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity counter is
  port(
    Clk : in  std_logic;
    Res : in  std_logic;
    ld  : in  std_logic;
    clr : in  std_logic;
    inp : in  std_logic_vector(9 downto 0);
    cv  : in  std_logic_vector(9 downto 0);
    inc : in  std_logic;
    cnt : out std_logic_vector(9 downto 0);
    grt : out std_logic
  );
end counter;

architecture counter_beh of counter is
  signal count : std_logic_vector(9 downto 0);
begin
  -- Loadable, clearable counter that wraps to zero once it reaches cv
  process(clk, res)
  begin
    if res = '1' then
      count <= (others => '0');
    elsif clk = '1' and clk'event then
      if clr = '1' then
        count <= (others => '0');
      elsif ld = '1' then
        count <= inp;
      elsif inc = '1' then
        if count < cv then
          count <= unsigned(count) + '1';
        else
          count <= (others => '0');
        end if;
      end if;
    end if;
  end process;

  -- grt flags when the count has reached the comparison value
  process(res, count, cv)
  begin
    if res = '1' then
      grt <= '0';
    elsif count < cv then
      grt <= '0';
    else
      grt <= '1';
    end if;
  end process;

  cnt <= count;
end counter_beh;
------------------------------------------------
-- child node counter
------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity counterch is
  port(
    Clk : in  std_logic;
    Res : in  std_logic;
    ld  : in  std_logic;
    clr : in  std_logic;
    inp : in  std_logic_vector(1 downto 0);
    inc : in  std_logic;
    cnt : out std_logic_vector(1 downto 0);
    grt : out std_logic
  );
end counterch;

architecture counterch_beh of counterch is
  constant cv    : std_logic_vector(1 downto 0) := "01";
  signal   count : std_logic_vector(1 downto 0);
begin
  process(clk, res)
  begin
    if res = '1' then
      count <= (others => '0');
    elsif clk = '1' and clk'event then
      if clr = '1' then
        count <= (others => '0');
      elsif ld = '1' then
        count <= inp;
      elsif inc = '1' then
        if count < cv then
          count <= unsigned(count) + '1';
        else
          count <= (others => '0');
        end if;
      end if;
    end if;
  end process;

  process(res, count)
  begin
    if res = '1' then
      grt <= '0';
    elsif count < cv then
      grt <= '0';
    else
      grt <= '1';
    end if;
  end process;

  cnt <= count;
end counterch_beh;
------------------------------------------------------
-- Divider Entity - Architecture Pair
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : divider32new1.vhd
-- Entity       : divider
-- Architecture : divider_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity divider is
  port(
    datainp1 : in  std_logic_vector(31 downto 0);
    divider  : in  std_logic_vector(15 downto 0);
    output   : out std_logic_vector(31 downto 0);
    valid    : out std_logic
  );
end divider;

architecture divider_beh of divider is

  -- Restoring shift-subtract division: on return, r(31 downto 0) holds the
  -- quotient, r(63 downto 32) the remainder; ov flags division by zero.
  procedure divide_proc(variable a, b : in  unsigned(31 downto 0);
                        variable r    : out unsigned(63 downto 0);
                        variable ov   : out std_logic) is
    variable temp1        : unsigned(63 downto 0);
    variable temp2, temp3 : unsigned(32 downto 0);
    constant C0 : unsigned := "00000000000000000000000000000000"; -- constant zero
    constant C1 : unsigned := "00000000000000000000000000000001"; -- constant one
  begin
    if b = C0 then
      r(31 downto 0)  := C0;
      r(63 downto 32) := C0;
      ov := '1';
    elsif a = b then
      r(31 downto 0)  := C1;
      r(63 downto 32) := C0;
      ov := '0';
    elsif a < b then
      r(31 downto 0)  := C0;
      r(63 downto 32) := a;
      ov := '0';
    else
      temp1(31 downto 0)  := a;
      temp1(63 downto 32) := C0;
      temp3 := "0" & b;
      for i in 0 to 31 loop
        -- shift the next dividend bit into the partial remainder
        temp1(63 downto 1) := temp1(62 downto 0);
        temp1(0) := '0';
        -- trial subtraction; temp2(32) = '1' means the divisor fits
        temp2 := "1" & temp1(63 downto 32);
        temp2 := temp2 - temp3;
        if temp2(32) = '1' then
          temp1(0) := '1';
          temp1(63 downto 32) := temp2(31 downto 0);
        end if;
      end loop;
      r  := temp1;
      ov := '0';
    end if;
  end divide_proc;

begin

  process(datainp1, divider)
    variable a_inp, b_inp : unsigned(31 downto 0);
    variable r_sig        : unsigned(63 downto 0);
    variable ov_sig       : std_logic;
  begin
    a_inp := unsigned(datainp1);
    b_inp := unsigned("0000000000000000" & divider);
    divide_proc(a_inp, b_inp, r_sig, ov_sig);
    output <= std_logic_vector(r_sig(31 downto 0));
    valid  <= not ov_sig;
  end process;

end divider_beh;
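The divide_proc procedure above is a standard restoring shift-subtract divider. The following Python model (illustrative, not part of the thesis sources) reproduces the same loop and can serve as a golden reference when checking simulation results.

```python
def divide_model(a, b, width=32):
    """Restoring shift-subtract division, mirroring divide_proc.

    Returns (quotient, remainder, overflow); overflow corresponds to the
    ov flag the hardware raises on division by zero."""
    if b == 0:
        return 0, 0, True
    rem, quo = 0, 0
    for i in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder, then
        # subtract the divisor when it fits (the temp2(32) = '1' test).
        rem = (rem << 1) | ((a >> i) & 1)
        if rem >= b:
            rem -= b
            quo |= 1 << i
    return quo, rem, False
```

In the hardware, the quotient lands in r(31 downto 0) and the remainder in r(63 downto 32); the a = b and a < b shortcut branches produce the same (1, 0) and (0, a) results this model computes through the loop.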
------------------------------------------------------
-- Register for first_val signal
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : first_val_reg.vhd
-- Entity       : first_val_ddd
-- Architecture : first_val_ddd_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity first_val_ddd is
  port(
    first_val       : in  std_logic;
    reset           : in  std_logic;
    clk             : in  std_logic;
    first_val_ddd_s : out std_logic
  );
end first_val_ddd;

architecture first_val_ddd_beh of first_val_ddd is
  signal temp  : std_logic;
  signal temp1 : std_logic;
begin
  -- Three-stage delay line for first_val
  process(clk, reset)
  begin
    if reset = '1' then
      temp            <= '0';
      temp1           <= '0';
      first_val_ddd_s <= '0';
    elsif Rising_Edge(clk) then
      temp            <= first_val;
      temp1           <= temp;
      first_val_ddd_s <= temp1;
    end if;
  end process;
end first_val_ddd_beh;
library ieee;
use ieee.std_logic_1164.all;

entity d_f is
  port(
    inp   : in  std_logic;
    reset : in  std_logic;
    clk   : in  std_logic;
    opt   : out std_logic
  );
end d_f;

architecture d_f_beh of d_f is
begin
  -- D flip-flop with asynchronous reset
  process(Clk, Reset)
  begin
    if Reset = '1' then
      opt <= '0';
    elsif Rising_Edge(clk) then
      opt <= inp;
    end if;
  end process;
end d_f_beh;
------------------------------------------------------
-- Height Memory entity - architecture
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : hmemory.vhd
-- Entity       : hmemory
-- Architecture : hmem_behave
------------------------------------------------------
library ieee, xilinx_lib;
use ieee.std_logic_1164.all;
use xilinx_lib.VIRTEX.all;

entity hmemory is
  port(
    Clk      : in  std_logic;
    Reset    : in  std_logic;
    Read     : in  std_logic;
    Write    : in  std_logic;
    numsp    : in  std_logic_vector(9 downto 0);
    Addr     : in  std_logic_vector(9 downto 0);
    Data     : in  std_logic_vector(15 downto 0);
    Data_out : out std_logic_vector(15 downto 0)
  );
end hmemory;

architecture hmem_behave of hmemory is

  signal memAddr   : integer;
  signal Write_En0 : std_logic;
  signal Write_En1 : std_logic;
  signal Write_En2 : std_logic;
  signal Write_En3 : std_logic;
  signal DataOut0  : std_logic_vector(15 downto 0);
  signal DataOut1  : std_logic_vector(15 downto 0);
  signal DataOut2  : std_logic_vector(15 downto 0);
  signal DataOut3  : std_logic_vector(15 downto 0);
  signal enable    : std_logic;

begin

  --**************************************
  -- Write logic: the two high address bits select one of four banks
  Write_En0 <= Write when (Addr(9 downto 8) = "00") else '0';
  Write_En1 <= Write when (Addr(9 downto 8) = "01") else '0';
  Write_En2 <= Write when (Addr(9 downto 8) = "10") else '0';
  Write_En3 <= Write when (Addr(9 downto 8) = "11") else '0';

  --**************************************
  -- Enable logic -- simply the OR of Read and Write
  enable <= Read or Write;

  --**************************************
  -- Read logic -- combinational bank select on the high address bits
  Data_out <= DataOut0 when Addr(9 downto 8) = "00" else
              DataOut1 when Addr(9 downto 8) = "01" else
              DataOut2 when Addr(9 downto 8) = "10" else
              DataOut3 when Addr(9 downto 8) = "11" else
              (others => '0');

  --**************************************
  -- Instantiate 4 256x16 BlockRAMs
  U_BR0 : RAMB4_S16 port map(DI => Data, ADDR => Addr(7 downto 0), CLK => Clk,
                             RST => Reset, WE => Write_En0, EN => enable, DO => DataOut0);
  U_BR1 : RAMB4_S16 port map(DI => Data, ADDR => Addr(7 downto 0), CLK => Clk,
                             RST => Reset, WE => Write_En1, EN => enable, DO => DataOut1);
  U_BR2 : RAMB4_S16 port map(DI => Data, ADDR => Addr(7 downto 0), CLK => Clk,
                             RST => Reset, WE => Write_En2, EN => enable, DO => DataOut2);
  U_BR3 : RAMB4_S16 port map(DI => Data, ADDR => Addr(7 downto 0), CLK => Clk,
                             RST => Reset, WE => Write_En3, EN => enable, DO => DataOut3);

end hmem_behave;
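The banking scheme above builds a 1024x16 memory from four 256-word BlockRAMs, with Addr(9 downto 8) selecting the bank and Addr(7 downto 0) indexing within it. A small software sketch of that address decode (Python, illustrative only, not part of the thesis sources):

```python
# Software model (illustrative) of hmemory's four-bank address decode.
class BankedMemory:
    def __init__(self, banks=4, words=256):
        self.banks = [[0] * words for _ in range(banks)]

    def _decode(self, addr):
        # (bank select = Addr(9:8), bank offset = Addr(7:0))
        return (addr >> 8) & 0x3, addr & 0xFF

    def write(self, addr, data):
        bank, off = self._decode(addr)
        self.banks[bank][off] = data & 0xFFFF  # 16-bit word

    def read(self, addr):
        bank, off = self._decode(addr)
        return self.banks[bank][off]
```

Splitting a flat address space over several small RAM primitives this way is the standard method of composing larger memories on Virtex parts, since each RAMB4 block holds only 4 Kbits.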
------------------------------------------------------
-- Multiplier entity-architecture
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : mult.vhd
-- Entity       : mult
-- Architecture : mult_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use work.std_logic_prims.all;

entity mult is
  port(
    Datainp1 : in  std_logic_vector(31 downto 0);
    Datainp2 : in  std_logic_vector(15 downto 0);
    Data_out : out std_logic_vector(31 downto 0)
  );
end mult;

architecture mult_beh of mult is
begin
  process(Datainp1, Datainp2)
    variable inp1, inp2, outp : integer range 0 to 1000;
  begin
    inp1 := std_logic_vector_to_integer(Datainp1);
    inp2 := std_logic_vector_to_integer(Datainp2);
    outp := inp1 * inp2;
    Data_out <= integer_to_std_logic_vector(outp, 31);
  end process;
end mult_beh;
------------------------------------------------------
-- Multiplexer entity architecture pairs
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : mux.vhd
-- Entity       : mux2_1, mux3_1, mux2_1_16
-- Architecture : mux2_1_behave, mux3_1_behave,
--                mux2_1_16_behave
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity mux2_1 is
  port(
    inp1 : in  std_logic_vector(9 downto 0);
    inp2 : in  std_logic_vector(9 downto 0);
    sel  : in  std_logic;
    outp : out std_logic_vector(9 downto 0)
  );
end mux2_1;

architecture mux2_1_behave of mux2_1 is
begin
  outp <= inp1 when sel = '0' else inp2;
end mux2_1_behave;

library ieee;
use ieee.std_logic_1164.all;

entity mux2_1_16 is
  port(
    inp1 : in  std_logic_vector(15 downto 0);
    inp2 : in  std_logic_vector(15 downto 0);
    sel  : in  std_logic;
    outp : out std_logic_vector(15 downto 0)
  );
end mux2_1_16;

architecture mux2_1_16behave of mux2_1_16 is
begin
  outp <= inp1 when sel = '0' else inp2;
end mux2_1_16behave;

library ieee;
use ieee.std_logic_1164.all;

entity mux3_1 is
  port(
    inp1 : in  std_logic_vector(9 downto 0);
    inp2 : in  std_logic_vector(9 downto 0);
    inp3 : in  std_logic_vector(9 downto 0);
    sel  : in  std_logic_vector(1 downto 0);
    outp : out std_logic_vector(9 downto 0)
  );
end mux3_1;

architecture mux3_1_behave of mux3_1 is
begin
  outp <= inp1 when sel = "01" else
          inp2 when sel = "10" else
          inp3;
end mux3_1_behave;
------------------------------------------------------
-- Register that stores number of species value
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : numspreg.vhd
-- Entity       : numofspreg
-- Architecture : numofspreg_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity numofspreg is
  port(
    numsp    : in  std_logic_vector(9 downto 0);
    reset    : in  std_logic;
    clk      : in  std_logic;
    valid_in : in  std_logic;
    valid    : out std_logic;
    regval   : out std_logic_vector(9 downto 0)
  );
end numofspreg;

architecture numofspreg_beh of numofspreg is
begin
  process(clk, reset)
  begin
    if reset = '1' then
      valid  <= '0';
      regval <= (others => '0');
    elsif Rising_Edge(clk) then
      if valid_in = '1' then
        regval <= numsp;
        valid  <= '1';
      else
        valid  <= '0';
        regval <= (others => '0');
      end if;
    end if;
  end process;
end numofspreg_beh;
------------------------------------------------------
-- Output Selector and Address Generator
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : opt_sel_10.vhd
-- Entity       : opt_sel
-- Architecture : opt_sel_beh
------------------------------------------------------
library ieee, xilinx_lib;
use ieee.std_logic_1164.all;
use xilinx_lib.VIRTEX.all;
use ieee.std_logic_arith.all;
use work.std_logic_prims.all;

entity opt_sel is
  port(
    clk, reset                          : in  std_logic;
    valid_numsp                         : in  std_logic;
    numsp                               : in  std_logic_vector(9 downto 0);
    row_reg, col_reg                    : in  std_logic_vector(9 downto 0);
    store_cur_addr                      : in  std_logic;
    node_mem_initialize, mem_initialize : in  std_logic;
    addr_gen1_en, addr_gen2_en          : in  std_logic;
    c2_read, mem_update                 : in  std_logic;
    addr1_reg_en, addr2_reg_en          : in  std_logic;
    R_dec, Rp_dec                       : in  std_logic;
    c1_incr, c2_incr, c1p_incr, ch_incr : in  std_logic;
    c1_load1, c1_load2                  : in  std_logic;
    c2_load1, c2_load2                  : in  std_logic;
    c1p_load, ch_load                   : in  std_logic;
    c1_clr, c2_clr, c1p_clr, ch_clr     : in  std_logic;
    row_col_sel                         : in  std_logic_vector(1 downto 0);
    node_write, addr2_reg_dec           : in  std_logic;
    distreg_val                         : in  std_logic_vector(31 downto 0);
    nodeid_sel                          : in  std_logic_vector(1 downto 0);
    n_type_sel, incnt_inc, initial_run  : in  std_logic;
    r_clr, r_inc, a_clr, a_inc          : in  std_logic;
    a_grt, ext_node                     : out std_logic;
    numsp_1                             : out std_logic_vector(9 downto 0);
    addr_cnt, row_cnt                   : out std_logic_vector(9 downto 0);
    first_val, initialized, addr_grt    : out std_logic;
    child_cnt_gr, count_gr              : out std_logic;
    all_nodes_done                      : out std_logic;
    addr                                : out std_logic_vector(19 downto 0);
    cnt1                                : out std_logic_vector(9 downto 0);
    nodeid                              : out std_logic_vector(9 downto 0);
    n_type                              : out std_logic;
    par                                 : out std_logic_vector(9 downto 0);
    br_len                              : out std_logic_vector(15 downto 0)
  );
end opt_sel;

architecture opt_sel_beh of opt_sel is
  signal temp_next_int   : integer range 0 to 512;
  signal incnt           : integer range 0 to 511;
  signal numsp1, numsp2  : std_logic_vector(9 downto 0);
  signal numsp_int       : integer range 0 to 511;
  signal valid_numsp_int : std_logic;
  signal child_cnt       : integer range 0 to 2;
  signal max_node_cnt    : integer range 0 to 511;
  signal all_nd_done     : std_logic;

  ----------------------------------------------------
  -- Address generator signal declarations
  ----------------------------------------------------
  signal count1_in, c1_comp_val, count1      : std_logic_vector(9 downto 0);
  signal c1_grt, c1_inc, c1_load             : std_logic;
  signal count1_16                           : std_logic_vector(15 downto 0);
  signal countp_in, c1p_comp_val, count1p    : std_logic_vector(9 downto 0);
  signal c1p_grt, c1p_inc                    : std_logic;
  signal count2_in, c2_comp_val, count2      : std_logic_vector(9 downto 0);
  signal c2_grt, c2_grt_d, c2_grt_p, c2_load : std_logic;
  signal chcount_in, ch_cnt                  : std_logic_vector(1 downto 0);
  signal ch_grt                              : std_logic;
  signal addr1_reg, addr2_reg                : std_logic_vector(9 downto 0);
  signal addr1_reg_d, addr2_reg_d            : std_logic_vector(9 downto 0);
  signal addr1_reg_dd, addr2_reg_dd          : std_logic_vector(9 downto 0);
  signal addr1_reg_ddd, addr2_reg_ddd        : std_logic_vector(9 downto 0);
  signal addr1_reg_dddd, addr2_reg_dddd      : std_logic_vector(9 downto 0);
  signal R, Rp                               : std_logic_vector(9 downto 0);
  signal count2_in_sel                       : std_logic_vector(1 downto 0);

  ----------------------------------------------------
  -- Block ram signals
  ----------------------------------------------------
  signal din_a, din_b, din_b1                  : std_logic_vector(15 downto 0);
  signal ena, wea, enb, web                    : std_logic;
  signal addra, addrb                          : std_logic_vector(15 downto 0);
  signal addra0, addra1                        : std_logic_vector(15 downto 0);
  signal addrb0, addrb1                        : std_logic_vector(15 downto 0);
  signal addr_temp                             : std_logic_vector(9 downto 0);
  signal addr_grt_t, addr_grt_tt, addr_grt_ttt : std_logic;
  signal addr_t                                : std_logic_vector(19 downto 0);
  signal wea0, wea1, web0, web1                : std_logic;
  signal ena0, ena1, enb0, enb1                : std_logic;

  ----------------------------------------------------
  -- Row memory map counters signals
  ----------------------------------------------------
  signal r_cnt   : std_logic_vector(9 downto 0);
  signal a_cval2 : integer range 0 to 511;
  signal a_cnt   : std_logic_vector(9 downto 0);
  signal e_n     : std_logic;

  ----------------------------------------------------
  -- Address generator component declarations
  ----------------------------------------------------
  component counter
    port(Clk : in  std_logic;
         Res : in  std_logic;
         ld  : in  std_logic;
         clr : in  std_logic;
         inp : in  std_logic_vector(9 downto 0);
         cv  : in  std_logic_vector(9 downto 0);
         inc : in  std_logic;
         cnt : out std_logic_vector(9 downto 0);
         grt : out std_logic);
  end component;

  component counterch
    port(Clk : in  std_logic;
         Res : in  std_logic;
         ld  : in  std_logic;
         clr : in  std_logic;
         inp : in  std_logic_vector(1 downto 0);
         inc : in  std_logic;
         cnt : out std_logic_vector(1 downto 0);
         grt : out std_logic);
  end component;
component mux2_1 port(inp1 : in std_logic_vector(9 downto 0);
inp2 : in std_logic_vector(9 downto 0); sel : in std_logic; outp : out std_logic_vector(9 downto 0));
end component;
component mux2_1_16 port(inp1 : in std_logic_vector(15 downto 0);
inp2 : in std_logic_vector(15 downto 0); sel : in std_logic; outp : out std_logic_vector(15 downto 0));end component; component mux3_1 port(inp1 : in std_logic_vector(9 downto 0);
inp2 : in std_logic_vector(9 downto 0); inp3 : in std_logic_vector(9 downto 0); sel : in std_logic_vector(1 downto 0); outp : out std_logic_vector(9 downto 0));
end component;
begin
  process(incnt)
  begin
    temp_next_int <= incnt + 1;
  end process;
  process(clk, reset)
  begin
    if (reset = '1') then
      all_nodes_done <= '0';
      all_nd_done <= '0';
    elsif (clk = '1' and clk'event) then
      if temp_next_int > max_node_cnt then
        all_nodes_done <= '1';
        all_nd_done <= '1';
      else
        all_nodes_done <= '0';
        all_nd_done <= '0';
      end if;
    end if;
  end process;
  process(clk, reset)
  begin
    if (reset = '1') then
      max_node_cnt <= 0;
      a_cval2 <= 0;
    elsif (clk = '1' and clk'event) then
      if valid_numsp_int = '1' then
        max_node_cnt <= (2 * numsp_int);
        a_cval2 <= 2 * numsp_int - 1;
      end if;
    end if;
  end process;
  process(clk, reset)
  begin
    if (reset = '1') then
      numsp_int <= 0;
      valid_numsp_int <= '0';
      numsp1 <= (others => '0');
    elsif (clk = '1' and clk'event) then
      if valid_numsp = '1' then
        numsp1 <= integer_to_std_logic_vector(
                    (std_logic_vector_to_integer(numsp) - 1), 9);
        numsp_int <= std_logic_vector_to_integer(numsp) - 1;
        valid_numsp_int <= '1';
      end if;
    end if;
  end process;
numsp_1 <= numsp1;
  process(clk, reset)
  begin
    if (reset = '1') then
      incnt <= 0;
    elsif (clk = '1' and clk'event) then
      if valid_numsp = '1' then
        incnt <= std_logic_vector_to_integer(numsp);
      elsif incnt_inc = '1' then
        incnt <= incnt + 1;
      end if;
    end if;
  end process;
--opt sel processes
--n_type output select
  process(n_type_sel)
  begin
    n_type <= n_type_sel;
  end process;

  --nodeid output select
  process(nodeid_sel, row_reg, col_reg, incnt)
  begin
    case nodeid_sel is
      when "00" => nodeid <= row_reg;
      when "01" => nodeid <= col_reg;
      when others => nodeid <= integer_to_std_logic_vector(incnt, 9);
    end case;
  end process;

  --par select
  process(incnt)
  begin
    par <= integer_to_std_logic_vector(incnt, 9);
  end process;

  --br_len select
  process(distreg_val)
  begin
    br_len <= '0' & distreg_val(15 downto 1);
  end process;

  -- Signal indicating first value to be stored in least distance register
  process(reset, clk)
  begin
    if reset = '1' then
      first_val <= '0';
    elsif clk = '1' and clk'event then
      if (initial_run = '1') then
        first_val <= '1';
      else
        first_val <= '0';
      end if;
    end if;
  end process;
  ----------------------------------------------------------------------------------
  -- Address Generator
  ----------------------------------------------------------------------------------
  U0 : counter port map(Clk, Reset, c1_load, c1_clr, count1_in,
                        c1_comp_val, c1_inc, count1, c1_grt);
  U1 : counter port map(Clk, Reset, c1p_load, c1p_clr, countp_in,
                        c1p_comp_val, c1p_inc, count1p, c1p_grt);
  U2 : counter port map(Clk, Reset, c2_load, c2_clr, count2_in,
                        c2_comp_val, c2_incr, count2, c2_grt);
  u3 : counterch port map(Clk, Reset, ch_load, ch_clr, chcount_in,
                          ch_incr, ch_cnt, ch_grt);
  U4 : mux2_1 port map(addr2_reg, addr1_reg, c1_load1, count1_in);
  u5 : mux2_1 port map(Rp, R, node_mem_initialize, c1_comp_val);
  u6 : mux3_1 port map(addr2_reg, addr1_reg, count1p, count2_in_sel,
                       count2_in);
  u7 : mux2_1 port map(R, Rp, addr_gen2_en, c2_comp_val);
  u8 : mux2_1_16 port map(addrb, count1_16, node_mem_initialize, din_a);

  count1_16 <= "000000" & count1;
  count2_in_sel <= c2_load1 & c2_load2;
  chcount_in <= "00";
  c1p_comp_val <= R;
  countp_in <= (others => '0');
  c1_inc <= c1_incr or (c2_grt_p and not addr_gen2_en);
  c1p_inc <= c1p_incr or (c2_grt_p and not addr_gen2_en);
  c1_load <= c1_load1 or c1_load2;
  c2_load <= c2_load1 or c2_load2;
  ---------------------------------------------------------------
  -- Ena, Wea, Enb, and Web signals
  ---------------------------------------------------------------
  ena <= node_mem_initialize or mem_initialize or addr_gen1_en or mem_update;
  wea <= node_mem_initialize or mem_update;
  enb <= mem_initialize or addr_gen1_en or addr_gen2_en or c2_read or node_write;
  web <= node_write; --c2_grt and addr_gen2_en;
  ---------------------------------------------------------------
  -- Dia and Dib input data
  ---------------------------------------------------------------
  process (Clk, Reset)
  begin
    if Reset = '1' then
      din_b <= (others => '0');
    elsif Clk = '1' and Clk'event then
      if incnt_inc = '1' then
        din_b <= integer_to_std_logic_vector(incnt, 15);
      end if;
    end if;
  end process;
  --**************************************
  -- Write logic
  --**************************************
  wea0 <= wea when (count1(9 downto 8) = "00") else '0';
  wea1 <= wea when (count1(9 downto 8) = "01") else '0';
  web0 <= web when (count2(9 downto 8) = "00") else '0';
  web1 <= web when (count2(9 downto 8) = "01") else '0';
  --**************************************
  -- Enable logic
  --**************************************
  ena0 <= ena when (count1(9 downto 8) = "00") else '0';
  ena1 <= ena when (count1(9 downto 8) = "01") else '0';
  enb0 <= enb when (count1(9 downto 8) = "00") else '0';
  enb1 <= enb when (count1(9 downto 8) = "01") else '0';
  --**************************************
  -- Read logic
  --**************************************
  addra <= addra0 when count1(9 downto 8) = "00" else
           addra1 when count1(9 downto 8) = "01" else
           (others => '0');
  addrb <= addrb0 when count2(9 downto 8) = "00" else
           addrb1 when count2(9 downto 8) = "01" else
           (others => '0');
  u9 : RAMB4_S16_S16 port map (
    ADDRA => count1(7 downto 0), DIA => din_a, WEA => wea0,
    CLKA => Clk, RSTA => Reset, ENA => ena0, DOA => addra0,
    ADDRB => count2(7 downto 0), DIB => din_b, WEB => web0,
    CLKB => Clk, RSTB => Reset, ENB => enb0, DOB => addrb0);

  u10 : RAMB4_S16_S16 port map (
    ADDRA => count1(7 downto 0), DIA => din_a, WEA => wea1,
    CLKA => Clk, RSTA => Reset, ENA => ena1, DOA => addra1,
    ADDRB => count2(7 downto 0), DIB => din_b, WEB => web1,
    CLKB => Clk, RSTB => Reset, ENB => enb1, DOB => addrb1);
  -----------------------------------------
  -- addr1_reg and addr2_reg
  -----------------------------------------
  process(clk, reset)
  begin
    if Reset = '1' then
      addr1_reg <= (others => '0');
    elsif Clk'event and Clk = '1' then
      if addr1_reg_en = '1' then
        addr1_reg <= addr1_reg_d;
      end if;
    end if;
  end process;

  process (Clk, Reset)
  begin
    if Reset = '1' then
      addr1_reg_d <= (others => '0');
      addr2_reg_d <= (others => '0');
    elsif Clk = '1' and Clk'event then
      if store_cur_addr = '1' then
        addr1_reg_d <= count1;  -- corrected: was a duplicate "addr2_reg_d <= count1",
                                -- which left addr1_reg_d undriven after reset
        addr2_reg_d <= count2;
      end if;
    end if;
  end process;

  process(clk, reset)
  begin
    if Reset = '1' then
      addr2_reg <= (others => '0');
    elsif Clk'event and Clk = '1' then
      if addr2_reg_en = '1' then
        addr2_reg <= addr2_reg_d;
      elsif addr2_reg_dec = '1' then
        addr2_reg <= unsigned(addr2_reg) - 1;
      end if;
    end if;
  end process;
  ------------------------------------------
  -- R and Rp registers
  ------------------------------------------
  process(Clk, Reset)
  begin
    if Reset = '1' then
      R <= (others => '0');
    elsif Clk = '1' and Clk'event then
      if valid_numsp = '1' then
        R <= numsp; --(9 downto 0);
      elsif R_dec = '1' then
        R <= unsigned(R) - '1';
      end if;
    end if;
  end process;

  process(Clk, Reset)
  begin
    if Reset = '1' then
      Rp <= (others => '0');
    elsif Clk = '1' and Clk'event then
      if valid_numsp = '1' then
        Rp <= numsp; --(9 downto 0);
      elsif Rp_dec = '1' then
        Rp <= unsigned(Rp) - '1';
      end if;
    end if;
  end process;
  process(clk, reset)
  begin
    if reset = '1' then
      c2_grt_d <= '0';
    elsif clk'event and clk = '1' then
      c2_grt_d <= c2_grt;
    end if;
  end process;
c2_grt_p <= c2_grt and not(c2_grt_d);
  addr <= addr_temp & addrb(9 downto 0) when addr_gen2_en = '1' else
          addra(9 downto 0) & addrb(9 downto 0);
  -- addr_temp should be set
  addr_temp <= row_reg when row_col_sel = "01" else
               col_reg when row_col_sel = "10" else
               din_b(9 downto 0);

  count_gr <= c2_grt or all_nd_done;
child_cnt_gr <= ch_grt;
cnt1 <= count1;
  process(clk, reset)
  begin
    if reset = '1' then
      addr_grt <= '0';
      addr_grt_t <= '0';
      addr_grt_tt <= '0';
      addr_grt_ttt <= '0';
    elsif Clk'event and Clk = '1' then
      addr_grt_t <= c1_grt;
      addr_grt_tt <= addr_grt_t;
      addr_grt_ttt <= addr_grt_tt;
      addr_grt <= addr_grt_ttt;
    end if;
  end process;
initialized <= c1_grt;
  -----------------------------------------------
  -- counters for row map memory initialization
  -----------------------------------------------
  process(Clk, Reset)
  begin
    if Reset = '1' then
      r_cnt <= (others => '0');
    elsif Clk'event and Clk = '1' then
      if r_clr = '1' then
        r_cnt <= (others => '0');
      elsif r_inc = '1' then
        if r_cnt < R then
          r_cnt <= unsigned(r_cnt) + '1';
        else
          r_cnt <= (others => '0');
        end if;
      end if;
    end if;
  end process;
  row_cnt <= r_cnt;

  process(Clk, Reset)
  begin
    if Reset = '1' then
      a_cnt <= (others => '0');
    elsif Clk'event and Clk = '1' then
      if a_clr = '1' then
        a_cnt <= (others => '0');
      elsif a_inc = '1' then
        if a_cnt < integer_to_std_logic_vector(a_cval2, 9) then
          a_cnt <= unsigned(a_cnt) + '1';
        else
          a_cnt <= (others => '0');
        end if;
      end if;
    end if;
  end process;
addr_cnt <= a_cnt;
  process(Reset, a_cnt, a_cval2, numsp1)
  begin
    if Reset = '1' then
      e_n <= '0';
      a_grt <= '0';
    elsif a_cnt < numsp1 then
      e_n <= '0';
      a_grt <= '0';
    elsif a_cnt < integer_to_std_logic_vector(a_cval2, 9) then
      e_n <= '1';
      a_grt <= '0';
    else
      e_n <= '1';
      a_grt <= '1';
    end if;
  end process;
ext_node <= e_n;
end opt_sel_beh;
------------------------------------------------------
-- Package for defining ieee std_logic primitives
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : pack.vhd
-- Entity       : NA
-- Architecture : NA
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
--
-- package for defining ieee std_logic primitives
--
package std_logic_prims is

  -- constant definitions
  constant bit_width   : integer := 63;
  constant bit_widthx2 : integer := 127;

  --
  -- function std_logic_vector_to_integer converts its
  -- std_logic_vector argument ibus, assumed to consist of '0'
  -- and '1' elements only, into an integer.
  --
  function std_logic_vector_to_integer(ibus: in std_logic_vector)
    return integer;

  --
  -- function integer_to_std_logic_vector converts its integer
  -- argument val into a std_logic_vector of range (n downto 0).
  --
  function integer_to_std_logic_vector(val, n: in integer)
    return std_logic_vector;
end std_logic_prims;
package body std_logic_prims is

  --
  -- function std_logic_vector_to_integer converts its
  -- std_logic_vector argument ibus, assumed to consist of '0'
  -- and '1' elements only, into an integer.
  --
  function std_logic_vector_to_integer(ibus: in std_logic_vector)
    return integer is
    variable result: integer := 0;
  begin
    for i in ibus'high downto 0 loop
      result := result * 2;
      if ibus(i) = '1' then
        result := result + 1;
      end if;
    end loop;
    return result;
  end std_logic_vector_to_integer;

  --
  -- function integer_to_std_logic_vector converts its integer
  -- argument val into a std_logic_vector of range (n downto 0).
  --
  function integer_to_std_logic_vector(val, n: in integer)
    return std_logic_vector is
    variable result: std_logic_vector(n downto 0);
    variable ival: integer := val;
  begin
    for i in 0 to n loop
      if (ival mod 2) = 1 then
        result(i) := '1';
      else
        result(i) := '0';
      end if;
      ival := ival / 2;
    end loop;
    return result;
  end integer_to_std_logic_vector;
end std_logic_prims;
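For reference, the bit ordering assumed by the two conversion functions in std_logic_prims (index 0 is the least-significant bit, matching the VHDL result range (n downto 0), so a call with n = 9 yields a 10-bit vector) can be checked with a small Python model. This is an illustrative sketch for the reader, not part of the thesis design:

```python
# Python model of the std_logic_prims conversion functions.
# A vector is modeled as a list of '0'/'1' characters with
# index 0 = LSB, mirroring a VHDL (n downto 0) std_logic_vector.

def std_logic_vector_to_integer(ibus):
    """Convert a bit-character list (LSB at index 0) to an integer."""
    result = 0
    # Walk from the MSB down to index 0, as the VHDL loop does.
    for i in range(len(ibus) - 1, -1, -1):
        result = result * 2
        if ibus[i] == '1':
            result = result + 1
    return result

def integer_to_std_logic_vector(val, n):
    """Convert val to a list of n+1 bits (indices 0..n), LSB first."""
    result = []
    for _ in range(n + 1):
        result.append('1' if val % 2 == 1 else '0')
        val = val // 2
    return result
```

Round-tripping a value, e.g. `std_logic_vector_to_integer(integer_to_std_logic_vector(300, 9))`, returns 300, which confirms that the two functions use a consistent LSB-at-index-0 convention.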
--------------------------------------------------------------------------
-- Entity       : PE
--
-- Architecture : pe_upgma_arch
--
-- Author       : Sreesa Akella
--
-- Filename     : pe_upgma_arch.vhd
--
-- Description  : PE architecture that implements the UPGMA design
--------------------------------------------------------------------------
------------------------------- Glossary ----------------------------------
--
-- Name Key:
-- =========
-- _AS : Address Strobe
-- _CE : Clock Enable
-- _CS : Chip Select
-- _DS : Data Strobe
-- _EN : Enable
-- _OE : Output Enable
-- _RD : Read Select
-- _WE : Write Enable
-- _WR : Write Select
-- _d[d...] : Delayed (registered) signal (each 'd' denotes one
--            level of delay)
-- _n : Active low signals (must be last part of name)
--
-- Port Name                        Dir  Description
-- ==============================   ===  ================================
-- Pads.Clocks.F_Clk                 I   Frequency synthesizer clock
-- Pads.Clocks.M_Clk                 I   Memory clock
-- Pads.Clocks.P_Clk                 I   Processor clock
-- Pads.Clocks.K_Clk                 I   LAD-bus clock
-- Pads.Clocks.IO_Clk                I   External I/O connector clock
-- Pads.Clocks.M_Clk_Out_Pe          O   M_Clk to the PE
-- Pads.Clocks.M_Clk_Out_CB_Ctrl     O   M_Clk to the CardBus controller
-- Pads.Clocks.M_Clk_Out_Right_Mem   O   M_Clk to the right memory bank
-- Pads.Clocks.M_Clk_Out_Left_Mem    O   M_Clk to the left memory bank
-- Pads.Clocks.P_Clk_Out_Pe          O   P_Clk to the PE
-- Pads.Clocks.P_Clk_Out_CB_Ctrl     O   P_Clk to the CardBus controller
-- Pads.Reset                        I   Global PE reset
-- Pads.Audio                        O   Pulse-width modulated audio pad
-- Pads.LAD_Bus.Addr_Data            B   LAD-bus shared address/data bus
-- Pads.LAD_Bus.AS_n                 I   LAD-bus address strobe
-- Pads.LAD_Bus.DS_n                 I   LAD-bus data strobe
-- Pads.LAD_Bus.Ack_n                O   LAD-bus acknowledge strobe
-- Pads.LAD_Bus.Reg_n                I   LAD-bus register select
-- Pads.LAD_Bus.WR_n                 I   LAD-bus write select
-- Pads.LAD_Bus.CS_n                 I   LAD-bus chip select
-- Pads.LAD_Bus.Int_Req_n            O   LAD-bus interrupt request
-- Pads.LAD_Bus.DMA_0_Data_OK_n      O   LAD-bus DMA chan 0 data OK flag
-- Pads.LAD_Bus.DMA_0_Burst_OK_n     O   LAD-bus DMA chan 0 burst OK flag
-- Pads.LAD_Bus.DMA_1_Data_OK_n      O   LAD-bus DMA chan 1 data OK flag
-- Pads.LAD_Bus.DMA_1_Burst_OK_n     O   LAD-bus DMA chan 1 burst OK flag
-- Pads.LAD_Bus.Reg_Data_OK_n        O   LAD-bus reg space data OK flag
-- Pads.LAD_Bus.Reg_Burst_OK_n       O   LAD-bus reg space burst OK flag
-- Pads.LAD_Bus.Force_K_Clk_n        O   LAD-bus K_Clk forced-run select
-- Pads.LAD_Bus.Reserved             -   Reserved for future use
-- Pads.Left_Mem.Addr                O   Left memory address bus
-- Pads.Left_Mem.Data                B   Left memory data bus
-- Pads.Left_Mem.Byte_WR_n           O   Left memory byte write select
-- Pads.Left_Mem.CS_n                O   Left memory chip select
-- Pads.Left_Mem.CE_n                O   Left memory clock enable
-- Pads.Left_Mem.WE_n                O   Left memory write enable
-- Pads.Left_Mem.OE_n                O   Left memory output enable
-- Pads.Left_Mem.Sleep_EN            O   Left memory sleep enable
-- Pads.Left_Mem.Load_EN_n           O   Left memory load enable
-- Pads.Left_Mem.Burst_Mode          O   Left memory burst mode select
-- Pads.Right_Mem.Addr               O   Right memory address bus
-- Pads.Right_Mem.Data               B   Right memory data bus
-- Pads.Right_Mem.Byte_WR_n          O   Right memory byte write select
-- Pads.Right_Mem.CS_n               O   Right memory chip select
-- Pads.Right_Mem.CE_n               O   Right memory clock enable
-- Pads.Right_Mem.WE_n               O   Right memory write enable
-- Pads.Right_Mem.OE_n               O   Right memory output enable
-- Pads.Right_Mem.Sleep_EN           O   Right memory sleep enable
-- Pads.Right_Mem.Load_EN_n          O   Right memory load enable
-- Pads.Right_Mem.Burst_Mode         O   Right memory burst mode select
-- Pads.Left_IO                      B   Left external I/O connector
-- Pads.Right_IO                     B   Right external I/O connector
--------------------------------------------------------------------------
-------------------------- Library Declarations ------------------------
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use work.std_logic_prims.all;

library PE_Lib;
use PE_Lib.PE_Package.all;

library LAD_Mux_Lib;
use LAD_Mux_Lib.LAD_Mux_Pkg.all;
use LAD_Mux_Lib.LAD_Mem32_Mux_Pkg.all;

library Mem_Mux_Lib;
use Mem_Mux_Lib.Mem32_Mux_Pkg.all;

library DMA_Mux_Lib;
use DMA_Mux_Lib.DMA_Mux_Pkg.all;
use DMA_Mux_Lib.DMA_LAD_Mem32_Mux_Pkg.all;
------------------------ Architecture Declaration ----------------------
architecture pe_upgma_arch of PE is
  ------------------------------- Glossary -----------------------------
  --
  -- Name Key:
  -- =========
  -- _AS : Address Strobe
  -- _CB : CardBus
  -- _CE : Clock Enable
  -- _CS : Chip Select
  -- _DS : Data Strobe
  -- _EN : Enable
  -- _OE : Output Enable
  -- _PE : Processing Element
  -- _RD : Read Select
  -- _WE : Write Enable
  -- _WR : Write Select
  -- _d[d...] : Delayed (registered) signal (each 'd' denotes one
  --            level of delay)
  -- _n : Active low signals (must be last part of name)
  --
  -- Name                         Width  Dir  Description
  -- =========================    =====  ===  ============================
  -- Clocks_In.F_Clk                1     I   Frequency synthesizer clock
  -- Clocks_In.M_Clk                1     I   Memory clock
  -- Clocks_In.P_Clk                1     I   Processing element clock
  -- Clocks_In.K_Clk                1     I   LAD-bus clock
  -- Clocks_In.F_Clk_Locked         1     I   U_Clk CLKDLL locked flag
  -- Clocks_In.M_Clk_Locked         1     I   M_Clk CLKDLL locked flag
  -- Clocks_In.P_Clk_Locked         1     I   P_Clk CLKDLL locked flag
  -- Global_Reset                   1     I   Global reset (or set) signal
  -- Audio_Out                      1     O   Pulse-width modulated audio
  --                                          output
  -- LAD_Mux_Bus(x).Addr           20     I   LAD bus DWORD address bus
  --                                          input
  -- LAD_Mux_Bus(x).Write           1     I   LAD bus write select
  -- LAD_Mux_Bus(x).Strobe          1     I   LAD bus register access strobe
  -- LAD_Mux_Bus(x).Mem_Strobe      1     I   LAD bus memory access strobe
  -- LAD_Mux_Bus(x).DMA_0_Strobe    1     I   LAD bus DMA channel 0 access
  --                                          strobe
  -- LAD_Mux_Bus(x).DMA_1_Strobe    1     I   LAD bus DMA channel 1 access
  --                                          strobe
  -- LAD_Mux_Bus(x).DMA_0_Done      1     I   DMA CH0 Completed signal
  -- LAD_Mux_Bus(x).DMA_1_Done      1     I   DMA CH1 Completed signal
  -- LAD_Mux_Bus(x).Reset           1     I   LAD bus reset signal
  -- LAD_Mux_Bus(x).Data_In        32     I   LAD bus data bus input
  -- LAD_Mux_Bus(x).Data_Out       32     O   LAD bus data bus output
  -- LAD_Mux_Bus(x).Akk             1     O   LAD bus transaction
  --                                          acknowledge
  -- LAD_Mux_Bus(x).Int_Req         1     O   LAD bus interrupt request
  -- LAD_Mux_Bus(x).DMA_0_Stat      2     O   LAD bus DMA Channel 0 status
  --                                          flags
  -- LAD_Mux_Bus(x).DMA_1_Stat      2     O   LAD bus DMA Channel 1 status
  --                                          flags
  --
  -- Left_Mem_Mux(x).Addr          32     O   Left on-board memory
  --                                          address bus
  -- Left_Mem_Mux(x).Write          1     O   Left on-board memory write
  --                                          select
  -- Left_Mem_Mux(x).Data_Out      32     O   Left on-board memory output
  --                                          data bus
  -- Left_Mem_Mux(x).Req            1     O   Left on-board memory access
  --                                          request
  -- Left_Mem_Mux(x).Akk            1     O   Left on-board memory access
  --                                          acknowledge
  -- Left_Mem_Mux(x).Data_In       32     I   Left on-board memory input
  --                                          data bus
  -- Left_Mem_Mux(x).Data_Valid     1     I   Left on-board memory valid
  --                                          read flag
  --
  -- Right_Mem_Mux(x).Addr         32     O   Right on-board memory
  --                                          address bus
  -- Right_Mem_Mux(x).Write         1     O   Right on-board memory write
  --                                          select
  -- Right_Mem_Mux(x).Data_Out     32     O   Right on-board memory output
  --                                          data bus
  -- Right_Mem_Mux(x).Req           1     O   Right on-board memory access
  --                                          request
  -- Right_Mem_Mux(x).Akk           1     O   Right on-board memory access
  --                                          acknowledge
  -- Right_Mem_Mux(x).Data_In      32     I   Right on-board memory input
  --                                          data bus
  -- Right_Mem_Mux(x).Data_Valid    1     I   Right on-board memory valid
  --                                          read flag
  --
  -- Left_IO_In.Data_In            13     I   Left I/O connector data
  --                                          input
  -- Left_IO_Out.Data_Out          13     O   Left I/O connector data
  --                                          output
  -- Left_IO_Out.Data_OE_n         13     O   Left I/O connector data
  --                                          output enable
  -- Right_IO_In.Data_In           13     I   Right I/O connector data
  --                                          input
  -- Right_IO_Out.Data_Out         13     O   Right I/O connector data
  --                                          output
  -- Right_IO_Out.Data_OE_n        13     O   Right I/O connector data
  --                                          output enable
  --
  ----------------------------------------------------------------------
  ----------------------------------------------------------------------
  --
  -- Below are all of the standard PE pad interface signals. Simply
  -- uncomment the signal(s) that are needed by the PE design. All
  -- other unused signals may remain commented out. Be sure to
  -- uncomment any component instances used by the interface.
  --
  ----------------------------------------------------------------------
  signal Clocks_In    : Clock_Std_IF_In_Type;
  signal Global_Reset : Reset_Std_IF_In_Type := '0';
  -- signal Audio_Out    : Audio_Std_IF_Out_Type;
  -- signal Left_IO_In   : IO_Conn_Std_IF_In_Type;
  -- signal Left_IO_Out  : IO_Conn_Std_IF_Out_Type;
  -- signal Right_IO_In  : IO_Conn_Std_IF_In_Type;
  -- signal Right_IO_Out : IO_Conn_Std_IF_Out_Type;

  ----------------------------------------------------------------------
  --
  -- Below are all of the multiplexing PE pad interface signals. Simply
  -- uncomment the signal(s) that are needed by the PE design and
  -- increase the vector sizes as needed. All other unused signals may
  -- remain commented out. Be sure to uncomment any component
  -- instances used by the interface.
  --
  ----------------------------------------------------------------------
  signal LAD_Mux_Bus   : LAD_Mux_vector(0 to 2);
  signal Left_Mem_Mux  : Mem32_Mux_vector(0 to 1);
  signal Right_Mem_Mux : Mem32_Mux_vector(0 to 1);
  signal LAD_Regs      : LAD_Mux_register_vector(0 to 1);
  ----------------------------------------------------------------------
  -- Component declaration of UPGMA top component
  ----------------------------------------------------------------------
  component upgma_top is
    port(clk         : in std_logic;
         Reset       : in std_logic;
         Data_in     : in std_logic_vector(31 downto 0);
         valid_numsp : in std_logic;
         valid_dst   : in std_logic;
         addr_cnt    : out std_logic_vector(9 downto 0);
         row_cnt     : out std_logic_vector(9 downto 0);
         ext_node    : out std_logic;
         rmem_read   : out std_logic;
         rmem_write  : out std_logic;
         ad_reg_en   : out std_logic;
         ad_reg_clr  : out std_logic;
         row_zero    : out std_logic;
         mem_addr    : out std_logic_vector(19 downto 0);
         read_mem    : out std_logic;
         write_mem   : out std_logic;
         numsp_1     : out std_logic_vector(9 downto 0);
         numsp       : in std_logic_vector(9 downto 0);
         avg_dst     : out std_logic_vector(31 downto 0);
         trout       : out std_logic_vector(36 downto 0);
         valid_td    : out std_logic;
         done        : out std_logic);
  end component;
  ----------------------------------------------------------------------
  -- UPGMA top component signal declarations
  ----------------------------------------------------------------------
  signal Data_in     : std_logic_vector(31 downto 0);
  signal valid_numsp : std_logic;
  signal valid_dst   : std_logic;
  signal mem_addr, mem_addr_t : std_logic_vector(19 downto 0);
  signal read_mem    : std_logic;
  signal write_mem   : std_logic;
  signal numsp       : std_logic_vector(9 downto 0);
  signal avg_dst     : std_logic_vector(31 downto 0);
  signal trout       : std_logic_vector(36 downto 0);
  signal valid_td    : std_logic;
  signal done        : std_logic;
  signal val_nsp_d   : std_logic;
  signal val_nsp_dd  : std_logic;
  signal done_d      : std_logic;
  signal rmem_read   : std_logic;
  signal rmem_write  : std_logic;
  signal ad_reg_en   : std_logic;
  signal ad_reg_clr  : std_logic;
  signal row_zero    : std_logic;
  signal ext_node    : std_logic;
  signal row_cnt, addr_cnt : std_logic_vector(9 downto 0);
  signal numsp_1     : std_logic_vector(9 downto 0);

  ----------------------------------------------------------------------
  -- Address modification component
  ----------------------------------------------------------------------
  component addr_mod
    port(Clk        : in std_logic;
         Reset      : in std_logic;
         addr_cnt   : in std_logic_vector(9 downto 0);
         row_cnt    : in std_logic_vector(9 downto 0);
         mem_addr   : in std_logic_vector(19 downto 0);
         numsp      : in std_logic_vector(9 downto 0);
         numsp_1    : in std_logic_vector(9 downto 0);
         ext_node   : in std_logic;
         rmem_read  : in std_logic;
         rmem_write : in std_logic;
         ad_reg_en  : in std_logic;
         ad_reg_clr : in std_logic;
         row_zero   : in std_logic;
         addr_modif : out std_logic_vector(15 downto 0));
  end component;

  ----------------------------------------------------------------------
  -- Address modification component signals
  ----------------------------------------------------------------------
  signal addr_modif : std_logic_vector(15 downto 0);
  ----------------------------------------------------------------------
  -- Memory input signals
  -- Left memory and right memory input signals
  ----------------------------------------------------------------------
  signal left_mem_data_Addr      : std_logic_vector(31 downto 0);
  signal left_mem_data_Write     : std_logic;
  signal left_mem_data_Data_Out  : std_logic_vector(31 downto 0);
  signal left_mem_data_Req       : std_logic;
  signal right_mem_data_Addr     : std_logic_vector(31 downto 0);
  signal right_mem_data_Write    : std_logic;
  signal right_mem_data_Data_Out : std_logic_vector(31 downto 0);
  signal right_mem_data_Req      : std_logic;
  signal en : std_logic;

begin

  ----------------------------------------------------------------------------
  --
  -- The following two components create a block RAM bridge from the LAD
  -- bus to the onboard left and right memories. Use the
  -- LAD_Mem_Bridge.c/.h source files to write data to these components
  -- from the host.
  --
  -- Each component needs a unique LAD_Mux_Bus, and either a unique
  -- Left_Mem_Mux or Right_Mem_Mux.
  --
  -- The Left Memory is located at address 0x1000 and the Right Memory
  -- at address 0x1200.
  --
  -- Physically these addresses come into the PE as 0x5000 and 0x5200;
  -- however, the 0x4000 REGISTER base is subtracted from the address
  -- if the USE_OLD_ADDRESSES generic is FALSE. This new address scheme
  -- was added to make the WC_PeRegRead and WC_PeRegWrite addresses match
  -- the addresses in the VHDL code.
  --
  ----------------------------------------------------------------------------
  U_Left_Bridge : LAD_Mem32_Bridge
    generic map ( BASE => x"1000" )
    port map
    (
      Kclk => Clocks_In.K_Clk,
      LAD  => LAD_Mux_Bus(1),
      Mclk => Clocks_In.M_Clk,
      Mem  => Left_Mem_Mux(0)
    );

  U_Right_Bridge : LAD_Mem32_Bridge
    generic map ( BASE => x"1200" )
    port map
    (
      Kclk => Clocks_In.K_Clk,
      LAD  => LAD_Mux_Bus(0),
      Mclk => Clocks_In.M_Clk,
      Mem  => Right_Mem_Mux(0)
    );

  ----------------------------------------------------------------------------
  --
  -- Instantiate a LAD_Mux_Register file of size 1.
  -- This single 32-bit register stores the value of numsp.
  --
  ----------------------------------------------------------------------------
  U_LAD_Mux_Reg : LAD_Mux_RegFile
    generic map
    (
      BASE  => x"2000",
      L2NUM => 1
    )
    port map
    (
      Kclk => Clocks_In.K_Clk,
      LAD  => LAD_Mux_Bus(2),
      Regs => LAD_Regs
    );
-- Tie the register output to the input so it can be read back
LAD_Regs(0).Data_Out <= LAD_Regs(0).Data_In;
  ----------------------------------------------------------------------
  --
  -- Below are all of the standard PE pad interface components. Simply
  -- uncomment the interface(s) that are needed by the PE design. All
  -- other unused interfaces may remain commented out. Be sure to
  -- uncomment any signal declarations used by the interface.
  --
  ----------------------------------------------------------------------

  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  --@@
  --@@ CLOCK STANDARD Interface. Uncomment this component
  --@@ to use K, M, P and F clocks. (This should almost
  --@@ always be uncommented.)
  --@@
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  U_Clocks : Clock_Std_IF
    generic map
    (
      USE_EXT_P_CLK_SOURCE => FALSE,
      REVISION             => REVD
    )
    port map
    (
      Global_Reset => Global_Reset,
      Pads         => Pads.Clocks,
      User_In      => Clocks_In
    );

  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  --@@
  --@@ LAD MUX INTERFACE
  --@@
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  U_LAD_MUX : LAD_Mux_IF
    generic map ( USE_OLD_ADDRESSES => FALSE )
    port map
    (
      Kclk    => Clocks_In.K_Clk,
      Reset   => Global_Reset,
      Pads    => Pads.LAD_Bus,
      Clients => LAD_Mux_Bus
    );

  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  --@@
  --@@ LEFT MEMORY MUX INTERFACE : The two interfaces below
  --@@ are mutually exclusive. Uncomment either the
  --@@ Mem32_Mux_Priority_IF or the Mem32_Mux_Fair_IF
  --@@ to use the left memory bank.
  --@@
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  U_Left_Mem_Mux : Mem32_Mux_Priority_IF
    generic map
    (
      AVOID_OVERFLOW  => TRUE,
      NUM_AKK_FIFOS   => 0,
      ROTATE_PRIORITY => FALSE,
      STICKY_PRIORITY => FALSE,
      REGISTERED_AKKS => FALSE,
      REGISTERED_REQS => FALSE
    )
    port map
    (
      Mclk    => Clocks_In.M_Clk,
      Reset   => Global_Reset,
      Pads    => Pads.Left_Mem,
      Clients => Left_Mem_Mux
    );
  -- U_Left_Mem_Mux : Mem32_Mux_Fair_IF
  --   generic map
  --   (
  --     AVOID_OVERFLOW => TRUE,
  --     REGISTER_DATA  => FALSE
  --   )
  --   port map
  --   (
  --     Mclk    => Clocks_In.M_Clk,
  --     Reset   => Global_Reset,
  --     Pads    => Pads.Left_Mem,
  --     Clients => Left_Mem_Mux
  --   );

  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  --@@
  --@@ RIGHT MEMORY MUX INTERFACE : The two interfaces below
  --@@ are mutually exclusive. Uncomment either the
  --@@ Mem32_Mux_Priority_IF or the Mem32_Mux_Fair_IF
  --@@ to use the right memory bank.
  --@@
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  U_Right_Mem_Mux_IF : Mem32_Mux_Priority_IF
    generic map
    (
      AVOID_OVERFLOW  => TRUE,
      NUM_AKK_FIFOS   => 0,
      ROTATE_PRIORITY => FALSE,
      STICKY_PRIORITY => FALSE,
      REGISTERED_AKKS => FALSE,
      REGISTERED_REQS => FALSE
    )
    port map
    (
      Mclk    => Clocks_In.M_Clk,
      Reset   => Global_Reset,
      Pads    => Pads.Right_Mem,
      Clients => Right_Mem_Mux
    );

  -- U_Right_Mem_Mux : Mem32_Mux_Fair_IF
  --   generic map
  --   (
  --     AVOID_OVERFLOW => TRUE,
  --     REGISTER_DATA  => FALSE
  --   )
  --   port map
  --   (
  --     Mclk    => Clocks_In.M_Clk,
  --     Reset   => Global_Reset,
  --     Pads    => Pads.Right_Mem,
  --     Clients => Right_Mem_Mux
  --   );

  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  --@@
  --@@ RESET INTERFACE : The following component provides
  --@@ a global reset to the entire PE. The Global_Reset
  --@@ signal is also tied to the GSR port of the
  --@@ STARTUP_VIRTEX. This component should almost
  --@@ always be uncommented.
  --@@
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  U_Reset : Reset_Std_IF
    port map
    (
      Clk     => Clocks_In.K_Clk,
      Pads    => Pads.Reset,
      User_In => Global_Reset
    );
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  --@@
  --@@ UPGMA top component: The following component
  --@@ reads the distance data from the left and
  --@@ right memories and reconstructs the phyloge-
  --@@ netic tree using the UPGMA algorithm
  --@@
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  U_UPGMA_TOP : upgma_top
    port map
    (
      clk         => Clocks_In.P_Clk,
      Reset       => Global_Reset,
      Data_in     => Data_in,
      valid_numsp => valid_numsp,
      valid_dst   => valid_dst,
      addr_cnt    => addr_cnt,
      row_cnt     => row_cnt,
      ext_node    => ext_node,
      rmem_read   => rmem_read,
      rmem_write  => rmem_write,
      ad_reg_en   => ad_reg_en,
      ad_reg_clr  => ad_reg_clr,
      row_zero    => row_zero,
      mem_addr    => mem_addr,
      read_mem    => read_mem,
      write_mem   => write_mem,
      numsp_1     => numsp_1,
      numsp       => numsp,
      avg_dst     => avg_dst,
      trout       => trout,
      valid_td    => valid_td,
      done        => done
    );

  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  --@@
  --@@ Data_in and valid_numsp assignment
  --@@
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  valid_numsp <= val_nsp_d and not val_nsp_dd;

  process ( Global_Reset, Clocks_In.P_Clk )
  begin
    if ( Global_Reset = '1' ) then
      numsp <= (others => '0');
      val_nsp_d <= '0';
      val_nsp_dd <= '0';
    elsif ( rising_edge ( Clocks_In.P_Clk ) ) then
      val_nsp_d <= '0';
      if (LAD_Regs(0).Strobe = '1') then
        numsp <= LAD_Regs(0).Data_In(9 downto 0);
        val_nsp_d <= '1';
      end if;
      val_nsp_dd <= val_nsp_d;
    end if;
  end process;
  process ( Global_Reset, Clocks_In.K_Clk )
  begin
    if ( Global_Reset = '1' ) then
      done_d <= '0';
    elsif ( rising_edge ( Clocks_In.K_Clk ) ) then
      done_d <= done;
    end if;
  end process;

  ---------------------------------------------------------
  --
  -- Data_in
  --
  ---------------------------------------------------------
  process ( Global_Reset, Clocks_In.P_Clk )
  begin
    if ( Global_Reset = '1' ) then
      valid_dst <= '0';
      Data_in <= (others => '0');
    elsif ( rising_edge ( Clocks_In.P_Clk ) ) then
      if (Left_Mem_Mux(1).Data_Valid = '1') then
        valid_dst <= Left_Mem_Mux(1).Data_Valid;
        Data_in <= Left_Mem_Mux(1).Data_In;
      end if;
    end if;
  end process;

  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  --@@
  --@@ Store the done signal in PE registers
  --@@ for the host to poll and read
  --@@
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
process ( Global_Reset, Clocks_In.K_Clk )
begin
  if ( Global_Reset = '1' ) then
    LAD_Regs(1).Data_out <= (others => '0');
  elsif ( rising_edge ( Clocks_In.K_Clk ) ) then
    if (done_d = '1') then
      LAD_Regs(1).Data_out <= "00000000000000000000000000000001";
    end if;
  end if;
end process;

--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--@@
--@@ Address modification component instantiation
--@@
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
U_addr_mod : addr_mod port map (Clocks_In.P_Clk, Global_Reset,
addr_cnt, row_cnt, mem_addr,
numsp, numsp_1, ext_node,
rmem_read, rmem_write, ad_reg_en, ad_reg_clr,
row_zero, addr_modif );

--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--@@
--@@ Assign left_mem_data and right_mem_data
--@@ with data from UPGMA design - avg_dst and
--@@ tree_data respectively
--@@
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
process ( Global_Reset, Clocks_In.M_Clk )
  variable temp : integer;
begin
  if ( Global_Reset = '1' ) then
    left_mem_data_Addr     <= (others => '0');
    left_mem_data_Write    <= '0';
    left_mem_data_Data_Out <= (others => '0');
    left_mem_data_Req      <= '0';
    temp := 0;
  elsif ( rising_edge ( Clocks_In.M_Clk ) ) then
    temp := std_logic_vector_to_integer(addr_modif);
    left_mem_data_Addr     <= integer_to_std_logic_vector(temp, 31);
    left_mem_data_Write    <= write_mem;
    left_mem_data_Data_Out <= avg_dst;
    left_mem_data_Req      <= (write_mem or read_mem);
  end if;
end process;

process ( Global_Reset, Clocks_In.M_Clk )
begin
  if ( Global_Reset = '1' ) then
    right_mem_data_Addr     <= (others => '0');
    right_mem_data_Write    <= '0';
    right_mem_data_Data_Out <= (others => '0');
    right_mem_data_Req      <= '0';
  elsif ( rising_edge ( Clocks_In.M_Clk ) ) then
    right_mem_data_Addr     <= "0000000000000000000000" & trout(35 downto 26);
    right_mem_data_Write    <= valid_td;
    right_mem_data_Data_Out <= trout(35 downto 16) & trout(11 downto 0);
    right_mem_data_Req      <= valid_td;
  end if;
end process;

--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--@@
--@@ Assign left_mem_mux(1) and right_mem_mux(1)
--@@ with left_mem_data and right_mem_data
--@@ respectively
--@@
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Left_Mem_Mux(1).Addr      <= left_mem_data_Addr;
Left_Mem_Mux(1).Write     <= left_mem_data_Write;
Left_Mem_Mux(1).Data_Out  <= left_mem_data_Data_Out;  -- when Left_Mem_Mux(1).Akk = '0' else Left_Mem_Mux(1).Data_out;
Left_Mem_Mux(1).Req       <= left_mem_data_Req;
Right_Mem_Mux(1).Addr     <= Right_mem_data_Addr;
Right_Mem_Mux(1).Write    <= Right_mem_data_Write;
Right_Mem_Mux(1).Data_Out <= Right_mem_data_Data_Out;  -- when Right_Mem_Mux(1).Akk = '0' else Right_Mem_Mux(1).Data_out;
Right_Mem_Mux(1).Req      <= Right_mem_data_Req;

-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
-- --@@
-- --@@ LEFT I/O CONNECTOR INTERFACE : The following
-- --@@ component provides an interface to the left I/O
-- --@@ connector on the WILDCARD(tm). Uncomment the
-- --@@ interface below to use the left I/O connector.
-- --@@
-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--
-- U_Left_IO : IO_Conn_Std_IF
--   port map
--   (
--     Pads     => Pads.Left_IO,
--     User_In  => Left_IO_In,
--     User_Out => Left_IO_Out
--   );
--
-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
-- --@@
-- --@@ RIGHT I/O CONNECTOR INTERFACE : The following
-- --@@ component provides an interface to the right I/O
-- --@@ connector on the WILDCARD(tm). Uncomment the
-- --@@ interface below to use the right I/O connector.
-- --@@
-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--
-- U_Right_IO : IO_Conn_Std_IF
--   port map
--   (
--     Pads     => Pads.Right_IO,
--     User_In  => Right_IO_In,
--     User_Out => Right_IO_Out
--   );
--
-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
-- --@@
-- --@@ AUDIO INTERFACE : Uncomment the following
-- --@@ interface to use the audio port.
-- --@@
-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--
-- U_Audio : Audio_Std_IF
--   port map
--   (
--     Clk          => Clocks_In.K_Clk,
--     Global_Reset => Global_Reset,
--     Pads         => Pads.Audio,
--     User_Out     => Audio_Out
--   );
----------------------------------------------------------------------
-- NOTE : The following line must remain in all designs
--        to ensure that all of the PE pads are driven.
----------------------------------------------------------------------
Init_PE_Pads ( Pads );
end architecture;
------------------------------------------------------
-- 32-bit Register entity - architecture
------------------------------------------------------
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : reg.vhd
-- Entity       : reg
-- Architecture : reg_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity reg is
  port(
    data   : in  std_logic_vector(31 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(31 downto 0)
  );
end reg;

architecture reg_beh of reg is
begin
  process(clk, reset)
  begin
    if ( reset = '1' ) then
      regval <= (others => '0');
    elsif ( rising_edge( clk ) ) then
      if regclr = '1' then
        regval <= (others => '0');
      elsif regen = '1' then
        regval <= data;
      end if;
    end if;
  end process;
end reg_beh;
------------------------------------------------------
-- 10-bit Register entity - architecture
------------------------------------------------------
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : reg2.vhd
-- Entity       : reg_2
-- Architecture : reg2_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity reg_2 is
  port(
    data   : in  std_logic_vector(9 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(9 downto 0)
  );
end reg_2;

architecture reg2_beh of reg_2 is
begin
  process(clk, reset)
  begin
    if ( reset = '1' ) then
      regval <= (others => '0');
    elsif ( rising_edge( clk ) ) then
      if regclr = '1' then
        regval <= (others => '0');
      elsif regen = '1' then
        regval <= data;
      end if;
    end if;
  end process;
end reg2_beh;
------------------------------------------------------
-- 16-bit Register entity - architecture
------------------------------------------------------
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : reg3.vhd
-- Entity       : reg_3
-- Architecture : reg3_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity reg_3 is
  port(
    data   : in  std_logic_vector(15 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(15 downto 0)
  );
end reg_3;

architecture reg3_beh of reg_3 is
begin
  process(clk, reset)
  begin
    if ( reset = '1' ) then
      regval <= (others => '0');
    elsif ( rising_edge( clk ) ) then
      if regclr = '1' then
        regval <= (others => '0');
      elsif regen = '1' then
        regval <= data;
      end if;
    end if;
  end process;
end reg3_beh;
------------------------------------------------------
-- Top Component UPGMA entity - architecture
------------------------------------------------------
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : upgma_top.vhd
-- Entity       : upgma_top
-- Architecture : upgma_struct
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use work.std_logic_prims.all;

entity upgma_top is
  port(
    clk         : in  std_logic;
    Reset       : in  std_logic;
    Data_in     : in  std_logic_vector(31 downto 0);
    valid_numsp : in  std_logic;
    valid_dst   : in  std_logic;
    addr_cnt    : out std_logic_vector(9 downto 0);
    row_cnt     : out std_logic_vector(9 downto 0);
    ext_node    : out std_logic;
    rmem_read   : out std_logic;
    rmem_write  : out std_logic;
    ad_reg_en   : out std_logic;
    ad_reg_clr  : out std_logic;
    row_zero    : out std_logic;
    mem_addr    : out std_logic_vector(19 downto 0);
    read_mem    : out std_logic;
    write_mem   : out std_logic;
    numsp_1     : out std_logic_vector(9 downto 0);
    numsp       : in  std_logic_vector(9 downto 0);
    avg_dst     : out std_logic_vector(31 downto 0);
    trout       : out std_logic_vector(36 downto 0);
    valid_td    : out std_logic;
    done        : out std_logic
  );
end upgma_top;
architecture upgma_struct of upgma_top is
component adder
  port(
    Datainp1 : in  std_logic_vector(31 downto 0);
    Datainp2 : in  std_logic_vector(31 downto 0);
    Data_out : out std_logic_vector(31 downto 0)
  );
end component;

component adder_w
  port(
    Datainp1 : in  std_logic_vector(15 downto 0);
    Datainp2 : in  std_logic_vector(15 downto 0);
    Data_out : out std_logic_vector(15 downto 0)
  );
end component;

component adderwreg
  port(
    addout : in  std_logic_vector(15 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(15 downto 0)
  );
end component;
component ctrl_blk
  port(
    clk, reset, valid_numsp, addr_grt          : in  std_logic;
    child_cnt_gr, count_gr, all_nodes_done     : in  std_logic;
    initialized, div_valid, a_grt, ext_node    : in  std_logic;
    r_clr, r_inc, a_clr, a_inc                 : out std_logic;
    c2_read, mem_update, R_dec, Rp_dec         : out std_logic;
    c1_incr, c2_incr, c1p_incr, ch_incr        : out std_logic;
    c1_load1, c1_load2, c2_load1, c2_load2     : out std_logic;
    c1p_load, ch_load                          : out std_logic;
    c1_clr, c2_clr, c1p_clr, ch_clr            : out std_logic;
    row_col_sel                                : out std_logic_vector(1 downto 0);
    addr2_reg_dec, node_write                  : out std_logic;
    read_mem, write_mem, read_wmem, write_wmem : out std_logic;
    rowreg_clr, colreg_clr, distreg_clr        : out std_logic;
    mulreg_en, mulreg_clr                      : out std_logic;
    addregwclr, addregwen, addregclr, addregen : out std_logic;
    divreg1clr, divreg1en                      : out std_logic;
    initial_run, store_cur_addr                : out std_logic;
    node_mem_initialize, mem_initialize        : out std_logic;
    addr_gen1_en, addr_gen2_en                 : out std_logic;
    rmem_read, rmem_write                      : out std_logic;
    ad_reg_en, ad_reg_clr, row_zero            : out std_logic;
    numsp_val, valid_td                        : out std_logic;
    nodeid_sel                                 : out std_logic_vector(1 downto 0);
    n_type_sel, incnt_inc, done                : out std_logic
  );
end component;
component opt_sel
  port(
    clk, reset, valid_numsp                    : in  std_logic;
    numsp, row_reg, col_reg                    : in  std_logic_vector(9 downto 0);
    store_cur_addr, node_mem_initialize        : in  std_logic;
    mem_initialize, addr_gen1_en, addr_gen2_en : in  std_logic;
    c2_read, mem_update                        : in  std_logic;
    addr1_reg_en, addr2_reg_en                 : in  std_logic;
    R_dec, Rp_dec                              : in  std_logic;
    c1_incr, c2_incr, c1p_incr, ch_incr        : in  std_logic;
    c1_load1, c1_load2, c2_load1, c2_load2     : in  std_logic;
    c1p_load, ch_load                          : in  std_logic;
    c1_clr, c2_clr, c1p_clr, ch_clr            : in  std_logic;
    row_col_sel                                : in  std_logic_vector(1 downto 0);
    addr2_reg_dec, node_write                  : in  std_logic;
    distreg_val                                : in  std_logic_vector(31 downto 0);
    nodeid_sel                                 : in  std_logic_vector(1 downto 0);
    n_type_sel, incnt_inc, initial_run         : in  std_logic;
    r_clr, r_inc, a_clr, a_inc                 : in  std_logic;
    a_grt, ext_node                            : out std_logic;
    numsp_1, addr_cnt, row_cnt                 : out std_logic_vector(9 downto 0);
    first_val, initialized, addr_grt           : out std_logic;
    child_cnt_gr, count_gr, all_nodes_done     : out std_logic;
    addr                                       : out std_logic_vector(19 downto 0);
    cnt1, nodeid                               : out std_logic_vector(9 downto 0);
    n_type                                     : out std_logic;
    par                                        : out std_logic_vector(9 downto 0);
    br_len                                     : out std_logic_vector(15 downto 0)
  );
end component;
component mult
  port(
    Datainp1 : in  std_logic_vector(31 downto 0);
    Datainp2 : in  std_logic_vector(15 downto 0);
    Data_out : out std_logic_vector(31 downto 0)
  );
end component;

component divider
  port(
    datainp1 : in  std_logic_vector(31 downto 0);
    divider  : in  std_logic_vector(15 downto 0);
    output   : out std_logic_vector(31 downto 0);
    valid    : out std_logic
  );
end component;

component comparedst
  port(
    Datainp1     : in  std_logic_vector(31 downto 0);
    valid_dst    : in  std_logic;
    distreg_val  : in  std_logic_vector(31 downto 0);
    first_val    : in  std_logic;
    addr         : in  std_logic_vector(19 downto 0);
    distreginp   : out std_logic_vector(31 downto 0);
    distreg_en   : out std_logic;
    rowreginp    : out std_logic_vector(9 downto 0);
    rowreg_en    : out std_logic;
    colreginp    : out std_logic_vector(9 downto 0);
    colreg_en    : out std_logic;
    addr1_reg_en : out std_logic;
    addr2_reg_en : out std_logic
  );
end component;

component numofspreg
  port(
    numsp    : in  std_logic_vector(9 downto 0);
    reset    : in  std_logic;
    clk      : in  std_logic;
    valid_in : in  std_logic;
    valid    : out std_logic;
    regval   : out std_logic_vector(9 downto 0)
  );
end component;

component wmemory
  port(
    Clk      : in  std_logic;
    Reset    : in  std_logic;
    Read     : in  std_logic;
    Write    : in  std_logic;
    numsp    : in  std_logic_vector(9 downto 0);
    Addr     : in  std_logic_vector(9 downto 0);
    Data     : in  std_logic_vector(15 downto 0);
    Data_out : out std_logic_vector(15 downto 0)
  );
end component;

component reg
  port(
    data   : in  std_logic_vector(31 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(31 downto 0)
  );
end component;

component reg_2
  port(
    data   : in  std_logic_vector(9 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(9 downto 0)
  );
end component;

component addr_dd
  port(
    addr      : in  std_logic_vector(19 downto 0);
    reset     : in  std_logic;
    clk       : in  std_logic;
    addr_dd_s : out std_logic_vector(19 downto 0)
  );
end component;

component first_val_ddd
  port(
    first_val       : in  std_logic;
    reset           : in  std_logic;
    clk             : in  std_logic;
    first_val_ddd_s : out std_logic
  );
end component;

component d_f
  port(
    inp   : in  std_logic;
    reset : in  std_logic;
    clk   : in  std_logic;
    opt   : out std_logic
  );
end component;
-- Multiplier and multiplier reg component signals
signal mult_out   : std_logic_vector(31 downto 0);
signal mulreg_val : std_logic_vector(31 downto 0);

-- Adder and adderreg component signals
signal adderout    : std_logic_vector(31 downto 0);
signal adderregval : std_logic_vector(31 downto 0);

-- Adderw and adderwreg component signals
signal adderw_out    : std_logic_vector(15 downto 0);
signal adderw_regval : std_logic_vector(15 downto 0);

-- Colreg component signals
signal Colregval : std_logic_vector(9 downto 0);

-- Controller component signals
signal store_cur_addr                      : std_logic;
signal a_clr, a_inc, r_clr, r_inc          : std_logic;
signal read_wmem, write_wmem               : std_logic;
signal rowreg_en, rowreg_clr               : std_logic;
signal colreg_en, colreg_clr               : std_logic;
signal distreg_clr, distreg_en             : std_logic;
signal mulreg_en, mulreg_clr               : std_logic;
signal adderw_valid                        : std_logic;
signal addregwclr, addregwen               : std_logic;
signal addregclr, addregen                 : std_logic;
signal divreg1en, divreg1clr               : std_logic;
signal initial_run, numsp_val              : std_logic;
signal node_mem_initialize, mem_initialize : std_logic;
signal addr_gen1_en, addr_gen2_en          : std_logic;
signal nodeid_sel                          : std_logic_vector(1 downto 0);
signal n_type_sel                          : std_logic;
signal spcnt_inc, incnt_inc                : std_logic;
signal node_sel_mem_en, parent_mem_en      : std_logic;
signal mem_update, c2_read                 : std_logic;
signal R_dec, Rp_dec                       : std_logic;
signal c1_incr, c2_incr, c1p_incr, ch_incr : std_logic;
signal c1_load1, c1_load2                  : std_logic;
signal c2_load1, c2_load2                  : std_logic;
signal c1p_load, ch_load                   : std_logic;
signal c1_clr, c2_clr, c1p_clr, ch_clr     : std_logic;
signal row_col_sel                         : std_logic_vector(1 downto 0);
signal addr2_reg_dec, node_write           : std_logic;

-- Optsel signals
signal a_grt          : std_logic;
signal ex_n           : std_logic;
signal initialized    : std_logic;
signal addr_grt       : std_logic;
signal child_cnt_gr   : std_logic;
signal count_gr       : std_logic;
signal all_nodes_done : std_logic;
signal addr           : std_logic_vector(19 downto 0);
signal cnt1           : std_logic_vector(9 downto 0);
signal first_val      : std_logic;
signal nodeid         : std_logic_vector(9 downto 0);
signal n_type         : std_logic;
signal par            : std_logic_vector(9 downto 0);
signal br_len         : std_logic_vector(15 downto 0);

-- Divider component signals
signal div_out   : std_logic_vector(31 downto 0);
signal div_valid : std_logic;

-- Divider reg 1 component signals
signal div_reg_val1 : std_logic_vector(31 downto 0);

-- Least Distance reg component signals
signal distreg_val : std_logic_vector(31 downto 0);

-- Comparedst component signals
signal distreginp   : std_logic_vector(31 downto 0);
signal rowreginp    : std_logic_vector(9 downto 0);
signal colreginp    : std_logic_vector(9 downto 0);
signal addr1_reg_en : std_logic;
signal addr2_reg_en : std_logic;

-- numspreg component signals
signal numspreg_val   : std_logic_vector(9 downto 0);
signal numspreg_valid : std_logic;

-- Rowreg component signals
signal Rowregval : std_logic_vector(9 downto 0);

-- Weight memory component signals
signal WData    : std_logic_vector(15 downto 0);
signal WAddr    : std_logic_vector(9 downto 0);
signal vone     : std_logic_vector(15 downto 0);
signal WData_in : std_logic_vector(15 downto 0);

-- Addr twice register
signal addr_dd_s : std_logic_vector(19 downto 0);

-- first_val registered thrice
signal first_val_ddd_s : std_logic;
begin
Mul : mult
  port map(Data_in, WData_in, mult_out);

mulreg : reg
  port map(mult_out, Reset, Clk, mulreg_en, mulreg_clr, mulreg_val);

U0 : adder
  port map(mulreg_val, adderregval, adderout);

adderreg : reg
  port map(adderout, Reset, Clk, addregen, addregclr, adderregval);

U011 : adder_w
  port map(WData_in, adderw_regval, adderw_out);

U012 : adderwreg
  port map(adderw_out, reset, clk, addregwen, addregwclr, adderw_regval);

Colreg : reg_2
  port map(Colreginp, Reset, Clk, colreg_en, colreg_clr, Colregval);
U3 : ctrl_blk
  port map(
    clk => clk, reset => reset, valid_numsp => valid_numsp,
    addr_grt => addr_grt, child_cnt_gr => child_cnt_gr, count_gr => count_gr,
    all_nodes_done => all_nodes_done, initialized => initialized,
    div_valid => div_valid, a_grt => a_grt, ext_node => ex_n,
    r_clr => r_clr, r_inc => r_inc, a_clr => a_clr, a_inc => a_inc,
    c2_read => c2_read, mem_update => mem_update,
    R_dec => R_dec, Rp_dec => Rp_dec,
    c1_incr => c1_incr, c2_incr => c2_incr,
    c1p_incr => c1p_incr, ch_incr => ch_incr,
    c1_load1 => c1_load1, c1_load2 => c1_load2,
    c2_load1 => c2_load1, c2_load2 => c2_load2,
    c1p_load => c1p_load, ch_load => ch_load,
    c1_clr => c1_clr, c2_clr => c2_clr, c1p_clr => c1p_clr, ch_clr => ch_clr,
    row_col_sel => row_col_sel, addr2_reg_dec => addr2_reg_dec,
    node_write => node_write,
    read_mem => read_mem, write_mem => write_mem,
    read_wmem => read_wmem, write_wmem => write_wmem,
    rowreg_clr => rowreg_clr, colreg_clr => colreg_clr,
    distreg_clr => distreg_clr,
    mulreg_en => mulreg_en, mulreg_clr => mulreg_clr,
    addregwclr => addregwclr, addregwen => addregwen,
    addregclr => addregclr, addregen => addregen,
    divreg1clr => divreg1clr, divreg1en => divreg1en,
    initial_run => initial_run, store_cur_addr => store_cur_addr,
    node_mem_initialize => node_mem_initialize,
    mem_initialize => mem_initialize,
    addr_gen1_en => addr_gen1_en, addr_gen2_en => addr_gen2_en,
    rmem_read => rmem_read, rmem_write => rmem_write,
    ad_reg_en => ad_reg_en, ad_reg_clr => ad_reg_clr,
    row_zero => row_zero, numsp_val => numsp_val, valid_td => valid_td,
    nodeid_sel => nodeid_sel, n_type_sel => n_type_sel,
    incnt_inc => incnt_inc, done => done
  );
U_Addr_dd : addr_dd
  port map(addr, reset, clk, addr_dd_s);

U_fv_ddd : first_val_ddd
  port map(first_val, reset, clk, first_val_ddd_s);

Opt_gen : opt_sel
  port map(clk, reset, valid_numsp, numsp, rowregval, colregval,
           store_cur_addr, node_mem_initialize, mem_initialize,
           addr_gen1_en, addr_gen2_en, c2_read, mem_update,
           addr1_reg_en, addr2_reg_en, R_dec, Rp_dec,
           c1_incr, c2_incr, c1p_incr, ch_incr,
           c1_load1, c1_load2, c2_load1, c2_load2, c1p_load, ch_load,
           c1_clr, c2_clr, c1p_clr, ch_clr,
           row_col_sel, addr2_reg_dec, node_write, distreg_val,
           nodeid_sel, n_type_sel, incnt_inc, initial_run,
           r_clr, r_inc, a_clr, a_inc,
           a_grt, ex_n, numsp_1, addr_cnt, row_cnt, first_val,
           initialized, addr_grt, child_cnt_gr, count_gr, all_nodes_done,
           addr, cnt1, nodeid, n_type, par, br_len);

U4 : divider
  port map(adderregval, adderw_regval, div_out, div_valid);

divregister1 : reg
  port map(div_out, Reset, Clk, divreg1en, divreg1clr, avg_dst);

leastdistreg : reg
  port map(distreginp, Reset, Clk, distreg_en, distreg_clr, distreg_val);

U15 : comparedst
  port map(Data_in, valid_dst, distreg_val, first_val_ddd_s, addr_dd_s,
           distreginp, distreg_en, rowreginp, rowreg_en,
           colreginp, colreg_en, addr1_reg_en, addr2_reg_en);

U10 : numofspreg
  port map(numsp, Reset, Clk, numsp_val, numspreg_valid, numspreg_val);

Rowreg : reg_2
  port map(Rowreginp, Reset, Clk, rowreg_en, rowreg_clr, Rowregval);

vone  <= "0000000000000001";
WData <= vone when node_mem_initialize = '1' else adderw_regval;
WAddr <= cnt1 when node_mem_initialize = '1' else addr(19 downto 10);

U12 : wmemory
  port map(Clk, Reset, read_wmem, write_wmem, numsp, WAddr, WData, WData_in);

trout    <= n_type & nodeid & par & br_len;
mem_addr <= addr;

U20 : d_f
  port map(ex_n, Reset, Clk, ext_node);

end upgma_struct;
APPENDIX B
CUSTOM COMPUTING MACHINE HOST PROGRAM SOURCE CODE
UPGMA_ex.h
#ifndef __UPGMATEST_H__
#define __UPGMATEST_H__

/****************************************************
 *
 * Constants and Macros
 *
 ****************************************************/

#define DEFAULT_VERBOSITY   ( FALSE )
#define DEFAULT_SLOT_NUMBER ( 0 )
#define DEFAULT_ITERATIONS  ( 1 )
#define DEFAULT_FREQUENCY   ( 100.0 )

#define IMAGE_FILENAME      ( "pe_addr_mod" )
#define IMAGE_FILENAME_REVD ( "pe_addr_mod" )

#define MEM_BASE         ( 0x0 )
#define LEFT_MEM_OFFSET  ( 0x1000 )
#define RIGHT_MEM_OFFSET ( 0x1200 )

#define MAX_ERR_COUNT ( 32 )

#define NUM_REGISTERS   ( 2 )
#define REGISTER_OFFSET ( 0x2000 )

typedef struct _TestInfo_
{
    WC_DeviceNum DeviceNum;
    WC_DevConfig DeviceCfg;
    WC_Version   Version;
    DWORD        dIterations;
    float        fClkFreq;
    BOOLEAN      bVerbose;
} WC_TestInfo;

/****************************************************
 *
 * Prototypes
 *
 ****************************************************/

WC_RetCode WC_UPGMATest_Main( WC_TestInfo *TestInfo );
WC_RetCode WC_UPGMATest_Init( WC_TestInfo *TestInfo );
WC_RetCode WC_UPGMATest_Run( WC_TestInfo *TestInfo );
WC_RetCode WC_UPGMATest_Shutdown( WC_TestInfo *TestInfo );
WC_RetCode VerifyData( DWORD ref[], DWORD test[], DWORD size );

#endif
UPGMA_ex.c
/****************************************************************************
 *
 * File      : UPGMAtest.c
 *
 * Project   : UPGMA on Wildcard
 *
 * Copyright : Sreesa Akella, Reconfigurable Computing Research Lab 2003
 *
 ****************************************************************************/
#include <stdio.h>
#include <time.h>
#include <math.h>
#if defined(WIN32)
#include <windows.h>
#endif
#include "wcdefs.h"
#include "wc_shared.h"
#include "UPGMA_ex.h"
#include "LAD_Mem_Bridge_WC.h"
/****************************************************************************
 *
 * Function    : main
 *
 * Description : Entry point for the WILDCARD test.
 *               This function is a basic entry point into the test.
 *               It is responsible for
 *               1) Parsing the command line parameters and filling the
 *                  TestInfo struct with those parameters
 *               2) Opening the WILDCARD(tm) board
 *               3) Calling the main example procedure
 *               4) Closing the board when the example completes
 *
 ****************************************************************************/
WC_RetCode
main( int argc, char *argv [] )
{
    WC_RetCode  rc = WC_SUCCESS;
    int         argi;
    WC_TestInfo TestInfo;
    WC_CardType CardType;
    char      **TestLoc = NULL;

    const char *help_string =
        "Usage: memtest <list of options>\n"
        "  Options:\n"
        "   -v        Sets verbose mode. Show progress messages.\n"
        "   -s <num>  Set WILDCARD(tm) device \"slot\" number (default = 0).\n"
        "   -i <num>  Sets the number of times to perform the example.\n"
        "             (default = 1)\n"
        "   -f <num>  Set the memory clock frequency in MHz (default = 100.0)\n"
        "   -h        Show this help.\n";
fprintf( stdout, "WILDCARD(tm) UPGMA_Test Example\n");
    TestInfo.bVerbose    = DEFAULT_VERBOSITY;
    TestInfo.DeviceNum   = DEFAULT_SLOT_NUMBER;
    TestInfo.dIterations = DEFAULT_ITERATIONS;
    TestInfo.fClkFreq    = DEFAULT_FREQUENCY;
    /* Parse the command line parameters */
    for ( argi = 1; argi < argc; argi++ )
    {
        if ( argv[argi][0] == '-' )
        {
            switch ( toupper(argv[argi][1]) )
            {
            case 'H': /* Print the help message */
                fprintf( stdout, "%s\n\n", help_string );
                return (WC_SUCCESS);
                break;

            case 'I': /* Set the number of iterations */
                argi++;
                TestInfo.dIterations = strtoul( argv[argi], TestLoc, 0 );
                /* Error check the result. The following test will be true
                 * only if there was an error in the string conversion above. */
                if (TestInfo.dIterations == 0)
                {
                    fprintf( stdout, "\nWARNING: An invalid or missing iteration value\n");
                    fprintf( stdout, "         was found after the -i option.\n\n");
                    fprintf( stdout, "%s\n\n", help_string );
                    return (ERROR_UNKNOWN_SWITCH);
                }
                fprintf( stdout, "Setting the iteration value to %d\n",
                         TestInfo.dIterations );
                break;

            case 'S': /* Set the device number */
                argi++;
                TestInfo.DeviceNum = strtoul( argv[argi], TestLoc, 0 );
                /* The following tests for a valid slot number */
                if (TestInfo.DeviceNum > WC_MAX_DEVICES)
                {
                    fprintf( stdout, "\n WARNING: Invalid device number!\n");
                    return (ERROR_UNKNOWN_SWITCH);
                }
                else
                {
                    fprintf( stdout, " Setting the device number to %d.\n",
                             TestInfo.DeviceNum );
                }
                break;

            case 'F': /* Set Frequency */
                argi++;
                if (argi < argc)
                {
                    TestInfo.fClkFreq = (float) atof( argv[argi] );
                }
                else
                {
                    printf( "\n WARNING: Invalid Frequency option\n" );
                    printf( "%s\n\n", help_string );
                    return (ERROR_UNKNOWN_SWITCH);
                }
                if (( TestInfo.fClkFreq < WC_MIN_FCLK_MHZ ) ||
                    ( TestInfo.fClkFreq > WC_MAX_FCLK_MHZ ))
                {
                    printf( "\n WARNING: %3.2f is an invalid Frequency option\n",
                            TestInfo.fClkFreq );
                    printf( "%s\n\n", help_string );
                    return (ERROR_UNKNOWN_SWITCH);
                }
                break;

            case 'V': /* Show all errors & set maximum verbosity */
                TestInfo.bVerbose = TRUE;
                fprintf( stdout, " Setting Maximum Verbosity.\n");
                break;

            default: /* Unknown switch option */
                fprintf( stderr, "\n WARNING: Unknown option: \"%s\"\n", argv[argi] );
                fprintf( stderr, "%s\n\n", help_string );
                return (ERROR_UNKNOWN_SWITCH);
            }
        }
        else /* Missing the '-' */
        {
            fprintf( stderr, "\n WARNING: Unknown option: \"%s\"\n", argv[argi] );
            fprintf( stderr, "%s\n\n", help_string );
            return (ERROR_UNKNOWN_SWITCH);
        }
    }
    /* The WILDCARD(tm) MUST be opened before doing any type of
     * access to the card. */
    if (TestInfo.bVerbose)
    {
        fprintf( stdout, "\n Opening Device %d...\n", TestInfo.DeviceNum );
    }

    rc = WC_Open( TestInfo.DeviceNum, 0 );
    DISPLAY_ERROR(rc);

    /* If you are using both the WILDCARD(tm) and the WILDCARD(tm)-II
     * it is a good idea to check the board type before executing any
     * calls. For this example we must have a WILDCARD(tm). */
    rc = WC_GetCardType( TestInfo.DeviceNum, &CardType );
    if (rc != WC_SUCCESS)
    {
        DisplayError(rc);
        return 0;
    }
    else if (CardType != WILDCARD)
    {
        printf("\nERROR : This example requires a WILDCARD(tm).\n"
               "        It will not run on WILDCARD(tm)-II!\n\n");
        return 0;
    }

    /* Once the board is successfully opened, the test may be run */
    rc = WC_UPGMATest_Main( &TestInfo );
    if (rc != WC_SUCCESS)
    {
        DisplayError(rc);
    }

    /* The WILDCARD(tm) should be closed when the program finishes to
     * free driver resources */
    rc = WC_Close( TestInfo.DeviceNum );
    DISPLAY_ERROR(rc);
    return (rc);
}
/****************************************************************************
 *
 * Function    : WC_UPGMATest_Main
 *
 * Parameters  : TestInfo - Test Parameters
 *
 * Description : Initializes the WILDCARD(tm) hardware, and runs the example
 *               TestInfo->dIterations times.
 *
 ****************************************************************************/
WC_RetCode
WC_UPGMATest_Main( WC_TestInfo *TestInfo )
{
    DWORD      dIteration, dErrorCount;
    WC_RetCode rc = WC_SUCCESS;

    /* Print out a few parameters so we know what we are running */
    if (TestInfo->bVerbose)
    {
        fprintf(stdout, "\n TEST PARAMETERS:\n");
        fprintf(stdout, "   Clock Frequency = %f\n", TestInfo->fClkFreq);
        fprintf(stdout, "   # of Iterations = %d\n", TestInfo->dIterations);
        fprintf(stdout, "   Device Number   = %d\n", TestInfo->DeviceNum);
        fprintf(stdout, "   Verbose Mode    = %s\n",
                TestInfo->bVerbose ? "TRUE" : "FALSE");
    }

    /* This routine will put the WILDCARD in a known state.
     * We only need to do this before the first iteration.
     * Each additional iteration only needs to reset the
     * PE to initialize the WILDCARD to a known state because
     * all initialization parameters are kept between resets. */
    rc = WC_UPGMATest_Init( TestInfo );
    //CHECK_RC(rc)

    /* Now that the PE is initialized, we run the test
     * TestInfo->dIterations times, counting the number of
     * failures as we go. */
    for (dIteration = 0, dErrorCount = 0;
         dIteration < TestInfo->dIterations; dIteration++)
    {
        fprintf(stdout, "\n **** Memory Example Iteration [%d] of [%d] ****\n",
                dIteration, TestInfo->dIterations);
        rc = WC_UPGMATest_Run(TestInfo);
        if (rc != WC_SUCCESS)
        {
            DisplayError(rc);
            dErrorCount++;
        }
    }

    /* Let the user know if the example was a success */
    fprintf(stdout, "\n Example Complete! [%d] of [%d] Successful",
            TestInfo->dIterations - dErrorCount, TestInfo->dIterations);

    if (dErrorCount)
    {
        fprintf(stdout, " ERRORS In Example!\n\n");
    }
    else
    {
        fprintf(stdout, " Example SUCCESSFUL!\n\n");
    }

    /* Return SUCCESS if we have made it this far without
     * returning. This means that no fatal errors have
     * occurred. If any test errors occurred, they have
     * already been printed above after each iteration. */
    return (WC_SUCCESS);
}
/****************************************************************************
 *
 * Function    : WC_UPGMATest_Run
 *
 * Parameters  : TestInfo - Test Parameters
 *
 * Description : Runs the Memory Example. The hardware for this example
 *               contains an image with a LAD_Mem_Bridge component for both
 *               the left and the right onboard memories. This gives the
 *               host indirect access to the onboard WILDCARD(tm) memories.
 *
 *               This example will write a random pattern to each of the
 *               memories, read it back, and verify that the read and
 *               write contents are equal.
 *
 ****************************************************************************/
WC_RetCode
WC_UPGMATest_Run( WC_TestInfo *TestInfo )
{
    WC_Mem_Object *Left_Memory, *Right_Memory;
    DWORD          dNumDwords, *pReadBuffer, *pWriteBuffer, index,
                   no_of_values, no_of_species, temp1, temp2,
                   *darray, *dataArray, addr, rem, *bin, value, j;
    BOOLEAN        bIntStatus, done;
    FILE          *f, *f1;
    //time_t tStart, tEnd;
    //double diffTime = 0.0;
    clock_t        tStartClk, tEndClk, tend_wr, tend_done;
    WC_RetCode     rc;
    /* The first step in an application is almost always to reset the   *
     * PE.  Although this is not needed for the first iteration of this *
     * example because WC_UPGMATest_Init has already reset the PE, we   *
     * need to do it again here because subsequent iterations need a    *
     * fresh PE reset.                                                  *
     *                                                                  *
     * One assumption made below is that the time between the two       *
     * WC_PeReset calls is sufficient to reset the PE.  In this         *
     * example, and in general, this is true.  The reset line need      *
     * only be high for at most one clock cycle of the longest period   *
     * clock.                                                           */
    fprintf(stdout, "\n  Resetting PE ... ");
    rc = WC_PeReset( TestInfo->DeviceNum, TRUE );
    CHECK_RC(rc);

    rc = WC_PeReset( TestInfo->DeviceNum, FALSE );
    CHECK_RC(rc);
    fprintf(stdout, "DONE\n");
    rc = WC_IntEnable( TestInfo->DeviceNum, TRUE );
    CHECK_RC(rc);

    /* Reset the interrupts */
    rc = WC_IntReset( TestInfo->DeviceNum );
    CHECK_RC(rc);
    /* Check to make sure that the interrupts are cleared */
    fprintf(stdout, "  Verifying Interrupts are cleared ... ");
    WC_IntQueryStatus( TestInfo->DeviceNum, &bIntStatus );
    if (bIntStatus)
    {
        fprintf(stdout, "ERROR\n\n  Interrupt ERROR : Interrupts were NOT Cleared\n");
        return (WC_ERR_INTERRUPT_TIMEOUT);  /* Not a good error code, but will work for now */
    }
    fprintf(stdout, "DONE\n");
    /* The simplest method of buffer verification is to have different  *
     * buffers for reading and writing.  Below we allocate and          *
     * initialize these buffers.                                        *
     *                                                                  *
     * First, however, we need to find the memory size so we know how   *
     * large a buffer to get.  This information is stored in the        *
     * device information structure filled by WC_UPGMATest_Init.        */

    /* The memory port sizes should be equal, but in case they aren't   *
     * we allocate buffers assuming the largest memory size.            */
    fprintf(stdout, "  Allocating Buffers ... ");

    if (TestInfo->DeviceCfg.MemoryDwords[0] >= TestInfo->DeviceCfg.MemoryDwords[1])
    {
        dNumDwords = TestInfo->DeviceCfg.MemoryDwords[0];
    }
    else
    {
        dNumDwords = TestInfo->DeviceCfg.MemoryDwords[1];
    }

    if (TestInfo->bVerbose)
        fprintf(stdout, "\n   * Allocating Read Buffer ... ");
    pReadBuffer = malloc(dNumDwords * sizeof(DWORD));
    if (!pReadBuffer)
        return (ERROR_MEMORY_ALLOC);
    memset(pReadBuffer, 0, dNumDwords * sizeof(DWORD));  /* clear the full buffer, in bytes */

    if (TestInfo->bVerbose)
        fprintf(stdout, "DONE\n   * Allocating Write Buffer ... ");
    pWriteBuffer = malloc(dNumDwords * sizeof(DWORD));
    if (!pWriteBuffer)
    {
        free(pReadBuffer);
        return (ERROR_MEMORY_ALLOC);
    }
    /* NOTE: The code below reads the input file and writes the test    *
     * data into the buffers.  dNumDwords is then set to the number of  *
     * words to be written, based on the number of taxa.  (For 4-taxon  *
     * data this is 6 words.)                                           */
    darray = malloc(dNumDwords * sizeof(DWORD));
    if (!darray)
        return (ERROR_MEMORY_ALLOC);
    memset(darray, 0, dNumDwords * sizeof(DWORD));

    dataArray = malloc(dNumDwords * sizeof(DWORD));
    if (!dataArray)
        return (ERROR_MEMORY_ALLOC);
    memset(dataArray, 0, dNumDwords * sizeof(DWORD));

    bin = malloc(10 * sizeof(DWORD));
    if (!bin)
        return (ERROR_MEMORY_ALLOC);
    memset(bin, 0, 10 * sizeof(DWORD));
    /* For 4-taxon data: no_of_values = 6; no_of_species = 4.           *
     * Read the distance data from the input file and put it in darray. */
    f = fopen("C:/akella/thesis/testdatagen/testdata/taxon16/WCprogram/testdata_16D2.txt", "r");
    if (f == NULL)
    {
        fprintf(stdout, "DONE\n  Open of test file failed.");
        return 1;
    }

    fscanf(f, "%d", &value);
    no_of_species = value;
    no_of_values  = (no_of_species * (no_of_species - 1)) / 2;
    /* printf("\nnoOfSpecies - %d, noOfValues - %d", no_of_species, no_of_values); */

    for (index = 0; index <= no_of_values; index++)
        darray[index] = 0;

    index = 0;
    while (fscanf(f, "%d", &value) != EOF)
    {
        darray[index] = value;
        index++;
    }
    fclose(f);

    /* Reading from darray and placing the data, in matrix format, into *
     * dataArray.  The pair (temp1, temp2) indexes the distance between *
     * taxa temp1 and temp2; the address is temp1 * 2^10 + temp2.       */
    /*
    temp1 = 0;
    temp2 = 0;
    for (index = 0; index < no_of_values; index++)
    {
        // read the file here and get one value
        // for 4 taxon data
        value = darray[index];

        if (temp2 < no_of_species - 1)
            temp2 = temp2 + 1;
        else
        {
            temp1 = temp1 + 1;
            temp2 = temp1 + 1;
        }

        for (j = 0; j < 10; j++)
        {
            bin[j] = 0;
        }

        addr = 0;
        rem  = temp1;
        for (j = 0; j < 10 && rem >= 1; j++)
        {
            bin[j] = rem % 2;
            rem    = rem / 2;
        }

        for (j = 0; j < 10; j++)
        {
            addr = addr + bin[j] * (pow(2, (10 + j)));
        }
        addr            = addr + temp2;
        dataArray[addr] = value;   // data read from file
    }
    */

    dNumDwords = no_of_values;
    for (index = 0; index < dNumDwords; index++)
    {
        pWriteBuffer[index] = darray[index];
    }
    /* Now we need to allocate and initialize the memory structures for *
     * the left and right memories.  The WC_Mem_Create function will    *
     * allocate the structure and fill it with the device number,       *
     * memory offset and flags.                                         *
     *                                                                  *
     * NOTE : LEFT_MEM_OFFSET and RIGHT_MEM_OFFSET refer to LAD bus     *
     * offsets, NOT memory addresses.  Specific memory addresses to     *
     * read and write are passed to the WC_Mem_Read and WC_Mem_Write    *
     * procedures.                                                      */
    if (TestInfo->bVerbose)
        fprintf(stdout, "DONE\n   * Allocating Left Memory Struct ... ");
    Left_Memory = WC_Mem_Create( TestInfo->DeviceNum, LEFT_MEM_OFFSET, 0 );
    if (!Left_Memory)
    {
        free(pReadBuffer);
        free(pWriteBuffer);
        return (ERROR_MEMORY_ALLOC);
    }

    if (TestInfo->bVerbose)
        fprintf(stdout, "DONE\n   * Allocating Right Memory Struct ... ");
    Right_Memory = WC_Mem_Create( TestInfo->DeviceNum, RIGHT_MEM_OFFSET, 0 );
    if (!Right_Memory)
    {
        free(pReadBuffer);
        free(pWriteBuffer);
        WC_Mem_Release(Left_Memory);
        return (ERROR_MEMORY_ALLOC);
    }
    printf("DONE\n");

    /* With the memory structures initialized, we can now read and      *
     * write to the memories.  First we will write to the LEFT memory.  */
    fprintf(stdout, "  Testing LEFT Memory ... ");

    /* time(&tStart); */
    tStartClk = clock();

    /* The following two calls, WC_Mem_Write and WC_Mem_Read, are       *
     * defined in LAD_Mem_Bridge.c.  They use a specific protocol to    *
     * interface with the LAD_Mem_Bridge component in the PE to read    *
     * and write data in memory.  See the documentation inside the      *
     * LAD_Mem_Bridge.c file for details of this memory protocol.       */
    if (TestInfo->bVerbose)
        fprintf(stdout, "\n   * Writing to memory ... ");
    rc = WC_Mem_Write(Left_Memory, MEM_BASE, dNumDwords, pWriteBuffer);
    if (rc != WC_SUCCESS)
    {
        free(pReadBuffer);
        free(pWriteBuffer);
        WC_Mem_Release(Left_Memory);
        WC_Mem_Release(Right_Memory);
        return (rc);
    }

    /* Write the number of taxa to the LAD register to start the UPGMA  *
     * design.                                                          */
    if (TestInfo->bVerbose)
        fprintf(stdout, "DONE\n   * Writing no of taxa to Registers ... ");
    rc = WC_PeRegWrite(TestInfo->DeviceNum, REGISTER_OFFSET, NUM_REGISTERS, &no_of_species);
    if (rc != WC_SUCCESS)
    {
        fprintf(stdout, "\n   * Writing the number of taxa failed!");
    }
    tend_wr = clock();

    value = tend_wr - tStartClk;
    fprintf(stdout, "DONE\nTime taken for Memory write is: %d", value);
    /* Read the register file until the done signal has been set. */
    /* printf("DONE\nThe value of darray[1] is %d", darray[1]); */
    done = FALSE;
    while (!done)
    {
        rc = WC_PeRegRead(TestInfo->DeviceNum, REGISTER_OFFSET, NUM_REGISTERS, darray);
        if (rc != WC_SUCCESS)
        {
            fprintf(stdout, "\n   * Could not read the register!");
        }
        /* printf("\nThe value of darray[1] is %d", darray[1]); */
        /*
        rc = WC_Mem_Read(Right_Memory, MEM_BASE, dNumDwords, pReadBuffer);
        if (rc != WC_SUCCESS)
        {
            free(pReadBuffer);
            free(pWriteBuffer);
            WC_Mem_Release(Left_Memory);
            WC_Mem_Release(Right_Memory);
            return (rc);
        }
        printf("\nThe value of rmem[0] is %d", pReadBuffer[0]);
        */
        if (darray[1] == 1)
        {
            tend_done = clock();
            done      = TRUE;
        }
        else
            done = FALSE;
    }

    /* tend_done = clock(); */
    value = (tend_done - tend_wr);
    /* fprintf(stdout, "\ntdone = %d, tendWrite= %d", tend_done, tend_wr); */
    fprintf(stdout, "\nTime taken for done signal is : %d", value);
    /*
    // Wait for the interrupt which indicates that the design has completed running.
    fprintf(stdout, "  Waiting for interrupt ... ");
    rc = WC_IntWait(TestInfo->DeviceNum, 1000 );
    CHECK_RC(rc);
    */
    /* Read the RIGHT memory after the design signals completion.       *
     * dNumDwords is set to the number of DWORDs to be read: a tree     *
     * over n taxa has 2n - 1 nodes (n leaves plus n - 1 internal).     */
    dNumDwords = (2 * no_of_species) - 1;
    if (TestInfo->bVerbose)
        fprintf(stdout, "\n   * Reading from memory ... ");
    rc = WC_Mem_Read(Right_Memory, MEM_BASE, dNumDwords, pReadBuffer);
    if (rc != WC_SUCCESS)
    {
        free(pReadBuffer);
        free(pWriteBuffer);
        WC_Mem_Release(Left_Memory);
        WC_Mem_Release(Right_Memory);
        return (rc);
    }

    /* time(&tEnd); */
    tEndClk = clock();
    /* diffTime = difftime(tEnd, tStart); */

    value = tEndClk - tend_done;
    fprintf(stdout, "DONE\nTime taken for memory read is: %d", value);

    value = (tEndClk - tStartClk);
    printf("\nTotal Time taken: %d", value);
    /* Print the tree data. */
    f1 = fopen("C:/akella/thesis/thesisOct03/OutputData/TreeData_16D1.txt", "w");
    if (f1)
    {
        for (index = 0; index < dNumDwords; index++)
            fprintf(f1, "\nTaxon - %d: %li", index, pReadBuffer[index]);
        fclose(f1);
    }

    free(pReadBuffer);
    free(pWriteBuffer);
    WC_Mem_Release(Left_Memory);
    WC_Mem_Release(Right_Memory);
    printf("\nDONE\n");
    return (rc);
}
/****************************************************************************
 *
 *  Function : WC_UPGMATest_Init
 *
 *  Notes    : This function puts the card into a known state before the
 *             test begins.  It is generally a bad idea to assume the
 *             state of the WILDCARD's hardware when a program starts.
 *             Previous programs can leave the hardware in an unknown
 *             state, and its state on power-on is undefined.  If an
 *             application requires a specific state of the hardware,
 *             explicitly set that state.
 *
 *             Before running any application the following steps
 *             should be performed in the order given.
 *
 *             1) Toggle Power
 *             2) Assert the processing element reset line
 *             3) Program the processing element
 *             4) Set the clock frequency
 *             5) Configure Interrupts
 *             6) Deassert the processing element reset line
 *
 ****************************************************************************/
WC_RetCode
WC_UPGMATest_Init( WC_TestInfo *TestInfo )
{
    WC_RetCode rc = WC_SUCCESS;
    /* A great deal of useful information is available from the ID      *
     * PROM on the WILDCARD(tm), including processing element part      *
     * type, memory size, speed grade, etc.  The two API calls,         *
     * WC_DeviceInformation and WC_GetVersion, are used to retrieve     *
     * that information.  The procedure DisplayConfiguration,           *
     * defined in wc_shared.c, displays this information to the         *
     * screen.                                                          *
     *                                                                  *
     * Below we use the API calls to store the WILDCARD(tm) device      *
     * and version information in the TestInfo struct for use later     *
     * in the example, as well as display the information if            *
     * verbosity is on.                                                 */
    rc = WC_DeviceInformation( TestInfo->DeviceNum, &(TestInfo->DeviceCfg) );
    CHECK_RC(rc);
    rc = WC_GetVersion( TestInfo->DeviceNum, &(TestInfo->Version) );
    CHECK_RC(rc);

    if (TestInfo->bVerbose)
    {
        rc = DisplayConfiguration(TestInfo->DeviceNum);
        CHECK_RC(rc);
    }
    /* It should NOT be assumed that the WILDCARD(tm) processing        *
     * element currently has power.  Below we toggle the power to       *
     * the processing element, leaving it ON for the remainder of       *
     * the example.                                                     */
    if (TestInfo->bVerbose)
    {
        fprintf(stdout, "  Toggling processing element's power...\n");
    }
    rc = WC_PeApplyPower( TestInfo->DeviceNum, FALSE );
    CHECK_RC(rc);
    rc = WC_PeApplyPower( TestInfo->DeviceNum, TRUE );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf(stdout, "  PE power turned on.\n");
    /* The WILDCARD(tm) has a dedicated reset line controlled by        *
     * the WC_PeReset API call.  In general it is advantageous          *
     * to have the PE in reset when it is being set up.  This           *
     * will prevent the design from starting execution until the        *
     * WILDCARD(tm) has been correctly initialized.                     *
     *                                                                  *
     * Below we assert the reset line and keep it asserted              *
     * until the processing element has been programmed, the            *
     * clock has been set, and interrupts have been initialized.        *
     *                                                                  *
     * If the Reset_STD_If has been instantiated in the VHDL,           *
     * this API call will set the signal 'Global_Reset' high.           */
    if (TestInfo->bVerbose)
        fprintf(stdout, "  Asserting PE Reset Line...\n");
    rc = WC_PeReset( TestInfo->DeviceNum, TRUE );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf(stdout, "  PE RESET line asserted.\n");
    /* As of the creation of this file there are 4 revisions of         *
     * the WILDCARD(tm) hardware (Revs A to D).  Below we use           *
     * the information in TestInfo->Version to determine the            *
     * revision of the card in this slot.                               */
    rc = WC_GetVersion( TestInfo->DeviceNum, &TestInfo->Version );
    CHECK_RC(rc);

    if (TestInfo->bVerbose)
        fprintf(stdout, "  Loading PE Image...\n");

    if (((TestInfo->Version.Hardware & WC_MAJOR_VER_MASK) >> WC_MAJOR_VER_SHIFT) == 4)
    {
        /* REV D WILDCARD(tm)                                           *
         *                                                              *
         * The ProgramPeFromFile procedure, found in ws_shared.c,       *
         * will append .\<PART TYPE>\<PACKAGE TYPE>\ to the             *
         * filename, and load that file into the processing element.    *
         * For a REV D this path will be                                *
         * .\XCV300E\PKG_BG352\<IMAGE_FILENAME_REVD>                    */
        rc = ProgramPeFromFile( TestInfo->DeviceNum, IMAGE_FILENAME_REVD );
        CHECK_RC(rc);
    }
    else
    {
        /* REV A-C WILDCARD(tm)                                         *
         *                                                              *
         * The ProgramPeFromFile procedure, found in ws_shared.c,       *
         * will append .\<PART TYPE>\<PACKAGE TYPE>\ to the             *
         * filename, and load that file into the processing element.    *
         *                                                              *
         * For a REV C this path will be                                *
         * .\XCV300E\PKG_BG352\<IMAGE_FILENAME>                         *
         *                                                              *
         * For REVs A or B this path will be                            *
         * .\XCV300\PKG_BG352\<IMAGE_FILENAME>                          */
        rc = ProgramPeFromFile( TestInfo->DeviceNum, IMAGE_FILENAME );
        CHECK_RC(rc);
    }

    if (TestInfo->bVerbose)
        fprintf(stdout, "  PE Image Loaded.\n");
    /* The WILDCARD(tm) has one on-board programmable oscillator.       *
     * WC_ClkSetFrequency sets the frequency of that clock.  We         *
     * always want to set the clock to the appropriate frequency        *
     * before running our application.                                  */
    if (TestInfo->bVerbose)
        fprintf(stdout, "  Initializing the clock to %f...\n", TestInfo->fClkFreq);
    rc = WC_ClkSetFrequency( TestInfo->DeviceNum, TestInfo->fClkFreq );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf(stdout, "  Clock initialized.\n");

    /* This application uses the PE interrupt line to generate an       *
     * interrupt to the host.  Interrupts must be enabled before we     *
     * can receive an interrupt from the PE.                            */
    if (TestInfo->bVerbose)
        fprintf(stdout, "  Masking PE Interrupt...\n");
    rc = WC_IntEnable( TestInfo->DeviceNum, TRUE );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf(stdout, "  PE Interrupt Masked.\n");
    /* The order of mask / reset may be important in some               *
     * circumstances.  In our case it is not.  We mask, then clear      *
     * anything that may have happened before the masking operation.    */
    if (TestInfo->bVerbose)
        fprintf(stdout, "  Resetting PE Interrupt...\n");
    rc = WC_IntReset( TestInfo->DeviceNum );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf(stdout, "  PE Interrupt Reset.\n");
    /* Lastly, we remove the PE from the RESET state.  When             *
     * the Reset_STD_IF is instantiated in the VHDL, this               *
     * will set the VHDL signal 'Global_Reset' low.                     */
    if (TestInfo->bVerbose)
        fprintf(stdout, "  De-asserting PE Reset Line...\n");
    rc = WC_PeReset( TestInfo->DeviceNum, FALSE );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf(stdout, "  PE RESET line de-asserted.\n");

    return (rc);
}