DESIGN AND ANALYSIS OF A CUSTOM COMPUTING ARCHITECTURE FOR
THE UPGMA BIOINFORMATICS ALGORITHM
by
Sreesa Akella
Bachelor of Science, Andhra University, 1998
__________________________________________________
Submitted in Partial Fulfillment of the
Requirements for the Degree of Master of Science in the
Department of Computer Science and Engineering
College of Engineering and Information Technology
University of South Carolina
2003
______________________________
Director of Thesis
Department of Computer Science and Engineering

______________________________
2nd Reader
Department of Computer Science and Engineering

______________________________
3rd Reader
Department of Computer Science and Engineering

______________________________
Dean of the Graduate School
ACKNOWLEDGEMENTS
I would like to express my deepest gratitude to my thesis advisor, Dr. James P.
Davis for the relentless motivation and support he provided, aiding me to complete my
thesis on time. His unflinching optimism, undying enthusiasm and focus towards this
project inspired me to a great degree. His constant advice pushed me to look at a
problem from a different perspective and helped me visualize concepts in a broader manner.
I would like to extend my appreciation to Dr. Duncan Buell and Dr. John Rose for
their continuous guidance and inspiration. Their valuable advice from time to time
steered this project in the right direction.
I would also like to thank my parents and friends, who have been a constant force
of motivation and support that sustained me through tough times and helped me achieve
this goal.
ABSTRACT
In recent years, reconfigurable custom computing has become an increasingly
viable option for implementing high-performance computing applications.
Reconfigurable VLSI logic, on which custom computing systems are built, provides
several orders of magnitude speed-up in execution performance of algorithms over the
execution of these on conventional microprocessor-based systems. In addition, such
systems have the flexibility to program--and reprogram via reconfiguration--the actual
logic functions of the VLSI circuit with different applications in time and space. Custom
computing systems are implemented using FPGA custom-logic devices that are easily
and quickly programmed by an end-user. This research presents the design and analysis
of a custom computing application architecture for the UPGMA Bioinformatics algorithm
implemented on an FPGA-based custom-computing platform. We present the
Bioinformatics problem domain and architectures that were implemented and assessed.
We also discuss the final architecture created and present results of the system
performance, as measured and compared against that of the UPGMA algorithm written in
C, running on a single-processor Pentium® PC.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS.............................................................................................II
ABSTRACT.....................................................................................................................III
TABLE OF CONTENTS................................................................................................IV
LIST OF TABLES.........................................................................................................VII
LIST OF FIGURES.....................................................................................................VIII
INTRODUCTION.............................................................................................................1
1.1 VON NEUMANN VERSUS RECONFIGURABLE CUSTOM COMPUTING................................1
1.2 CUSTOM LOGIC DESIGN VERSUS CUSTOM COMPUTING........................................2
1.3 FIELD PROGRAMMABLE GATE ARRAYS.....................................................3
1.4 APPLICATION PROGRAMMING AND DESIGN STYLES..........................................5
1.5 THESIS PROPOSAL....................................................................7
1.5.1 Thesis Research Objective and Tasks.................................................................8
BACKGROUND..............................................................................................................11
2.1 PHYLOGENETICS AND TREE-RECONSTRUCTION METHODS.....................................11
2.1.1 Background on trees.............................................................12
2.1.2 Phylogenetic Algorithms.........................................................13
2.2 THE UPGMA.........................................................................14
2.2.1 Algorithm.......................................................................14
2.2.2 Complexity and Bottlenecks on UPGMA.............................................16
2.3 FIELD PROGRAMMABLE GATE ARRAYS....................................................17
2.3.1 Input Output Blocks (IOBs)......................................................18
2.3.2 Configurable Logic Blocks (CLBs)................................................19
2.3.3 Programmable Routing Matrix.....................................................20
2.3.4 Resources on a Virtex-E chip....................................................21
2.4 RECONFIGURABLE COMPUTING...............................................................................22
DISCUSSION OF THE WILDCARD CUSTOM COMPUTING PLATFORM......25
3.1 THE ANNAPOLIS WILDCARD™ SYSTEM....................................................25
3.2 THE WILDCARD™ SYSTEM VHDL MODEL...................................................27
3.3 WILDCARD™ HOST PROGRAMMING........................................................29
3.3.1 Opening and Closing the WILDCARD™ board.........................................30
3.3.2 Clock Control...................................................................31
3.3.3 Processing Element and Interrupt Control........................................31
3.3.4 Memory Control..................................................................32
3.4 PE EMBEDDED APPLICATION INITIALIZATION.........................................................33
CUSTOM COMPUTING DESIGN OF UPGMA........................................................34
4.1 VLSI DESIGN FLOW..................................................................34
4.2 UPGMA PROJECT DESIGN FLOW.........................................................36
4.3 UPGMA DESIGN......................................................................37
4.3.1 Design Parameters...............................................................38
4.3.2 Design Datapath.................................................................39
4.3.3 Design Architecture.............................................................41
4.3.1 Adder...........................................................................41
4.3.2 Add Register....................................................................42
4.3.3 Height Adder....................................................................43
4.3.4 Height Register.................................................................43
4.3.5 Multiplier......................................................................43
4.3.6 Multiplier Register.............................................................44
4.3.7 Divider Unit....................................................................44
4.3.8 Divider Register................................................................44
4.3.9 Comparator......................................................................45
4.3.10 Least Distance Register........................................................45
4.3.11 Row and Column Registers.......................................................45
4.3.12 Controller.....................................................................45
4.3.13 Counter Units..................................................................47
4.3.14 Multiplexers...................................................................47
4.3.15 Address Generator..............................................................47
4.3.16 Output Generator...............................................................51
4.3.17 Height memory..................................................................51
4.3.18 Off-chip Memory Banks..........................................................51
4.3.19 Addressing Schemes.............................................................53
4.3.20 Top-level Block................................................................56
4.4 DESIGN VERIFICATION.............................................................................................56
EXPERIMENTAL DATA SET AND PERFORMANCE MEASUREMENT...........58
5.1 EXPERIMENTAL APPARATUS FOR UPGMA..................................................58
5.2 Generating Random Taxa Test Data Sets.............................................60
5.3 Measuring Time....................................................................62
EXPERIMENTAL METHOD AND RESULTS...........................................................65
6.1 RUNNING THE EXPERIMENTS...........................................................65
6.2 EXPERIMENTAL RESULTS FOR LATENCY..................................................66
6.3 BOUNDING TIME COMPLEXITY..........................................................72
6.4 BENCHMARKING AGAINST PHYLIP.......................................................75
SUMMARY AND CONCLUSIONS..............................................................................83
7.1 SUMMARY OF RESEARCH CONTRIBUTIONS.................................................83
7.2 CONCLUSIONS.......................................................................84
7.3 FUTURE WORK.......................................................................86
7.3.1 Memory size and Memory address schemes..........................................86
7.3.2 Latency for a Memory Read.......................................................87
7.3.3 Device size.....................................................................88
BIBLIOGRAPHY............................................................................................................90
APPENDIX A...................................................................................................................93
VHDL SOURCE CODE................................................................................................93
APPENDIX B.................................................................................................................162
CUSTOM COMPUTING MACHINE HOST PROGRAM SOURCE CODE...........162
LIST OF TABLES
TABLE 1 Virtex – E Chip Resources………………...……………….…………….21
TABLE 2 Address Mapping…………….…………………………….…………….55
TABLE 3 Timing Results for permuted data for 32 taxa dataset….….…………….67
TABLE 4 Latency Values for Datasets at Generated Number of Taxa.…………….71
TABLE 5 PHYLIP Run-time Raw Dataset………………………..….…………….78
TABLE 6 Data Comparison Between Hardware and Software UPGMA
Implementations………………………………………………………….79
LIST OF FIGURES
FIGURE 1 Architecture of an FPGA Device….………………………………………4
FIGURE 2 A Phylogenetic Tree showing a Relationship between Four Species...….12
FIGURE 3 Distance Matrix…………………………………………….…………….15
FIGURE 4 Structure of Xilinx XCV300E Device...………………….…………….18
FIGURE 5 Virtex – E Input Output Block Architecture……………….…………….19
FIGURE 6 A Two-Slice Virtex – E CLB...…………………………….…………….20
FIGURE 7 The WILDCARD™ Platform Block Diagram…………….…………….26
FIGURE 8 The WILDCARD™ Software Design Hierarchy...…….….…………….30
FIGURE 9 An HDL-based Design Process Model…………………….…………….35
FIGURE 10 Design Datapath...………………………………………….…………….40
FIGURE 11 Block Diagram of UPGMA Architecture………………….…………….42
FIGURE 12 The Controller Algorithm………………………………….…………….46
FIGURE 13 Typical Read Cycle from Memory..……………………….…………….52
FIGURE 14 Typical Write Cycle from Memory.……………………….…………….52
FIGURE 15 Distance Matrix…………………………………………….…………….53
FIGURE 16 Test Data Generator Input Dialog Box………………………………......61
FIGURE 17 Frequency Distribution for Latency versus Taxa Data Set Permutation...67
FIGURE 18 Mean Latency versus Number of Taxa (Normal Scale)..….…………….69
FIGURE 19 Latency versus Number of Taxa..………………………….…………….70
FIGURE 20 Bounding of Latency by time Complexity Functions..…….…………….73
FIGURE 21 Bounding Latency by Time Complexity Functions Computed in
Excel……………………………………………………………….…….74
FIGURE 22 PHYLIP C run-time performance...……………………….………….….76
FIGURE 23 PHYLIP C run-time performance with Time-Complexity Bounding.…..77
FIGURE 24 Plotting the performance improvement over PHYLIP as Taxa Count
grows.…………………………………………………………………….80
FIGURE 25 Plotting the performance difference as Taxa count grows...……….…….81
FIGURE 26 Plotting the performance difference as Taxa count grows (Log Plot).…..82
CHAPTER 1
INTRODUCTION
1.1 Von Neumann versus Reconfigurable Custom Computing
In recent years, reconfigurable custom computing has become an increasingly
viable option for implementing applications requiring high-performance or complex
computations. It is an area that is not as mature as the use of conventional computing
architectures. Traditionally, general-purpose computing involves a serial thread of
executing code running on one or more microprocessors. This microprocessor-based
computing paradigm is considered "general-purpose" in that the processor can be
programmed to run any task--that is, any application program running on an
operating system or monitor program. Once a processor has been designed and
fabricated, the single processor’s IC can solve multiple problems at different points of
time, by fetching program instructions and data from memory, decoding them to
determine an execution plan, then executing each such instruction, in turn.
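To make the fetch-decode-execute cycle concrete, it can be sketched as a toy instruction interpreter in C; the three-opcode instruction set below is invented purely for illustration and is not part of this thesis's design.

```c
/* Toy von Neumann-style interpreter. Each cycle fetches the instruction
   at the program counter, decodes its opcode, then executes it; the
   three-opcode ISA here is invented for illustration only. */
enum { OP_LOAD, OP_ADD, OP_HALT };

typedef struct { int op; int arg; } Instr;

int run(const Instr *prog) {
    int acc = 0;                    /* accumulator register */
    for (int pc = 0; ; pc++) {      /* fetch instruction at pc */
        Instr i = prog[pc];
        switch (i.op) {             /* decode, then execute */
        case OP_LOAD: acc = i.arg;  break;
        case OP_ADD:  acc += i.arg; break;
        case OP_HALT: return acc;
        }
    }
}
```

For example, running the program {LOAD 5, ADD 7, ADD 8, HALT} through run() returns 20, one instruction per simulated cycle--the serialization that spatial, reconfigurable architectures avoid.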
Reconfigurable computing can also be called "general-purpose", although it uses a
different architecture and supporting application development paradigm for computation.
Unlike a microprocessor, which has its computation as a set of sequential instructions
fetched from system memory, reconfigurable architectures generally compute a function
by configuring functional units and wiring them up in space. This allows a parallel
computation of operators and direct dataflow from the producers of an intermediate result
to the consumers [1, 2].
1.2 Custom Logic Design versus Custom Computing
Application Specific Integrated Circuits (ASICs) could also be used to implement
a design and optimize it to achieve high performance employing spatial architectures.
ASICs, however, are designed using custom logic techniques, creating design artifacts
tailored for a specific application, and thus cannot be reconfigured to perform different
applications. Therefore, although these systems provide high performance through
application-specific optimization, they are not “general purpose”. Another drawback of
ASIC systems is the huge manufacturing cost associated with them.
Reconfigurable systems, on which custom computing systems are built, provide
very good performance and the flexibility to program--and reprogram via reconfiguration
of the logic functionality--the actual device logic, with different applications in time and
space. Additionally, these systems are implemented using FPGA devices that are easily
and quickly programmable by an end-user, are available at affordable prices, and thus
deliver user-defined functionality at a low cost. The performance and logic density of a
single FPGA device have been improving in recent years, leading to more powerful
reconfigurable architectures targetable for a wider range of applications. This has opened
up the use of FPGAs, typically employed in the creation of logic controllers, as
processing elements (PEs) in reconfigurable arrays in applications for high-performance
computing.
1.3 Field Programmable Gate Arrays
In the past few years, the reconfigurable device market has grown considerably
with the availability of a wide range of devices for VLSI systems--one such device being
a Field Programmable Gate Array (FPGA). FPGAs have evolved considerably in the
recent past, with the primary development being the ability to download a bitstream
representing the digital logic functions onto an array of pre-defined arithmetic, logical
and steering resources, so they have become the primary device for building
reconfigurable and adaptive machines. They were originally designed as prototype
devices used for pre-fabrication design emulation. This design activity was employed to
verify the design before fabrication, to avoid the fallout of post-fabrication design error.
The Xilinx FPGA devices we consider in this thesis have a standard architecture,
which is shown in Figure 1 [3].
FPGAs consist of an array of resource types: configurable logic blocks (CLBs),
input/output blocks (IOBs), and programmable interconnect resources. This standard
architecture can be configured, and reconfigured if necessary, by an end user to
implement a particular functionality. The logic blocks are used to implement the required
logic gate and storage elements of the design. The interconnect can be programmed to
appropriately connect the logic blocks to realize a larger functional unit specified for use
by the application.
Figure 1. Architecture of an FPGA Device.
For purposes of consideration in this thesis, the design process of the FPGA device
has the following steps:
1. Model the design using a hardware description language such as VHDL, or through
schematic capture.
2. Synthesize the design to generate a netlist.
3. Map the design to the FPGA logic blocks.
4. Place and route the design, choosing specific logic blocks to use on the FPGA
and allocating the wire segments that interconnect these logic blocks.
5. Download the design as a bitstream onto the target FPGA chip.
Steps 2 through 5 are automated and are performed by an assortment of design tools
generally provided by the FPGA device vendor. Some of the major FPGA device
manufacturers and vendors in the market are Xilinx, Actel, and Altera.
In order to use these devices for reconfigurable computing applications, one has
to deal with a number of FPGA issues so as to effectively implement the design. The
computational requirements of the application must be identified and its mapping to the
FPGA device must be evaluated via estimation. This is no easy task, and there is no
standard method to assemble designs. The FPGA tools, which play a major part in this
process, are being continuously improved by the vendors to be more efficient in their
mapping of design architecture to design resources. The trend is that, over time, the
construction of reconfigurable computing systems on FPGAs will be more like software
programming than the hardware design process for custom VLSI that exists today.
1.4 Application Programming and Design Styles
The process of converting a specification into an implementation on FPGA
devices can be addressed in different ways. Different design styles lead to different
interpretations of the specification—a formal or informal description of the application’s
algorithm. An algorithm can be thought of as a set of processing steps for transforming
data by executing a series of computations [4]. The algorithm needs to be interpreted by
a machine to perform the work. Choosing the elements that make up the machine defines
its architecture, and this necessitates looking at different architectural, or design, styles.
Traditionally there have been two generic architectural styles: the software paradigm and
the hardware paradigm [4]. The software paradigm looks at implementing an algorithm
through use of an instruction code sequence that is interpreted by a microprocessor. In
contrast, using the hardware paradigm, an algorithm is mapped onto storage and
functional units that perform the computation without the use of an intermediate
instruction set.
Under the software paradigm, a program for the algorithm is written in a high
level language such as C/C++, which is compiled into a low-level instruction set for the
processor to execute on an underlying hardware with a fixed architecture. A hardware
implementation would look at implementing the design directly onto a hardware device
through mapping to storage and functional units, avoiding the compile-time and operating
system overhead present in the software paradigm. This can provide considerable speed-
up--on the order of two orders of magnitude--and thus provide a much higher-performance
solution; however, a VLSI hardware application solution generally comes at a higher
cost, since fabricating the implementation on application specific devices is expensive.
At the same time, such application formulations using application-specific VLSI custom
logic are not general purpose, thus necessitating different implementations for different
algorithms. In contrast, the software model would yield a generally lower-performance
solution through the overhead associated with instruction fetch-decode and execute;
however, the solution would generally be cheaper, since microprocessors are mass-
produced, reusable commodity off-the-shelf products, and programming them is not a
difficult task. Furthermore, there are more software-trained professionals who can write
programs on general-purpose processors than there are design engineers who can design
custom-logic VLSI.
FPGAs provide a means to build general-purpose, reconfigurable machines at a
lower cost1. This leads to a new design style that can be referred to as the reconfigurable-
computing paradigm, also referred to as the configurable hardware paradigm [4]. This
paradigm supports the implementation of algorithms by providing the performance
benefits from mapping directly onto a hardware platform at a relatively low cost. Thus, it
would be interesting to look at implementing various applications on reconfigurable
platforms and evaluating their performance as compared to implementations using the
software paradigm.

1 Such fixed cost is referred to as NRE, or non-recurring engineering costs, which are
associated with the specification, design, implementation, mapping, and test of the logic
functions implementing an application on a VLSI device substrate. This is in contrast to
the variable costs associated with the fabrication and production of finished devices,
which is based on the volume of production--itself based on the demand.
Such performance would include conventional notions of latency associated with
carrying out computation, comparing between an application-specific software solution
running on a conventional processor architecture (or even among a collection of
processors, thus distributing the algorithm’s execution across multiple, communicating
processing elements), and also throughput of the architecture to run streaming
computation, if appropriate for the application. However, evaluating performance could
also include comparing the design time of the application—comparing the time to
architect, design, implement and test the application according to the requisite
engineering processes of each paradigm.
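One conventional way to quantify the latency side of such a comparison is to wrap the computation under test in CPU-time measurements; the C sketch below uses the standard clock() facility, with a placeholder workload standing in for the algorithm under test (it is not the UPGMA computation itself).

```c
#include <time.h>

/* Placeholder workload standing in for the algorithm under test:
   sums the integers 0 .. n-1. */
long busy_sum(long n) {
    long s = 0;
    for (long i = 0; i < n; i++)
        s += i;
    return s;
}

/* Return the CPU time, in seconds, spent in one call to busy_sum(),
   passing the computed result back through *result. */
double measure_latency(long n, long *result) {
    clock_t start = clock();
    *result = busy_sum(n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

The same wrapper pattern applies whether the measured region is a software routine or a call into a hardware platform's host API, which is how end-to-end latency comparisons between the two paradigms are typically made.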
1.5 Thesis Proposal
The reconfigurable computing paradigm--and the predominant FPGA device
architecture on which such applications can be built--offers us a good medium for
implementing complex computational tasks having high throughput, low latency
requirements. Many computational tasks spread over a range of application domains
have been implemented and evaluated on reconfigurable computing systems [5, 6, 7, 8,
9]. However, different aspects of application architecture and performance must continue
to be explored, while many new and novel computational problems must be implemented
using reconfigurable custom computing machines before a general understanding of the
characteristics of the reconfigurable computing paradigm can be obtained. This would
provide a wider set of configurable computing solutions, as well as patterns for mapping
between high-level problem-solving architecture and lower-level device architectures,
which can be used to assess the cost/benefit ratio for effective and optimal
implementation of more general programming problems on reconfigurable platforms.
Our research thesis involves examining one such data point in the space of
possible application solutions where high-performance computing using reconfigurable
hardware is required for operating on ever-growing data sets. Namely, we are looking at
the Phylogenetics domain that provides us with a rich set of algorithms that can be
studied to see if they can be implemented efficiently on reconfigurable computing
machines to provide orders of magnitude speedup in the algorithm execution over that
available on standard Von Neumann processor architectures programmed using
conventional programming techniques. In this Bioinformatics domain, the Unweighted
Pair-Group Method with Arithmetic means (UPGMA) algorithm used for phylogenetic
tree-reconstruction purposes has certain computational complexity that makes it an
application of specific interest. Furthermore, its existing software implementation is
understood to be highly optimized; that is, it cannot be tuned much further to achieve
significant additional speedup in performance.
1.5.1 Thesis Research Objective and Tasks
It is therefore our objective to explore the space of possible architectures in
custom reconfigurable logic, using FPGA devices as an implementation medium, and
also using conventional custom-logic design processes, to implement a different
“rendition” of the UPGMA algorithm and measure the performance difference. Although
the time complexity of the algorithm is unlikely to change as a result of its
implementation in FPGA custom-logic hardware, we believe that the use of custom-logic
VLSI hardware design techniques should yield up to a two-order-of-magnitude
improvement in the execution speed of the UPGMA algorithm over that employed in the
PHYLIP program written by Felsenstein et al. [10].
The tasks involved in exploring this thesis research work are defined as follows:
1) Select the UPGMA algorithm [11], which performs phylogenetic analysis by building
an evolutionary tree, as our problem domain.
2) Identify and analyze the various complex computational tasks and bottlenecks.
3) Evaluate the issues that we need to address in implementing this algorithm on a
reconfigurable custom-logic architecture.
4) Address various FPGA issues while developing a hardware architecture for the
particular problem algorithm at hand.
5) Implement this design on a FPGA-based reconfigurable architecture and device
platform.
6) Evaluate its performance by measuring its throughput with an increase in the
number of taxa, and benchmark these results against those obtained from a
software program (Felsenstein’s PHYLIP) executing on a conventional CPU-
based system.
The Annapolis WILDCARD™ system has been chosen as the target reconfigurable
platform. The WILDCARD™ FPGA board has a Xilinx Virtex® XCV300E2 as its
processing element, along with two 256K-byte memory units and external I/O
connections. This reconfigurable computing platform was chosen primarily based on
cost and the availability of a reasonable set of platform development tools.
2 Xilinx and Virtex are registered trademarks of Xilinx Inc.
Thus, this thesis will attempt to modify the upper bound of the time complexity,
corresponding to a modification of the time constant associated with the complexity
function for the UPGMA algorithm to achieve orders-of-magnitude speedup, while also
contending with the space complexity associated with the limited amount of device
resources available on the Wildcard platform. In addition, given that we will be moving
data to and from the main computer in which the WILDCARD™ sits, and the
WILDCARD™ PCI/PCMCIA board itself, we will be required to assess the penalties
associated with the communication overhead—with the objective of minimizing this as
much as possible.
CHAPTER 2
BACKGROUND
In this chapter, we provide background on the application domain associated with
the UPGMA algorithm and its context in the space of Bioinformatics computational
problem solving. We also discuss the FPGA device technology, which constitutes the
platform on which we will create a reconfigurable computing solution for the UPGMA
problem.
2.1 Phylogenetics and Tree-reconstruction Methods
The study of the relationships between groups of organisms is called taxonomy,
an ancient and venerable branch of classical biology. The branch of taxonomy that deals
with numerical data such as DNA sequences is known as phylogenetics. Biological
systematists who wanted to reconstruct evolutionary genealogies of species based on
morphological similarities originally developed phylogenetic analysis. The results of
phylogenetic analysis may be depicted as a hierarchical branching diagram, a
"cladogram" or "phylogenetic tree" as shown in Figure 2 [12].
Figure 2: A phylogenetic tree showing a relationship between four species.
2.1.1 Background on trees
The tree represents the genealogical evolution of the different species, linking
them through a certain set of similarities and differences. Similarities and differences
between organisms can be coded as a set of characters, each with two or more alternative
character states. In an alignment of DNA sequences, for example, each aligned site is a
separate character, each with four character states, the four nucleotides being adenine,
thymine, cytosine, and guanine.
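Because each site has only four possible character states, one nucleotide fits in two bits, a representation that is often convenient when mapping sequence data into hardware memories. The C sketch below illustrates one such packing; the particular A=0, C=1, G=2, T=3 assignment is an arbitrary convention for illustration, not necessarily the encoding used in this design.

```c
/* Map one nucleotide character to a 2-bit code; -1 flags an invalid
   input. The A=0, C=1, G=2, T=3 assignment is an arbitrary convention. */
int encode_base(char b) {
    switch (b) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Pack a short DNA string (at most 16 bases, assumed valid) into a
   single 32-bit word, two bits per site, lowest-order bits first. */
unsigned pack_sites(const char *seq) {
    unsigned word = 0;
    for (int i = 0; seq[i] != '\0'; i++)
        word |= (unsigned)encode_base(seq[i]) << (2 * i);
    return word;
}
```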
All the trees are assumed to be binary, meaning that each node branches into two
daughter edges as shown in Figure 2. The edges meet at a branch node, a node being an
endpoint of an edge. Each edge has a certain amount of evolutionary divergence
associated with it, quantified by some distance between sequences. These distances are
referred to as ‘edge lengths’ or ‘branch lengths’. Terminal nodes or leaves correspond to
the observed sequences that might connect up to an ultimate ancestor or ‘root’ of the tree.
A true biological phylogeny has a ‘root’ but only some phylogenetic algorithms provide
information about the location of the root.
For a specific set of n leaves, the nodes and edges of a tree can be counted as
follows: there are (n-1) internal nodes in addition to the n leaves, giving a total of
(2n-1) nodes and one fewer edge, that is (2n-2), discounting the edge above the
root node.
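This counting rule can be checked with a short sketch (illustrative Python, not part of the design flow described in this thesis):

```python
def tree_counts(n):
    """Node and edge counts for a rooted binary tree with n leaves.

    There are (n - 1) internal nodes in addition to the n leaves,
    giving 2n - 1 nodes in total; every node except the root has
    exactly one edge above it, giving 2n - 2 edges."""
    total_nodes = n + (n - 1)       # leaves plus internal nodes
    edges = total_nodes - 1         # discounting the edge above the root
    return total_nodes, edges
```

For the four-species tree of Figure 2, this gives 7 nodes and 6 edges.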
2.1.2 Phylogenetic Algorithms
Phylogenetic algorithms cover three main classes of problems [13]: (1)
parsimony, which is like a vertex coloring problem of graph theory; (2) distance
methods, which aim to find a tree whose path distance matches closely to observed
distances; and, (3) likelihood methods, where the likelihood of the data is
calculated using Markov transition matrices. Each approach possesses certain problems
in terms of the computational bottlenecks that occur.
The advantages of putting a phylogenetic algorithm onto a reconfigurable custom
computing platform include the following: (1) eliminating intervening levels of
software, such as operating systems, that slow down the execution of the code;
and, (2) parallelizing or pipelining the algorithm functions by exploiting the natural
capabilities of custom-logic architecture and design. The latter provides far more work
per cycle than code written in a native instruction set on a general-purpose
microprocessor. As discussed earlier, we believe a speedup of up to two orders of
magnitude should be possible with this approach. Furthermore, bottlenecks within
the algorithms could be avoided by exploiting the underlying hardware resources of
reconfigurable machines to optimize specific parts of the algorithm's execution in ways
that general-purpose machines cannot.
We select a particular phylogenetic distance-method algorithm for this
research, namely the UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
algorithm, whose computational complexities are described below.
2.2 The UPGMA
UPGMA has relevance beyond phylogenetics, since it is a hierarchical clustering
method that is both fast and useful with gene-expression or micro-array data. The
algorithm’s running time complexity is evaluated and compared with that of the
hardware implementation, and the results are presented in Chapter 6. The value of N,
the number of items to be clustered, is typically around 10,000 to 50,000 in micro-array
applications. Thus, even though software-based phylogenetics applications run this
method in about 1 second for N = 100 [15], the running time increases by a factor of
perhaps 10,000 in micro-array applications even before we consider memory bottlenecks.
This last factor causes considerable problems, since memory usage also scales as O(N2).
Thus, this problem might take days
to complete with larger taxa data sets. This algorithm is well understood [11, 14, 15], and
the software solutions have reached a level of optimization beyond which minimal
performance improvement can be obtained. Thus UPGMA is an appropriate candidate
for exploring an implementation on a reconfigurable platform using custom-logic
architecture and design techniques.
2.2.1 Algorithm
We define the distance between two clusters Ci and Cj to be the average distance
between pairs of sequences from each cluster:
dij = (1/(|Ci||Cj|)) Σp∈Ci Σq∈Cj dpq (1)
where |Ci| and |Cj| denote the number of sequences in clusters Ci and Cj, respectively,
and p and q range over the sequences in Ci and Cj, respectively. If Ck is the
union of clusters Ci and Cj, and if Cl is another cluster, then
dkl = (dil|Ci| + djl|Cj|) / (|Ci| + |Cj|) (2)
This is the average-distance calculation for obtaining the distance of the new
cluster Ck to any other cluster Cl.
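The update amounts to taking the size-weighted mean of the two old distances from Ci and Cj to Cl. An illustrative Python sketch (function and parameter names are ours, written size_i and size_j for |Ci| and |Cj|):

```python
def merged_distance(d_il, d_jl, size_i, size_j):
    """UPGMA update: distance from the merged cluster Ck = Ci U Cj
    to another cluster Cl, i.e. the size-weighted mean of d_il and d_jl."""
    return (d_il * size_i + d_jl * size_j) / (size_i + size_j)
```

For example, merging a cluster of three sequences at distance 2.0 from Cl with a singleton at distance 4.0 gives a merged distance of 2.5.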
The distances are represented in the form of a matrix, given below in Figure 3, with
each row or column corresponding to one node. The distance between nodes i and j
is held in position [i, j] of the matrix, so D[i, j] denotes the distance
between nodes i and j.
Figure 3. Distance Matrix
D[i, i] is not a valid distance, since the distance from a node to itself is not
meaningful here. These entries are therefore marked as “x” in the matrix.
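Since the matrix is symmetric and the diagonal is unused, an implementation need only store the strict upper triangle. One possible index mapping is sketched below (illustrative Python; this is not necessarily the memory layout used in our design):

```python
def tri_index(i, j, n):
    """Map a pair (i, j), i != j, into a linear index over the strict
    upper triangle of an n x n symmetric matrix (row-major order,
    diagonal skipped)."""
    if i > j:
        i, j = j, i                      # D is symmetric
    assert i != j, "D[i, i] is not a valid distance"
    # elements in rows above row i: (n-1) + (n-2) + ... + (n-i)
    return i * n - i * (i + 1) // 2 + (j - i - 1)
```

For n = 4 this enumerates the six valid pairs (0,1), (0,2), (0,3), (1,2), (1,3), (2,3) as indices 0 through 5.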
The steps of the UPGMA algorithm are given below [14]:
1. Initialization:
a. Assign each sequence i to its own cluster Ci,
b. Define one leaf of the tree T for each sequence, and place it at height zero
2. Iteration:
a. Determine the two clusters i, j for which dij is minimal. (If there are
several equidistant pairs, pick one at random.)
b. Define a new cluster k by Ck = Ci U Cj, and define dkl for all l by (2).
c. Define a node k with daughter nodes i and j, and place it at height
dij/2.
d. Add k to the current clusters and remove i and j.
3. Termination:
a. When only two clusters i, j remain, place the root at height dij/2.
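The steps above can be sketched in software as follows (an illustrative Python sketch, not the VHDL design; the merge update is the size-weighted mean of equation (2), and cluster bookkeeping is simplified):

```python
import itertools

def upgma(d, n):
    """Minimal UPGMA sketch following steps 1-3 above.

    `d[(i, j)]` with i < j gives the distance between initial
    sequences 0..n-1.  Returns (children, height): children[k] is the
    pair of daughter nodes of internal node k, and height[k] is the
    node's placement height d_ij / 2."""
    d = dict(d)                          # local working copy
    clusters = list(range(n))            # active cluster ids
    size = {i: 1 for i in range(n)}
    height = {i: 0.0 for i in range(n)}
    children = {}
    nxt = n                              # id of the next internal node

    def key(a, b):
        return (a, b) if a < b else (b, a)

    while len(clusters) > 1:
        # Step 2a: pick the pair (i, j) with minimal distance
        i, j = min(itertools.combinations(clusters, 2),
                   key=lambda p: d[key(*p)])
        dij = d[key(i, j)]
        # Steps 2b-2d: form cluster k and update distances by equation (2)
        k = nxt
        nxt += 1
        for l in clusters:
            if l not in (i, j):
                d[key(k, l)] = (d[key(i, l)] * size[i]
                                + d[key(j, l)] * size[j]) / (size[i] + size[j])
        size[k] = size[i] + size[j]
        height[k] = dij / 2.0            # place node k at height d_ij / 2
        children[k] = (i, j)
        clusters = [c for c in clusters if c not in (i, j)] + [k]
    return children, height
```

The final iteration joins the last two clusters, which places the root at height dij/2 exactly as step 3 requires.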
2.2.2 Complexity and Bottlenecks on UPGMA
We believe the UPGMA algorithm has two bottlenecks. The first is in deciding
which of the N(N-1)/2 pairwise distances is minimal at each step of the star-
decomposition clustering. Following this, the data matrix is reduced by dimension 1, due
to clustering of two objects. This introduces the second bottleneck, the need to calculate
an average distance between the two objects (i and j) as a single cluster (k) and all other
objects. This involves complex computational units that are costly on general-purpose
microprocessors, but which we believe can be implemented efficiently on a reconfigurable
custom-logic FPGA device, giving better performance results.
This research examines the function of the UPGMA algorithm, implementing it as
a custom logic architecture. The standard HDL-based design methodology is employed
in that we model the algorithm using the VHDL hardware description language, we
functionally verify the algorithm’s correctness in the custom logic architecture, and then
we synthesize the architecture onto a set of resources to produce a circuit mapped to a
target FPGA device’s component library. The resulting circuit is implemented on a
Xilinx Virtex E® FPGA device and is subjected to functional and performance analysis.
However, before we can present the research method undertaken in this effort
(including the analysis, architecture and design of the circuit implementing the UPGMA
algorithm), we must discuss the characteristics of FPGA devices and their use in
reconfigurable computing that give this research a high chance of success.
2.3 Field Programmable Gate Arrays
The evolution of FPGA devices is evidenced by great strides in the
underlying technology: effective logic gate counts in the millions, and the
ability to download and alter the logic via a programmable bitstream while the FPGA
device is in operation, to name a few. Several companies have been developing high-
performance, high-capacity FPGA devices, targeting larger applications such as those
associated with scientific computing. FPGA vendors such as Xilinx, Actel and Altera,
the largest producers of these devices, hold a leadership position in the market.
Our reconfigurable platform, the Annapolis Wildcard® system, uses a Xilinx® XCV300E
device. The Xilinx FPGA devices have a standard set of device architecture features,
similar to the one shown in Figure 1 in the previous chapter. We describe the
architecture for the Xilinx XCV300E device below.
Figure 4. Structure of the Xilinx® XCV300E device.[16]
Figure 4 provides an architectural overview of the XCV300E device. There are three
main components in the device: (1) the Input/Output Blocks (IOBs); (2) the
Configurable Logic Blocks (CLBs), including the block RAM (BRAM)
memory structures; and, (3) the Programmable Routing Matrix.
2.3.1 Input Output Blocks (IOBs)
The input and output blocks on the device provide an interface between the input
and output pins and the Configurable Logic Blocks (CLBs). The architecture for these
blocks is given in Figure 5. These blocks provide three storage elements that can be used
either as edge-triggered D flip-flops or as level sensitive latches.
Figure 5. Virtex E Input Output Block architecture [16]
2.3.2 Configurable Logic Blocks (CLBs)
The Configurable Logic Blocks provide the functional elements for implementing
logic. The basic building block of the CLB is the Logic Cell (LC). Each Virtex-E CLB
consists of four LCs. The LC consists of a 4-input function generator, carry logic, and a
storage element. The output of the function generator in each LC drives both the CLB
output and the D input of the flip-flop. The architecture for a Virtex-E CLB is given in
Figure 6. The four LCs are organized as two identical slices as shown in the figure.
Figure 6. A Two-Slice Virtex E CLB [16]
The function generators are implemented using Look-Up Tables (LUTs) that can
also be configured as 16x1-bit synchronous RAM. The two LUTs in a slice can be
combined to create a 16x2-bit or 32x1-bit synchronous RAM, or a 16x1 dual-port
synchronous RAM element.
2.3.3 Programmable Routing Matrix
The Virtex-E consists of a General Routing Matrix (GRM) that connects the
CLBs together to implement the logic chains. The GRM comprises an array of routing
switches located at the intersection of the horizontal and vertical routing channels. Each
CLB also has local routing resources through which it connects to the GRM. These local
and global routing resources can be programmed to generate the best routing for the
design being configured onto the device. The Xilinx configuration tools take care of
placing and routing the design onto the device’s resources, subject to user-specified
constraints.
2.3.4 Resources on a Virtex-E chip
The Xilinx Virtex-E resources and their numbers are given below in Table 1:
Resource                        Number
CLBs                            1536
Slices                          3072
LUTs                            6144
Flip-flops                      6144
Block RAMs (256x16-bit)         32
Block RAMs (256x32-bit)         16
Block RAM bits                  131072
Table 1: Virtex-E chip resources
Each CLB has two slices, and there are two LUTs and two flip-flops per slice.
The Block RAM allocations are based on how the LUTs are configured. If they are
configured as two 16x1-bit RAMs, then 32 of the 256x16-bit block RAMs can be
implemented on the device. If two LUTs are configured to form a 32x1-bit RAM,
then 16 of the 256x32-bit RAMs can be implemented on the device.
The Xilinx Virtex-E data manual [16] provides a detailed description of the device
architecture along with pin definitions and electrical characteristics. The Virtex-E device,
with its full complement of resources, provides the designer with a total of 411,955
CMOS transistor gates. This device can thus be used to implement reasonably sizable
designs running at moderately high clock speeds.
2.4 Reconfigurable Computing
The constant improvement in FPGA device density and performance has
prompted many to look at using these devices for implementing high-performance
computing applications. The traditional advantages these devices provide are that they
can be configured and reconfigured easily, and at little extra cost (except when
reconfiguring during application runtime), under direct host program control. The increase in gate
count and speed of these devices has also made them an appropriate target for building
high-performance, custom computing machines. These machines, also referred to as
reconfigurable computing machines, provide flexibility to program and reprogram
systems and at the same time provide high performance computing at a relatively low
cost when compared to price-performance models of other high-performance platforms,
such as supercomputers [2, 5, 6, 7, 8, 9]. Several computing platforms consisting of
arrays of FPGA devices have been developed through research and experimentation and
are currently commercially available in the market.
The DEC Paris Research Laboratory’s Programmable Active Memories (PAM)
project was one of the pioneers of reconfigurable computing [1]. The PRL team
implemented the RSA encryption algorithm at speeds never before achieved,
outperforming supercomputers and even custom discrete-IC implementations of that time.
SPLASH and Splash 2 are two other reconfigurable architectures developed in the
early nineties—Splash 2 being an upgrade of the original SPLASH architecture [17]. The
Splash 2 consisted of 16 printed circuit boards, each carrying 17 Xilinx XC4000-series
FPGA chips. Each XC4000 had its own memory banks, to which it could
independently read and write. A number of high-performance scientific applications
were implemented on Splash 2, in domains such as gene sequence matching,
fingerprint matching and image processing, at speeds two orders of magnitude greater
than those of the fastest supercomputers at that time [17].
Several companies have brought commercial platforms to market over the past
few years, attempting to exploit this new computing model. Annapolis Microsystems
[18], SRC Computers Inc [19], and Star Bridge Systems [20] are three of the most
prominent players in this market. These new reconfigurable architectures are being
marketed as platforms that can be used for implementing a wide range of applications
from different domains. The Annapolis WILDCARDTM reconfigurable platform that we
are using in our research is one of these, albeit a low-end version.
The research described in this thesis examines the architecture, design and
implementation of the computationally intensive UPGMA algorithm on a low-end
reconfigurable platform and evaluates the performance as contrasted with that obtained
by an implementation of the same algorithm using conventional software program
execution on a standard Intel® CPU-based personal computer.
As discussed, we have chosen the UPGMA phylogenetics algorithm as the
application domain in which we will explore the architecture and design space, and
subsequent performance differences, of applying the reconfigurable computing paradigm
to this scientific computing problem. Phylogenetics provides us with a rich variety of
problems with complex computational tasks that can be studied to see if they can be
implemented on reconfigurable machines. Furthermore, the software domain has already
been thoroughly explored, and few performance gains can be realized from further
software optimization of the UPGMA algorithm in particular.
With this rationale clearly in mind, we progress to our discussion of the problem-
solving and analysis of the UPGMA domain to derive a suitable high-level architecture
with which to implement the algorithm. In addition, we will need to weigh our
architecture against the resource and timing constraints of the underlying Xilinx device
and the WILDCARDTM platform (including its mechanisms for interfacing with the
PC-based host system in which it resides).
CHAPTER 3
DISCUSSION OF THE WILDCARD CUSTOM COMPUTING PLATFORM
In this chapter, we discuss the reconfigurable computing platform available to us
for purposes of this research. We had to analyze this platform thoroughly in order to
understand its operating environment, its programming model, and its key features and
constraints. All of this was required before devising an architecture for our UPGMA
solution, because any such architecture would be constrained both by the resources of
the Xilinx device resident on the WILDCARD and by the programming model and
execution environment provided by the vendor for realizing our solution.
3.1 The Annapolis WILDCARDTM System
The WILDCARDTM system comes as a PC card that plugs into a PCI/PCMCIA
card slot adapter, making it a very portable low-end reconfigurable platform. It has a
very compact architecture, with a single Xilinx Virtex XCV300E processing element
(PE) and two independent memory modules, one on either side, forming the core of
the system. The architectural block diagram is given in Figure 7 below.
Figure 7.The WILDCARDTM Platform Block Diagram [18]
Each of the two memory blocks, referred to as the Right and Left memory banks,
is a 64K x 32-bit RAM module, with a 19-bit address bus and a 32-bit data word. The PE
can write and read from the right and left memories independently. The host interface is
through a 32-bit CardBus (PCMCIA) controller that operates at a 33 MHz clock
frequency. The CardBus controller interfaces with the PC host through the PCI Bus
interface, and with the PE through the LAD Bus interface.
Data transfers to and from the PC host are performed through a set of C
driver calls that interface with the CardBus controller which, in turn, interfaces with
the LAD Bus to send data to, and retrieve data from, the PE. Data can be written from
the host to the memory through these interfaces by making the C calls
provided by the vendor's Host Application Programming Interface (API).
The PE also has certain input and output pin connections that enable it to connect
to external devices. These pins are useful when the application program must
communicate with an external device.
The WILDCARDTM board has a frequency synthesizer that generates one main
global clock signal, F_Clk, which is used to derive three other global clocks, namely
P_Clk, M_Clk, and K_Clk. The user can set the frequency of F_Clk using a C
routine call from the host. P_Clk is the PE clock pad signal, and runs at half the
frequency set for F_Clk by the user. M_Clk is the memory clock pad signal and
operates at the same frequency that is set by the user for F_Clk. Finally, K_Clk is the
CardBus/LAD Bus clock pad signal, which always operates at 33 MHz.
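The clock relationships described above can be summarized as follows (an illustrative Python sketch based solely on the relationships stated here; frequencies are in MHz, and the synthesizer's actual frequency limits are given in the vendor documentation):

```python
def derived_clocks(f_clk_mhz):
    """Derive the WILDCARD global clocks from the user-set F_Clk."""
    return {
        "P_Clk": f_clk_mhz / 2.0,   # PE clock: half of F_Clk
        "M_Clk": f_clk_mhz,         # memory clock: same as F_Clk
        "K_Clk": 33.0,              # CardBus/LAD Bus clock: fixed 33 MHz
    }
```

For example, setting F_Clk to 66 MHz yields a 33 MHz PE clock and a 66 MHz memory clock.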
3.2 The WILDCARDTM System VHDL model
The WILDCARDTM system software package provides VHDL models for the
whole board that can be used to create a VHDL-based program model and to implement
and debug the whole reconfigurable application design. The VHDL model of the system
also contains a simulation model of the host system that is used for testing the application
from the perspective of custom computing hardware-software co-design.
The VHDL model provides interface components that are used to access all the
components on the WILDCARDTM system. There are two types of interface components,
namely, the Standard Interfaces, and the Mux Interfaces [18].
Standard interfaces are simple interfaces to the devices (PE, memories) on the
system and can be used for low-level, specifically tuned applications. The Mux (or
multiplexing) interfaces can be used for programming at a higher level between
the LAD Bus, memory and the PE components. Both of these interfaces allow multiple
user application components to share a single resource (such as the LAD Bus or the PE’s
memory banks).
The development environment provides VHDL models for the following platform
components, which are used for early model integration, hardware-software partitioning
analysis, and functional verification and clock cycle-level timing analysis.
Processing element (PE)
Right Memory Bank
Left Memory Bank
Host
Clock Generation
Input and Output Connectors
PCI controllers
The PE VHDL model is a standard VHDL entity-architecture pair. The entity defines
the input and output pads of the PE device. The pad numbers are logical and do not
match the physical pin numbers of the FPGA die. This entity definition is fixed and
is used as a template for the physical PE while creating the application design. In our
preparation of these components for exploring the space of possible architectures for the
UPGMA algorithm, we do the following: the PE architecture template is modified to
embed the application design within it. Furthermore, the Standard and
Mux interface models are used to connect the application design
to the LAD Bus and the memory banks. Finally, this allows us to take the
resultant composite PE model and synthesize the PE image for actually configuring the
WILDCARD device.
The VHDL models provided for the memory banks, host, clock generation, I/O
connectors and the PCI controllers are purely for simulation purposes. These models are
used within the WILDCARDTM simulation model and encapsulate the system’s
functionality for use in VHDL simulation, enabling us to functionally verify the PE
designs, as well as validate the correctness of the UPGMA algorithm, before synthesizing
the actual design units and placing and routing them onto the Virtex device resident on
the WILDCARD.
3.3 WILDCARDTM Host Programming
The WILDCARDTM system is composed of three main components, listed as
follows: (1) the WILDCARDTM board; (2) the WILDCARDTM device driver; and, (3) the
Host Application Programming Interface (API). The WILDCARDTM Software Design
Hierarchy is given in Figure 8. Host programming is done in the C language using the
standard Host API routines to communicate with the WILDCARDTM board through the
Windows®-based device driver.
The device driver provides a low-level hardware interface to the WILDCARDTM
board. When the driver is invoked with the appropriate function codes, it
initializes the WILDCARDTM in a sequence of steps, reading its configuration and
establishing handler interfaces for memory, interrupts and DMA operations. The
WILDCARDTM API presents a generalized view of the hardware resources and control
operations. The following operations are performed by calling the API routines:
Opening and Closing the WILDCARDTM board
Clock control (frequency)
Processing Element control (program, reset, register space)
Memory Interfaces (read/write)
Interrupt control (PE/FIFO enable/disable)
The C function routines for each of the above operations are discussed below.
Figure 8. The WILDCARDTM Software Design Hierarchy [18].
3.3.1 Opening and Closing the WILDCARDTM board
The host program first makes an “open” call to the WILDCARDTM board before
performing any other operations. This initializes the device driver, which, in turn,
initiates the interface handlers for access to the board components. The C routine for this
is WC_Open( ). The counterpart for the WC_Open( ) function is the WC_Close( ). For
every WC_Open( ) there should be a corresponding WC_Close( ) function call to ensure
a clear disconnect and proper de-allocation of resources.
3.3.2 Clock Control
The only clock control operation a host program can perform is setting the
frequency of F_Clk. The function call for this is WC_SetClkFrequency( ). The
programmable clock module allows user programs to change the clock frequency
anytime by calling this routine.
3.3.3 Processing Element and Interrupt Control
The four main operations that are executed against the Processing Element (PE)
are: (1) the PE Reset; (2) the PE Program; (3) the PE Register space read and write;
and, (4) the PE Interrupt control.
The PE Reset operation is used to reset the PE and the embedded application
residing on it. The PE program function calls are used to program the PE device with the
user-designed application. There are two function calls: (a) PE_ProgramFromBuffer( ),
which is used to program the PE from a user buffer space; and, (b) PE_Program( ), which
is used to program the PE from a file.
The PE has a certain register space to which we can read and write. The
register space has an address range of 0x04000 to 0x0FFFF. The two function calls to
read and write to this register space are: (a) PE_RegRead( ) - reads from the PE register
space locations; and (b) PE_RegWrite( ) – writes to the PE register space locations.
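A simple bounds check on this address range can be sketched as follows (illustrative Python; the constant names are ours, not part of the vendor API):

```python
PE_REG_BASE = 0x04000   # first PE register address, per the text above
PE_REG_TOP = 0x0FFFF    # last PE register address

def in_pe_register_space(addr):
    """True if `addr` falls inside the PE register address range."""
    return PE_REG_BASE <= addr <= PE_REG_TOP
```

The range spans 0xC000 (49,152) addresses in total.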
When a PE interrupt occurs, the device driver immediately masks the PE interrupt line
and informs the calling program suspended on the API call that an interrupt has
occurred. The interrupt control is done using the following functions [18]:
WC_IntQueryStatus( ) – checks the status of the PE interrupt line via polling.
This is useful when the host program is written to do other operations while
waiting for the interrupt.
WC_IntWait( ) – waits for the PE interrupts; useful when the host only needs to
wait for the PE interrupt before proceeding to perform anything else. The calling
program is suspended.
WC_IntReset( ) – after the host program has processed the interrupt, it can reset it
and clear the API’s indication of a pending interrupt.
3.3.4 Memory Control
There are two main memory control API calls that are made by the host C application
program. They are: (1) WC_MemRead( ), which reads from the right or the left memory
SRAM banks; and, (2) WC_MemWrite( ), which writes to the right or the left memory
SRAM banks. The calling arguments include the memory bank identifier, the base
address and size of the block of data (that is, the number of DWORDS) to be written or
read [17]. The function calls invoke the device driver that, in turn, manages handlers to
transfer data over the PCI bus and the CardBus/LAD Bus interface to the
WILDCARDTM.
3.4 PE Embedded Application Initialization
The host program must proceed through a set of steps for initializing the application
in the PE device before the reconfigurable application can be started. The steps are as
follows:
Open the WILDCARDTM board;
Initialize the clock by setting it to a particular frequency;
Enable the PE Reset line – ensures the PE is reset at least once when the clock
starts;
Disable and clear any pending interrupts left hanging by any previous
applications;
Load the PE image by calling the PE_program API routine;
Execute any additional initialization tasks as necessary, such as for enabling PE
interrupts; and,
Disable the reset lines to allow normal operation of the PE.
Now the downloaded UPGMA application is running on the WILDCARDTM
processing element, and the host program can start its portion of the application
processing activity, which consists of transferring taxa and phylogenetic tree data to and
from the offloaded UPGMA algorithm application.
CHAPTER 4
CUSTOM COMPUTING DESIGN OF UPGMA
In this chapter, we present the design methodology employed to create the custom
computing application that offloads the UPGMA algorithm from the Host onto the
WILDCARD board for accelerated processing of taxa data.
4.1 VLSI Design Flow
New VLSI design methodologies have emerged every few years. Hardware
description languages (HDLs) and EDA tools have made it possible to design VLSI
systems at higher levels of abstraction, giving chip designers the capability of describing
the functionality of a design at a more abstract level of representation than the gate level.
VHDL and Verilog are the two HDLs most widely used in industry for hardware
modeling and implementation.
Figure 9 shows a HDL-based design process model representing the various design
activities. A brief description of each of the activities in the design process model is given
below:
System specification is an activity to abstract design information from a problem
statement and to define the interface and timing waveforms of the system.
System partitioning is an activity to hierarchically decompose a system to handle
complexity based on system specification, design resource, and feasibility of
implementation. Components at the final hierarchical level should facilitate
behavioral modeling using HDLs or allow for HDL component reuse. The output of
this activity is a valid system partition.
Figure 9. An HDL-based design process model[21]
Modeling or adaptation involves capturing a design component in HDL with
high-level timing information and data dependencies or adapting reusable HDL
components from a design library.
Component simulation verifies functional behavior and high-level timing of each
component using HDL test benches or cycle-based simulation techniques.
System binding is structural integration of simulated components based on the
system partition. This activity produces a system model for verifying system
specification.
System simulation verifies system behavior and timing using HDL test benches or
cycle based simulation techniques. This activity produces simulation results that
can be verified with system specification.
Logic synthesis is the activity of obtaining a gate-level netlist using an automated
synthesis tool. It involves removal of timing information and non-synthesizable
constructs, technology mapping, and definition of area and timing constraints.
The target ASIC library and constraints are chosen to comply with the system
specification.
4.2 UPGMA Project Design Flow
The design methodology described above forms the basis for building the
application design. The UPGMA algorithm undertaken for this project forms the
problem statement. It was analyzed and a system specification generated. The design was
partitioned into easily manageable blocks and modeling of each of these modules was
done in VHDL. The top-level hierarchical model is structural and merely connects the
different sub-modules to form the final top-level design. We have employed a certain
amount of adaptation by reusing Xilinx cores to implement certain modules in the design.
This was done mainly to ensure that the design used no more resources than were available
in the Virtex XCV300E chip of the WILDCARDTM board.
Component simulation was conducted on each of the sub-modules to verify
their functionality. This step eases the final system-level simulation process, as errors
within the sub-modules have been removed by then. The final top-level structural model
was then written, and system simulation was conducted to verify the functionality of the
design as a whole. The ModelSim simulation environment was used for debugging and
testing purposes.
Logic synthesis was conducted mainly for identifying the critical paths and
finding the resource usage of the design. The Synplify Pro® 7.3 FPGA synthesis tool was
used for this purpose. The synthesized gate level netlist does not entirely form the final
design being implemented on the Processing Element (PE) of the WILDCARDTM system.
The functionally verified design was embedded within the PE VHDL architecture model.
The PE model was then placed within the WILDCARDTM system simulation model and
the final functional testing was conducted. The verified PE model was then synthesized
and the EDIF netlist generated. The EDIF netlist was then placed and routed using a
make file provided by the WILDCARDTM system. The make file invokes the
Xilinx place-and-route tools to generate the final PE image that is used to configure the
device. This image is then used to proceed with the WILDCARDTM host programming
process described in Chapter 3.
4.3 UPGMA Design
The design architecture was formulated keeping in mind the parameters that
govern the data sizes upon which the design operates. The data bit-width
constrains the bit widths of registers, the bit widths of datapath elements, and the memory
requirements.
We first look at defining these parameters and then move towards describing the
design architecture.
4.3.1 Design Parameters
For a taxa size of n:
The number of nodes that form the final tree is n + (n-1) = 2n-1.
The number of distance values that need to be stored is
o (n(n-1) + (n-1)(n-2))/2
o For n nodes there are n(n-1)/2 pairings, thus making the number
of initial nodal distances n(n-1)/2.
o When an internal node is formed, the number of nodes still left to be
connected reduces by 2 for the very first internal node and then by 1
thereafter. Initially there are n external nodes. When the first internal node
is formed by joining two external nodes, the number of nodes left is n-2.
So we need to compute the distance of the new internal node to n-2
different nodes. Thereafter, for each internal node formed we have a
reduction of 1 node, making the number of nodes to be connected n-3,
n-4 and so on. The total number of distances for internal nodes is thus
(n-2) + (n-3) + (n-4) + ... + 1, which is equal to (n-1)(n-2)/2.
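These counts can be verified by replaying the clustering (an illustrative Python sketch): the initial n(n-1)/2 pairwise distances, plus the arithmetic sum (n-2) + (n-3) + ... + 1 = (n-1)(n-2)/2 distances for the internal nodes, give (n-1)^2 values in total.

```python
def distance_counts(n):
    """Count the distance values UPGMA produces for n taxa:
    the initial pairwise distances, plus the distances computed
    for each newly formed internal node."""
    initial = n * (n - 1) // 2
    new = 0
    active = n                 # number of clusters still to be joined
    while active > 2:          # the final merge (the root) adds no distances
        new += active - 2      # distances from the new cluster to the rest
        active -= 1            # two clusters removed, one cluster added
    return initial, new
```

For n = 4 this gives 6 initial distances plus 2 + 1 = 3 internal-node distances, or 9 = (4-1)^2 values in total.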
We define below the data structures used to represent the input, intermediate and
output data. There are four types of data upon which the algorithm operates:
Distance data – A 32-bit data word is used to represent this value. The
distance data is stored in the left memory bank, which has a 32-bit data width;
hence the choice of a 32-bit word for representing the input distance data.
Nodes – The nodes that form the tree are represented as 10 bits each. The choice
of 10 bits was made because the design was initially targeted at a 512-taxa
dataset, which yields at most 1023 nodes.
Heights – Each node has a height associated with it, indicating the number of
leaves beneath it. This value is represented in 16-bit format. A 512-taxa
tree would have a root with the largest height of 512. A 9-bit size was
selected in earlier versions of the model but was changed to 16 bits when the
16-bit Block RAMs were chosen to implement the Height memory that is used to
store the heights of the nodes.
Tree Output Data – This data represents each node in a tree together with its
parent node and the branch length to its parent. The format used is given below:
Node ID – 10 bits | Parent ID – 10 bits | Branch length – 12 bits
This format is used to connect up the nodes while generating the final tree. The
total bit length is 32 bits. The tree output data is stored in the right memory
bank, which has a data word length of 32 bits.
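The packing of the 32-bit tree-output word can be sketched in C as below. The field ordering (node ID in the high bits) is an assumption made for illustration, since the text specifies only the field widths:

```c
#include <stdint.h>

/* Pack a tree-output word: Node ID (10 bits), Parent ID (10 bits),
   Branch length (12 bits) = 32 bits total.
   Field placement is assumed, not taken from the VHDL model. */
uint32_t pack_tree_word(uint32_t node_id, uint32_t parent_id,
                        uint32_t branch_len) {
    return ((node_id & 0x3FFu) << 22) |
           ((parent_id & 0x3FFu) << 12) |
           (branch_len & 0xFFFu);
}

uint32_t node_id_of(uint32_t w)    { return (w >> 22) & 0x3FFu; }
uint32_t parent_id_of(uint32_t w)  { return (w >> 12) & 0x3FFu; }
uint32_t branch_len_of(uint32_t w) { return w & 0xFFFu; }
```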
4.3.2 Design Datapath
The basic datapath of the design, used for calculating the average distances
and for obtaining the minima, is given below in Figure 10. The datapath is broken into
two parts: one used for finding the least distance, or minima, and the other for
calculating the average distances. The datapath on the left of Figure 10, with the
less-than operator, is used to find the minima. It takes the distance value and the
current minima as inputs. The current minima is stored in a register and is fed back
into the comparator. The second datapath, on the right, is used to calculate the
average distance. It has the distance value d_ik and the height of node i as inputs.
The multiplier obtains the product of this height and the distance and sends it to the
adder, which adds the value to an accumulator. The multiplier and adder together
compute the numerator of the average distance equation (2) given in Chapter 2. While
the multiplier and accumulator are computing the numerator, the second adder, which
takes the heights h_i and h_j as inputs, computes the denominator of equation (2).
When these two are computed, the resulting values given below are sent to the divider
to obtain the average distance.
Numerator = d_ik·h_i + d_jk·h_j
Denominator = h_i + h_j
Average distance = (d_ik·h_i + d_jk·h_j) / (h_i + h_j)
Figure 10. Design Datapath
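The datapath's arithmetic can be mirrored in software. The sketch below is an illustrative model rather than the VHDL itself; it uses truncating integer division, as a hardware divider would:

```c
#include <stdint.h>

/* Average distance of a new cluster {i, j} to node k:
   (d_ik*h_i + d_jk*h_j) / (h_i + h_j), with truncating division
   as in the hardware divider. */
uint32_t average_distance(uint32_t d_ik, uint32_t h_i,
                          uint32_t d_jk, uint32_t h_j) {
    uint32_t numerator   = d_ik * h_i + d_jk * h_j; /* multiplier + accumulator */
    uint32_t denominator = h_i + h_j;               /* height adder */
    return numerator / denominator;                 /* divider */
}
```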
4.3.3 Design Architecture
The architecture of the design is given in the block diagram shown in Figure 11. The
architecture shown is the final one created after several design passes. The two main
components that form the backbone of the design are:
The Controller
The Address Generator
Most of the design effort went into modeling these two components, since the
controller executes the UPGMA algorithm and the address generation
forms the core of the algorithm's process. The description of these two components,
along with that of the other sub-components, is given below. We start with the simpler
components and proceed to the more complex ones: the datapath components are
dealt with first, then the control path, and finally the memory modules.
4.3.1 Adder
The Adder is modeled using the simple VHDL “+” operator. It has two data
inputs and one output, each 32 bits wide.
The basic architecture of the 32-bit Adder would be realized using the Ripple-
Carry design. This style of Adder architecture has its carry chain as its critical path;
however, for this bit-width, the tradeoff in area versus speed was not significant enough
to warrant exploring other, more sophisticated Adder architectures.
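A ripple-carry adder can be modeled bit by bit in software to see why the carry chain is the critical path: each bit's sum waits on the carry out of the bit below it. This C sketch is a behavioral illustration, not the VHDL model:

```c
#include <stdint.h>

/* Behavioral model of a 32-bit ripple-carry adder: each full adder
   consumes the carry produced by the previous bit position. */
uint32_t ripple_carry_add(uint32_t a, uint32_t b) {
    uint32_t sum = 0, carry = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t ai = (a >> i) & 1u;
        uint32_t bi = (b >> i) & 1u;
        sum |= (ai ^ bi ^ carry) << i;                   /* sum bit */
        carry = (ai & bi) | (ai & carry) | (bi & carry); /* carry out */
    }
    return sum; /* carry out of bit 31 is dropped, as in a 32-bit result */
}
```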
[Figure 11 block diagram: the Controller and the Address Generator drive the
datapath units (Multiplier, Adder, Height Adder, Comparator, Divider and their
registers), the Height Memory, and the address, read and write signals to the left
and right memory banks.]
Figure 11. Block Diagram of UPGMA Architecture.
4.3.2 Add Register
The Add register stores the output value of the Adder module. The register value
is fed back into the Adder so that these two components work together as an accumulator.
The Adder input is added with the old value stored in the Add register and the cumulative
value is stored into the register. The final value of the Add register forms the numerator
of the Average distance equation given in Chapter 2. The Controller module sets the
enable and clear signals for the register.
4.3.3 Height Adder
The Height Adder is similar in architecture to the basic Adder and has two inputs
and one output, each of which is 16 bits wide. This module is used to add the heights of
the nodes of the tree.
4.3.4 Height Register
The architecture of the Height Register is similar to the Add Register, except that
it is 16-bits wide. It stores the output value of the Height Adder and its value is fed back
into the Height Adder so that they work together as an accumulator. The Height
Register’s final cumulative value forms the denominator in the average distance equation
given in Chapter 2.
4.3.5 Multiplier
The multiplier is a simple 32-bit multiplier modeled using the standard VHDL “*”
operator. The multiplier has two inputs and one output, each 32 bits wide.
4.3.6 Multiplier Register
The Multiplier register stores the Multiplier module’s output. The register
component’s architecture and pin configuration is the same as that for the Add Register.
From the standpoint of the VHDL model, a single generic entity-architecture description
is employed for all the 32-bit register units.
4.3.7 Divider Unit
The Divider was modeled using the shift-subtract division algorithm. The divider
forms the critical path of the design, and the controller coordinates its operation to ensure
that the design runs at the requisite clock rate. The Divider has two 32-bit inputs and has
as outputs a 32-bit quotient and 1-bit “valid” flag.
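The shift-subtract (restoring) division scheme can be sketched in C as below. This mirrors the algorithm's structure, shifting in one dividend bit per step and subtracting when the partial remainder covers the divisor, rather than the exact VHDL:

```c
#include <stdint.h>

/* Restoring shift-subtract division: one quotient bit per iteration.
   Returns the 32-bit quotient; a zero divisor yields all-ones. */
uint32_t shift_subtract_divide(uint32_t dividend, uint32_t divisor) {
    if (divisor == 0)
        return 0xFFFFFFFFu;      /* guard; the hardware would flag this */
    uint64_t remainder = 0;
    uint32_t quotient = 0;
    for (int i = 31; i >= 0; i--) {
        remainder = (remainder << 1) | ((dividend >> i) & 1u); /* shift in bit */
        if (remainder >= divisor) {                            /* trial subtract */
            remainder -= divisor;
            quotient |= 1u << i;                               /* set quotient bit */
        }
    }
    return quotient;
}
```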
4.3.8 Divider Register
The Divider register simply stores the output value of the Divider unit. Its
architecture and pin configuration are similar to that of the Add and Multiplier registers.
The controller handles the enable and clear signals.
4.3.9 Comparator
The Comparator is used to compare the distance values and find the minima. The
comparator is modeled using a simple VHDL relational operator. It compares the
new distance value with the previous minima stored in the Least Distance Register;
when it finds a new minima, it enables the Least Distance Register, the Row Register
and the Column Register.
4.3.10 Least Distance Register
The Least Distance Register stores the current minima while the algorithm
continues to search through all the distances. The register architecture is
similar to that of the generic 32-bit register, with the clear signal being set by the
controller and the enable signal being set by the comparator.
4.3.11 Row and Column Registers
The Row and the Column Registers store the Row and the Column values of the
current minima in the distance matrix. Each distance matrix is accessed through the row
and column values. These two values are used to obtain other distance values while
calculating the average distance value. The comparator sets the enable signal and the
controller sets the clear signal controlling these registers.
4.3.12 Controller
The Controller forms the core of the design, making its behavior one of the most
complex to model. Its operation is based on the processing steps of the UPGMA
algorithm. The steps for the controller are given in Figure 12, below.
Figure 12. The Controller Algorithm.
The three main operations being performed by the controller for every single pass
through the matrix are as follows:
Find the new minima
Compute the Average distance
Reduce the matrix size
The controller accomplishes this by stepping through a set of states and repeating the
process until all the nodes in the tree have been handled.
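One pass of the controller can be sketched behaviorally in C as below. This is an illustrative software model under simplifying assumptions (a full in-memory matrix, an `active` flag per node, and the merged cluster reusing the row node's slot), not a transcription of the controller's state machine:

```c
#include <stdint.h>

#define MAXN 16 /* illustrative bound, not the hardware's taxa limit */

/* One controller pass: find the minimum distance, compute the weighted
   average distances to the new cluster, and reduce the matrix. Uses the
   upper triangle of D; the new cluster reuses slot `row`. */
void upgma_pass(uint32_t D[MAXN][MAXN], uint32_t H[MAXN], int active[MAXN],
                int n, int *row_out, int *col_out) {
    uint32_t least = UINT32_MAX;
    int row = 0, col = 1;
    /* 1. Find the new minima */
    for (int i = 0; i < n; i++) {
        if (!active[i]) continue;
        for (int j = i + 1; j < n; j++)
            if (active[j] && D[i][j] < least) {
                least = D[i][j]; row = i; col = j;
            }
    }
    /* 2. Compute the average distance to every remaining node */
    for (int k = 0; k < n; k++) {
        if (!active[k] || k == row || k == col) continue;
        uint32_t dxk = (row < k) ? D[row][k] : D[k][row];
        uint32_t dyk = (col < k) ? D[col][k] : D[k][col];
        uint32_t avg = (H[row] * dxk + H[col] * dyk) / (H[row] + H[col]);
        if (row < k) D[row][k] = avg; else D[k][row] = avg;
    }
    /* 3. Reduce the matrix size: retire the column node, merge heights */
    H[row] += H[col];
    active[col] = 0;
    *row_out = row; *col_out = col;
}
```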
4.3.13 Counter Units
The counters form some of the sub components of the address generator block.
The counters are used to select the next address to be generated. The generic block
diagram for one of the counters is given below. Each counter counts up until the count
value becomes equal to the compare value (CV), at which point it sets a "great" flag,
indicating that the count has been exceeded, and is reset back to zero. The
controller sets the Increment and Clear signals.
4.3.14 Multiplexers
The multiplexers enable the address generator in selecting the next address. We
have 2:1 and 3:1 multiplexer architectures for this purpose.
4.3.15 Address Generator
The address generator is a module that underwent several design cycles. The
address generation algorithm is not simple when considered from a hardware
perspective. Address generation is performed for two operations in the design
process:
Finding distance minima in an instance of the matrix; and,
Calculating average distance
For finding the minima, the address is generated for fetching the next distance value
from the memory. The address is generated in the format of “row&column”, with the
row and column values concatenated together to represent the actual memory address.
The row and column values represent the row and column of the node-to-node distance
matrix, with each row or column representing a node.
For example, node 1 to node 2 distance can be fetched by concatenating
“0000000001” with “0000000010” to generate the 20-bit address
“00000000010000000010” before actually accessing that memory location. These two
values are obtained by reading a memory that stores the currently active node values.
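The concatenated address generation can be expressed compactly in C; the sketch below reproduces the node 1/node 2 example from the text:

```c
#include <stdint.h>

/* Form the 20-bit "row & column" address by concatenating two
   10-bit node IDs. */
uint32_t concat_address(uint32_t node1, uint32_t node2) {
    return ((node1 & 0x3FFu) << 10) | (node2 & 0x3FFu);
}
```

concat_address(1, 2) yields binary 00000000010000000010, i.e. 1026.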
The earlier version of this memory structure, referred to as "node memory," was
modeled behaviorally and later modified to reduce the size of the module. The earlier
model took more resources than were available in one full Virtex XCV300E chip.
The model was later modified by implementing the node memory using a 256 x 32-bit
Xilinx Block RAM, which uses a single Block RAM resource on the Virtex
XCV300E chip. This reduced the resource consumption significantly, making the
module work much more efficiently.
The Xilinx Block RAM is a dual-port memory structure [16], so that two
different addresses can be written or read, or read and written, in combination at the
same time. This enables us to read two different node values at the same time in order
to generate the concatenated address within one clock cycle. The block diagram of the
dual-port Block RAM is given above.
The counters, discussed earlier, are used to generate the address values, ‘addra’ and
‘addrb’, for selecting the nodes to generate the next address. After all the distances are
read and compared, the minima is stored in the Least Distance Register.
The average distance calculation needs the address to be generated by selecting nodes
that have not yet been joined into a cluster. The nodes that have currently been selected
to form the new cluster are stored in the Row and Column registers. The average distance
equation is given below.
Avg. Distance D(x,y),i = (H_x·D_xi + H_y·D_yi) / (H_x + H_y)
The x, y are the new nodes selected to form the new cluster, i is the node to which
the distance from the new cluster is being calculated, Hx, Hy are the Heights of nodes x
and y respectively, and Dxi, Dyi are the distance of nodes x and y to the node i,
respectively. The address generation for calculating the average distance is done in steps
outlined as follows:
The address for obtaining Dxi is generated first by selecting node i’s value and
concatenating with node x’s value stored in the Row Register;
The address for obtaining Dyi is generated second by selecting node i’s value
again, and concatenating with node y’s value stored in the Column Register;
These two distances are combined, using the equation above, to calculate the
average distance D(x,y),i; and,
This new distance is then stored into the memory for future reference. The new
cluster forms a new node, let us say j, and the average distance calculated
represents the distance of this cluster to node i. Thus, the address for storing this
distance is cluster j's value concatenated with node i's value.
The above four steps are performed for all of the nodes that have not yet been
connected to the tree. After the average distances of all the nodes have been calculated,
the node memory is updated by removing the nodes that have been selected to form the
new cluster. The complexity of the process lies in maintaining the node memory and
in stepping through the selection of its addresses to obtain the next address.
4.3.16 Output Generator
This module selects the output values that form the output tree data written into
the memory. The outputs that this module produces are listed as follows:
Node type – internal or external leaf node;
Node ID- the value of the node;
Parent ID- the value of the parent of the node; and,
Branch distance- the distance of the node to its parent.
4.3.17 Height memory
The Height memory holds the heights of the nodes in the tree. The external, or
leaf, nodes have a height of one, while internal nodes, those with child nodes,
have heights of two or more. The memory architecture in the earlier version of
this design was implemented behaviorally and the post synthesis results yielded high
resource usage and slow performance. Thus, after a design review, this module was
implemented using four 256x16-bit Xilinx block RAMs and additional write and read
logic associated to control the Block RAM accesses. The module now uses only four
Block RAM resources on the Virtex XCV300E chip.
4.3.18 Off-chip Memory Banks
The right and left memory banks on the WILDCARDTM are used for storing the
distance and tree data respectively. The banks are 64Kx32-bit SRAM modules, where
access to them is managed through interface modules provided by the WILDCARDTM
system. These components are available as VHDL models that can be used depending on
the needs of the application. To write to the left and right memories, we use the
interface components provided for these two banks. The application on the PE sends a
read or write request through these interface components. The components allow
multiprocessing, such that multiple applications can read and write to the memory at
the same time. The read and write requests are prioritized, and the designer can choose
the kind of prioritization used. This feature has not been used, as we have only one
application design running within the PE that reads and writes to the memories.
The read and write operations take a certain number of clock cycles to complete.
Figures 13 and 14 give the timing diagrams for reads from and writes to the memory.
The read cycle is such that the first data word arrives 4 clock cycles after the read
signal is set.
Figure 13. Typical read cycle from memory [18]
Figure 14. Typical write cycle to memory [18]
As we can see from Figure 14, a write takes only one clock cycle to complete.
The 4-clock-cycle read latency requires the controller to wait until the first data
arrives before performing the datapath operations. The memory read thus introduces
latency into the design and certainly affects the performance. For larger taxa sets, this
latency grows and drastically affects the speed of the design.
4.3.19 Addressing Schemes
The WILDCARDTM memory banks each have a maximum capacity of 65536
words of 32 bits. This small memory capacity not only limits the number of taxa
that can be handled on this system but also affects the way the memory is
addressed. The addressing scheme discussed in Section 4.3.15 is not effective when the
number of taxa increases to, say, 256. The scheme uses the following methodology. Let
us assume we have the distance matrix given in Figure 15 below.
Figure 15. Distance Matrix
The distances 6, 8, and 3 are addressed by D[0, 1] D[0, 2], D[0, 3]. Thus, while
writing data from the host, we place these three values at indexes 1, 2, 3 of the array and
transfer the data to WILDCARDTM memory. These values would be written in addresses
1, 2, and 3 of the memory. Thus value 6 in address 1 of the memory can be referenced
using the address “00000000000000000001” obtained by concatenating the nodes
“0000000000” and “0000000001” together, and similarly for locations 2 and 3.
Now, distance 7 is D[1, 2] and thus is written in index 1026 of the C array and
transferred to the memory. It can thus be referenced by generating the address
“00000000010000000010,” obtained by concatenating nodes “0000000001” and
“0000000010” together. These nodes represent the indexes of the matrix D. Thus we use
an addressing scheme that is similar to the way we reference the matrix values.
For datasets of 256 taxa or higher, this scheme fails, since for obtaining distances
between say, nodes 254 and 255, we have an address “00111111100011111111” which
represents a value much larger than 65536. Also, using this scheme, we are wasting
memory locations. For example, the consecutive distance values 6, 8, 3 are stored one
after another in locations 1, 2, 3, respectively, but the distance value 7 suddenly jumps to
the memory location 1026. This waste of memory locations would reduce as the taxa size
increases, but it is still unacceptable.
To avoid this problem and to be able to implement larger taxa datasets we have to
employ a linear addressing scheme. The catch in this scheme is that our design needs to
maintain a record of the node information while fetching every distance value, so that we
can know which two nodes have the minimal distance between them. Thus the address
generation using concatenation of nodes is important for the design. We therefore resort
to an address modification scheme in which we generate the address in the original
scheme and then modify it into a linear 16-bit address that does not go beyond our limit
of 65536.
The address modification is a complex process, takes additional clock cycles and
requires additional states in the control structure. This causes the design to slow down
and the performance is affected quite a bit. We will discuss the impact of the address
modification on performance in later chapters.
Let us assume in the address “node1&node2” node1 refers to the row of the
matrix and node2 to the column. Thus, for the matrix given in Figure 15, we have the
following mapping for each address value as given in Table 2. The number of taxa is n =
4.
Matrix format   Linear format   Values per row
0-1             0               n-1
0-2             1
0-3             2
1-2             3               n-2
1-3             4
2-3             5               n-3
Table 2: Address mapping
From the above table we deduce that each row maps to a particular base address.
For example, row 0 maps to 0; for value 0-1 we have address 0, and for 0-2 we have 1.
We can see that for column value 2 the address 0, to which the row maps, is
incremented by 1. Similarly, for 0-3, the address 0 is incremented twice. Thus, by
mapping a row to a base address and adding (column - row - 1) to it, we obtain the
linear address used to fetch the required distance value. Row 0 has (n - 1) values
and a base address of 0, so the base address for row 1 is (0 + n - 1), which
equals 3 for n = 4. Thus, for 1-2 we have an address of 3, and for 1-3 we again add
(column - row - 1) to obtain the correct linear address. The steps used to
obtain the linear address are:
Obtain the base address to which the row maps to from a map memory
Add the value (column – row - 1 ) to obtain the final address
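The two-step modification can be checked with the small C sketch below; row_base recomputes the per-row base addresses that the design stores in the row-map Block RAM (the loop stands in for the memory lookup):

```c
/* Base address for `row` in an n-taxa upper-triangular distance matrix:
   row 0 starts at 0; each row r contributes (n - 1 - r) values. */
int row_base(int n, int row) {
    int base = 0;
    for (int r = 0; r < row; r++)
        base += n - 1 - r;
    return base;
}

/* Linear address: base of the row plus the offset within the row. */
int linear_address(int n, int row, int col) {
    return row_base(n, row) + (col - row - 1);
}
```

For n = 4 this reproduces Table 2: 0-1 maps to 0, 1-2 to 3, and 2-3 to 5, and every address stays below the 65536-word limit.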
The above methodology is employed to perform the address modification. The base
addresses to which each row maps are written initially into Block RAM. This
initialization of the node memory and row-map memory requires further additions to
the controller states and thus adds considerable delay to the design. This delay grows for larger taxa
datasets. The address modification component is placed within the PE outside of the
UPGMA application component. The address generated from the UPGMA component is
fed to the address modification component, and the modified address is fed to the
memory interface components.
4.3.20 Top-level Block
The top-level block in the design provides integration and routing of all the sub
modules described above. The final top level design is then placed within the VHDL
model for the PE, as a sub component to that model, and is interfaced with the memory
and LAD Bus interfaces for handling the transfer of data.
The VHDL models for all the blocks in the design are listed in Appendix A.
4.4 Design Verification
The design verification of the UPGMA design was performed using the
ModelSim simulation environment. The VHDL models of the WILDCARDTM system
provide a simulation model that could be used to run a host-based simulation. The
simulation was done for various taxa datasets. The benchmark dataset used was a 57-taxa
dataset for which we had the resultant output tree data generated from the software
implementation. The output tree generated by the software simulation of the hardware
design was compared with the benchmark data and found to match.
To verify the working of the hardware implementation on the WILDCARDTM
system, the data generated from the hardware implementation was compared with the
same benchmark data. The two results matched perfectly.
Data generated through the test data generating software was fed to both software
and hardware designs and the resulting output was compared. We found that the two
outputs matched, indicating that the hardware design was working correctly.
CHAPTER 5
EXPERIMENTAL DATA SET AND PERFORMANCE MEASUREMENT
5.1 Experimental Apparatus for UPGMA
The WILDCARDTM host-programming environment provides us the capability to
program the WILDCARDTM system and also allows us to create templates that are used
to write the host program. Looking back at the software design hierarchy explained in
Chapter 3, we see that the host “driver” program is written in C. The WILDCARDTM
provides the API routines that are used in the host program to perform the following
functions: (1) read and write to the on-board SRAM memories; (2) wait for the Virtex®
PE to interrupt (or, alternately, poll the status register for completion of a WILDCARDTM
controller operation), and (3) process the results of the API-initiated operation.
The UPGMA host program was written based on the example templates provided by
Annapolis Microsystems® for setting up a custom computing application for reading and
writing the SRAM memory banks, reading and writing data to the Virtex® Processing
Element (PE) register space, and for processing PE interrupts. Using these examples as
guides, we created a complete host-based, experiment “driver” application, employing
the above three components, to perform the following host-to-computing server protocol
steps:
Initialize the WILDCARDTM system;
Program the PE from the image file;
Set the Clock frequency;
Enable PE interrupt line;
De-assert PE Reset line;
Read distance data from the file into a distance array;
Transfer distance data from the distance array to WILDCARDTM Left memory;
Write the value of the number of taxa being operated upon into a PE register;
this triggers the design to start running and to assert a "done" signal after it finishes;
The C program waits until the done signal is set;
Reads data from the Right memory; and,
After reading, it outputs the data into a destination file in the PC host file system.
The PE initialization includes "opening" the board by calling the WC_Open( ) routine,
applying power to the board, asserting reset lines, and clearing any pending interrupt
requests left unprocessed by previous application programs. Once the design has been
synthesized, the EDIF file from the synthesis run is transformed into a placed and
routed design for the Virtex® FPGA, and the image file is generated by running the
Xilinx® M1 Alliance Series place and route tools.
This image is placed within the C project directory and is used to configure the PE by
calling the WC_PeProgramFromFile( ) API routine. After the PE image is loaded onto
the device, the PE clock frequency is set by calling the routine WC_SetClkFrequency( ),
the interrupt lines are enabled, and the Reset line de-asserted. The WILDCARDTM board
is then ready for transfer of data to the on-board SRAM memories.
The distance data is written into the left memory, while the right SRAM memory is
used for storing the output tree data. The number of taxa on which we are operating is
written into a single 32-bit register on the PE. The host C program then goes into “sleep”
mode, waiting for the PE interrupt to be set. Meanwhile, the UPGMA logic starts
executing, and operates on the distance data in order to generate the output tree data.
After the design finishes processing, it generates a “done” signal that is tied to the PE
interrupt line. Once the PE interrupt is set, the host C program comes out of its wait state
and starts processing the PE interrupt. The host program clears the interrupt and starts
reading the Phylogenetic tree data from the Right SRAM memory. Once all the output
tree data is read from the right memory and written into an output file, the “driver”
program clears all the memory buffers allocated during the execution, and proceeds to
“close” the device by calling the WC_Close( ) API routine. The C code for the host-
based experiment “driver” program is provided in the Appendix D.
5.2 Generating Random Taxa Test Data Sets
A program written in C++, using the MFC programming environment, was used
to generate the test data for testing the implementation of the UPGMA algorithm. The
program takes as input the following parameters: (1) the number of taxa; (2) the
maximum value of inter-node distance; and, (3) the number of repetitions of a single
distance value in the data set.
The test data are generated for taxa sizes of 10, 16, 32, 50, 64, 75, 100, 128, 150,
175, 200, 225 and 256. For each taxa size, ten different data sets are randomly generated
for that number of taxa. Furthermore, each created data set has its data values subjected
to permutations, creating up to 10 permutations per data set per number of taxa. The C++
code for the test data generation is given in Appendix E.
Figure 16. Test Data Generator Input Dialog Box.
Figure 16 presents a screenshot of the dialog box used by the program for
generating test data. The taxa size, maximum nodal distance, and the number of
repetitions of a particular distance value are given as inputs. When the data has been
generated the program pops up a confirmation dialog box.
The data values are generated randomly, making sure that each value is within the
maximum nodal distance limit set by the user. Also, the number of repetitions of
each value in the data set is constrained to be at most the repetition limit
specified by the user. For each taxa size, ten different datasets are generated, and for
each of these ten datasets ten different permutations are generated by changing the
positions of the distance values within the distance matrix.
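A stripped-down version of the generator's core logic is sketched below in C. The original tool is a C++/MFC dialog application; the repetition-count constraint and the permutation step are omitted here for brevity:

```c
#include <stdlib.h>

/* Fill the upper triangle of an n-taxa distance matrix with random
   values in [1, max_dist]; out must hold n*(n-1)/2 entries. */
void generate_distances(unsigned seed, int n, int max_dist, int *out) {
    srand(seed);
    int idx = 0;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            out[idx++] = 1 + rand() % max_dist;
}
```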
5.3 Measuring Time
The time taken for the UPGMA implementation to execute on the WILDCARDTM
system is measured using standard C time function calls. Time measurements are
collected for the time taken for the program to transfer the distance data to the memory,
generate the tree, and read back the output tree data from the memory. The time is
measured in terms of CPU clock ticks using the standard C language clock( ) function
call.
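The measurement idiom is the standard clock() pattern; a minimal sketch follows, in which the workload is a stand-in for the actual transfer, compute, and readback phases:

```c
#include <time.h>

/* Measure the CPU clock ticks consumed by an operation, as the host
   driver does around the transfer, compute, and readback phases. */
clock_t ticks_for(void (*op)(void)) {
    clock_t start = clock();
    op();
    return clock() - start;
}

/* Stand-in workload for demonstration. */
static volatile long sink;
void demo_op(void) {
    for (long i = 0; i < 100000; i++)
        sink += i;
}
```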
The time taken for memory transfer is measured separately in order to analyze the
cost of transferring data to and from the WILDCARDTM memory banks. This is done to
give us an idea of how the cost affects the performance of the implementation for high
values of N, the number of taxa. The current maximum of 256 taxa limits the number of
distance values to be written to the memory. Also, the WILDCARDTM memory banks are
65536 (64K) words, with each word being 32-bits in width. This constrains the number
of taxa that can be operated on for a given UPGMA run.
Through independent tests written for the WILDCARDTM system, the time taken
for writing to each of the memories has been collected and tabulated. Although not
shown here, the numbers collected indicate that the cost for writing the entire memory is
about 20 CPU clock ticks, while reading back the entire memory is about 100 clock ticks
on the 800 MHz Pentium III processor serving as the experimental workstation. This
indicates that reading data from the SRAM memories from the host is a more expensive
operation than writing to them.
For the purposes of our experiment, we write distance data and read back tree data
from the left and right on-board SRAM memories, respectively. As we are constrained to
operate upon at most a 256-taxa data set, the number of distance values needed to write a
complete matrix to the memory is 32,640, while the maximum number of memory locations
to be read back for the resulting tree is 511. These two values are obtained from
the following formulae.
Number of distance values = N(N-1)/2
Number of tree nodes = 2N - 1
Thus, since even our largest data set contains fewer values than the maximum
capacity of the memory banks, the memory writes take less than the 20 CPU ticks
needed for writing the entire memory array. Similarly, the number of output words to
be read back from the memory is small compared to the full memory size, so the time
taken is much less than the 100 ticks for reading an entire memory. The cost of
writing to and reading from the on-board SRAM memories is therefore not a large
factor in the performance overhead of our implementation.
However, if we look at realistic values of N, which can go as high as
10,000, the cost of reading and writing the memories becomes important. To get an
idea of what the cost might be, we can extrapolate the timing data, assuming that we
have unlimited memory capacity on our hardware board. We could write to the
WILDCARDTM memory banks multiple times to find the cost of writing more than
65,536 words and use this value to obtain a gross estimate of the cost to write to the
memory for large datasets. Similarly, we could read back data from the memory multiple
times and obtain an estimate of the cost to read the memory for large datasets.
This cost data could then be used to obtain a gross estimate of the overall
performance of the algorithm being implemented on the hardware. This provides a
theoretical extrapolation, and not the exact performance cost; however, it provides
valuable information on how the algorithm performance might scale to larger number of
taxa data sets, and whether the memory access costs would have a significant or minor
impact (assuming we had reasonably unlimited memory available). This data is
presented in Chapter 7 as part of the discussion of conclusions of this research.
CHAPTER 6
EXPERIMENTAL METHOD AND RESULTS
In this chapter, we present the results of running Phylogenetics data sets against
the UPGMA implementation on the WILDCARDTM-based reconfigurable custom
computing machine. We present the resultant data sets in terms of a bounded clock cycle
count using the clocking frequency of the host PC’s CPU clock, which gives us a count
of the total number of host clock cycles for a given computation run. We use this, as
opposed to using the on-board FPGA clock, as the former takes into account the
communication overhead of getting data to and from the WILDCARDTM board.
6.1 Running the Experiments
We take randomly generated data sets, permute them, and execute them on the
WILDCARDTM. We then increase the number of taxa considered in the input distance
matrix, generate new data sets and permute them, and execute them on the UPGMA
processor. Test data for taxa sizes of 10, 16, 32, 50, 64, 75, 100, 128, 150, 175, 200,
225 and 256 were executed. For each taxa size, ten different data sets, along with ten
different permutations of certain datasets, were run and timing results collected. The
results are described in the following sections.
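The test-input generation just described can be sketched as follows. The function names and the value range are illustrative assumptions, not the actual generator used in the experiments.

```python
import random

def random_distance_matrix(n, seed=None):
    """Symmetric n x n matrix with zero diagonal, as a UPGMA input."""
    rng = random.Random(seed)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = rng.uniform(1.0, 100.0)  # assumed range
    return d

def permute_matrix(d, order):
    """Relabel taxa: row/column i of the result is taxon order[i]."""
    n = len(d)
    return [[d[order[i]][order[j]] for j in range(n)] for i in range(n)]

# Generate one dataset and one random permutation of it.
n = 10
d = random_distance_matrix(n, seed=1)
order = list(range(n))
random.Random(2).shuffle(order)
p = permute_matrix(d, order)
```

A permuted matrix contains exactly the same distances under a different taxon labeling, which is why permutation is not expected to change the computation's latency.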
6.2 Experimental Results for Latency
The time taken for each taxon-size dataset, measured in CPU clock ticks, was
collected for the different datasets and permutations. The average time taken over ten
different permutations of each of the ten datasets for each taxon size is given in Tables 4
through 6. Ttotal is the total latency of the platform--the number of clock ticks taken for a
complete design run, including the data transfer between the host and the WILDCARDTM.
One aspect of defining the data set for purposes of running experiments is
permuting the data to assess whether permutation impacts the execution latency. In some
implementations of UPGMA in software, permutation might affect the execution of a
given data set at some number of taxa. The permutations were randomly generated along
with the data sets. However, we wanted to determine whether this aspect of the
organization of the data would affect the design in some meaningful way before taking
the time to run the full set of experiments.
Our expectation was that permuting the data would not be a significant factor in
the variation of latency values, because the time to perform computations on fixed-width
operators is largely independent of the actual data values passed as operands.
From the data collected from the sample permutation runs, this seems to be the case.
This is shown in Table 3 and Figure 17 below.
Table 3. Timing Results for Permuted Datasets.
Figure 17. Frequency Distribution for Latency versus Taxa Data Set Permutation.
From this analysis of the permutation, we conclude that we don’t need to consider
permutation of the data set values for a particular execution run. Therefore, we focus our
presentation of the data on the different UPGMA execution runs using randomized data
sets for each of the selected number of Phylogenetic taxa.
We next examine the response of the custom computing machine in terms of the
Mean Latency (averaging the data set samples) versus the number of taxa.
This is shown several different ways, so as to highlight the statistical convergence of the
latency values around the mean values computed across the ten randomized data sets.
The first plot in Figure 18 shows the basic Latency response curve as the number
of taxa increases to the maximum value of 256—the maximum number that can be stored
in the available memory on the WILDCARDTM, given the architecture.
We wanted to evaluate the deviation from the mean over the data sets for each
number of taxa, and observe what happens to this deviation as the number of taxa
increases to the maximum targeted for this research. What we see in the Latency data for
the different data sets--for a given number of taxa--is that the data tends to tightly cluster
around the mean, indicating minimal deviation. There is some wider variance as the
number of taxa grows, as evidenced from the curve in Figure 19, which gives the Latency
in log scale. The variance is slightly greater for the 200 and 256 datasets, as seen in the
curve. The rest of the datasets converge well toward their means.
We are not able to grow the number of taxa on the current
reconfigurable computing platform based on the WILDCARDTM to determine whether
there is a real trend in the deviation data. However, we believe that the results would
not be affected as the number of taxa grows. This is because hardware
computation speed is relatively fixed for fixed data bit-widths. The combinational circuit
would have a fixed latency, thus the computations would have a fixed latency. This leads
us to believe that the data results would not differ significantly with increase in the
number of taxa.
Figure 18. Mean Latency versus Number of Taxa (Normal Scale).
Figure 19. Latency versus Number of Taxa (Log Scale).
The deviation we see in our current results, we believe, can be attributed to
“noise” on the host PC side, as the host is not dedicated to running the WILDCARDTM
program exclusively, but at the same time has other processes running that can skew the
count of the clock ticks.
Table 4, given below, gives the timing results for the WILDCARDTM UPGMA
program running datasets of different taxa sizes. It lists the latency in clock ticks for
each of the ten different datasets for every taxon size, along with the mean, standard
deviation and variance of the ten datasets for a given number of taxa.
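The per-taxa statistics reported in Table 4 (mean, standard deviation and variance over the ten datasets) can be computed with the standard library; the sample values below are placeholders, not the measured results.

```python
import statistics

# Hypothetical latency samples (clock ticks) for ten datasets at one taxa size.
samples = [14.1, 13.9, 14.0, 14.2, 13.8, 14.0, 14.1, 13.9, 14.0, 14.0]

mean = statistics.mean(samples)
stdev = statistics.stdev(samples)        # sample standard deviation
variance = statistics.variance(samples)  # sample variance = stdev squared
```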
We examine the performance of the UPGMA processor and compare it with standard
complexity functions to obtain an upper bound in terms of Big-Oh.
Table 4. Latency Values for Data Sets at Generated Number of Taxa.
6.3 Bounding Time Complexity
Given the Latency curve as our number of taxa grows, we want to understand the
results in terms of the time complexity. Stanat and McAllister [25] provide an
appropriate taxonomy against which we can attempt to qualitatively “fit” our resultant
performance curve against those of standard time complexity functions. Given that we
have selected a means of measuring Latency that incorporates communication overhead,
and that we randomize and permute our data sets, we assume we are working with worst-
case behavior. We want to understand the behavior in terms of the standard forms of
Big-Oh.
Our first attempt is to compare our Latency plot against the base function plots for
the O(N), O(N log N), and O(N²) time complexity patterns. We show the Excel® plots
for the data shown in Table 4 in the plots of Figure 20 for both normal scale and for
logarithmic scale. We attempt to carry out a qualitative assessment of time complexity
bounds without resorting to deriving more precise recurrence expressions—although we
are able to generate curve-fitting equations directly onto the Excel plots.
What we see from the plots is that—given the limited range of N (number of taxa)
covered under the scope of this research—we appear bounded by O(N log N) time
complexity. However, the other conclusion we draw from this data is that we are too
constrained by lack of a sizable memory space (space complexity) in which to store a
larger number of matrix distance values for processing a greater number of taxa, N.
Therefore, we cannot draw a definitive conclusion about the performance of the
computing system for large values of N. However, we will explore what we might need
to do to grow to considerably larger values of N, into the thousands of taxa, in the
conclusion of this work. Also, to assess the benefits, we’ll use comparative data.
Figure 20. Bounding of Latency by Time Complexity Functions.
Figure 21. Bounding Latency by Time Complexity Functions Computed in Excel.
Finally, before leaving this aspect of the analysis, we show a different plot in
Figure 21, showing how difficult it is to qualitatively assess the time complexity, by
using the Excel® plot of trend analysis of the Latency curve, showing both square and
cube polynomial trend curves. The Excel software uses regression analysis to come up
with the trendlines. The trendlines help us in predicting the behavior of the Latency
curve with increase in number of taxa beyond our current 256 max size. We just don’t
have enough experimental data ourselves to see what happens to the Latency for larger
values of N. For this, we’d need to move the design to a larger platform—such as the
Star Bridge HC-36m or the SRC 6e, which would be the subject of future research.
However, the trendlines give us a way of characterizing our upper-bound performance
for values of N greater than 256, using the trend curves as a guide to how the design
might scale. From the trendline's forward prediction we obtain the R² value. The R²
value, also known as the coefficient of determination, ranges from 0 to 1 and indicates
whether the estimated predictive values of the trendline accurately match the
actual data. A trendline is most reliable when the R² value is at or near 1. We see
that the cube polynomial trendline provides the best bounding for the Latency curve, as its
R² value is better than that of the square polynomial trendline. We therefore conclude
that the algorithm's complexity is bounded by O(N³) for the hardware implementation.
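The regression-plus-R² procedure that Excel performs on the trendlines can be sketched directly. The latency values below are synthetic cubic-like data for illustration, not the measured results of Table 4.

```python
def polyfit(xs, ys, deg):
    """Least-squares polynomial fit via normal equations and Gaussian
    elimination with partial pivoting. Returns coeffs[i] for x**i."""
    m = deg + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        coeffs[i] = (b[i] - sum(a[i][j] * coeffs[j]
                                for j in range(i + 1, m))) / a[i][i]
    return coeffs

def r_squared(xs, ys, coeffs):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    pred = [sum(c * x ** i for i, c in enumerate(coeffs)) for x in xs]
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, pred))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Illustrative data that grows roughly cubically (not the measured latencies).
taxa = [10, 16, 32, 50, 64, 75, 100, 128, 150, 175, 200, 225, 256]
lat = [0.00005 * n ** 3 + 5 for n in taxa]

r2_square = r_squared(taxa, lat, polyfit(taxa, lat, 2))
r2_cube = r_squared(taxa, lat, polyfit(taxa, lat, 3))
# For cubic-like data the degree-3 fit yields the R2 value closer to 1.
```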
We now try to measure the quality of the solution by comparing the performance
of the reconfigurable custom computing solution against that of the baseline execution of
PHYLIP, the version of UPGMA software written in C by Felsenstein et al. [10].
6.4 Benchmarking Against PHYLIP
The software timing data is collected by running the PHYLIP UPGMA C code on
the same PC on which the WILDCARDTM host program is run. As before, we execute
the experiment and collect run-time data across a range of values of N, with different
randomized data sets that have been permuted (selecting half the number of permutations
as for the hardware version, for sake of brevity). For this, we use the same data sets that
were used to execute the UPGMA algorithm running on the WILDCARDTM. The
average time taken for the program to run under five different permutations of each of the
ten different datasets for each taxon size is given in Table 5 that follows. The run-time
plots corresponding to those for Latency of the software version are given in the figures
that follow.
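For reference, the computation being timed on both platforms is standard UPGMA clustering: repeatedly merge the closest pair of clusters, averaging distances weighted by cluster size, with each internal node's height set to half the merge distance. The following is a minimal textbook sketch, an illustration only and not PHYLIP's actual C implementation.

```python
def upgma(d, names):
    """Textbook UPGMA over a symmetric distance matrix d.
    Returns a nested tuple (left, height, right) for each merge."""
    clusters = {i: (names[i], 1) for i in range(len(d))}  # id -> (subtree, size)
    dist = {(i, j): d[i][j]
            for i in range(len(d)) for j in range(i + 1, len(d))}
    nxt = len(d)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)      # closest pair, i < j
        dij = dist.pop((i, j))
        ti, ni = clusters.pop(i)
        tj, nj = clusters.pop(j)
        for k in list(clusters):
            # Size-weighted average distance to the new cluster.
            dik = dist.pop((min(i, k), max(i, k)))
            djk = dist.pop((min(j, k), max(j, k)))
            dist[(k, nxt)] = (ni * dik + nj * djk) / (ni + nj)
        clusters[nxt] = ((ti, dij / 2, tj), ni + nj)
        nxt += 1
    (tree, _), = clusters.values()
    return tree

tree = upgma([[0, 2, 4], [2, 0, 4], [4, 4, 0]], ["A", "B", "C"])
# A and B merge first (distance 2, height 1), then C joins at height 2.
```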
Figure 22. PHYLIP C run time performance
Figure 22 shows the comparison of the PHYLIP C run-time performance with the
N log N and N² curves. These curves were added using trendlines without predictive
analysis. From this plot, we observe that the PHYLIP C run-time performance curve is
bounded above by the N² curve yet closely follows the N log N plot as its lower bound.
We believe that our limited number of taxa does not show the true nature of the curve,
and thus we would like to get a better bounding to obtain a closer match for the algorithm
complexity.
The first plot in Figure 23 provides the plots of Figure 22 on a log scale. On
this scale we find N log N matching the C run-time performance quite closely, but we
still cannot accurately predict the complexity of the algorithm for larger values of N. In
the second plot of Figure 23 we show two polynomial trendlines around the C run-time
performance curve. We have used forward prediction and obtained the R² values to assess
the accuracy of the trendlines. We find that the cube polynomial trendline matches
much better, with an R² value very close to 1. This suggests that the C algorithm provided
by Felsenstein [10] has a complexity of O(N³).
Figure 23. PHYLIP C run time performance with Time-complexity bounding
Table 5. PHYLIP Run-time Raw Data Set.
We now look at the performance comparison of hardware and software
implementations. The Average number of clock ticks taken for each of the taxa sizes, for
both the hardware and software implementations, is given in Table 6. The results show a
significant improvement for taxa up to 64, but then the rate of improvement starts to
decline as the taxa size increases to the 256 maximum for the experiments.
Taxa    Hardware    Software    Improvement
10      8.4         121         14.4
16      8.5         170.1       20
32      9.4         315.3       33.5
50      12          541.7       45.1
64      14          713         50.9
75      20.3        816         40.2
100     39.6        942.1       23.8
128     71.6        1107.5      15.5
150     110.2       1278        11.6
175     162.5       1479        9.1
200     242.6       1788.8      7.4
225     342.1       2250.7      6.6
256     504.4       2659.9      5.3
Table 6. Data Comparison Between Hardware and Software UPGMA Implementations.
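The Improvement column in Table 6 is simply the ratio of software to hardware clock ticks; a minimal sketch of that calculation, using the first three rows of the table:

```python
# Speedup = software ticks / hardware ticks, rounded to one decimal place,
# for the first three taxa sizes (10, 16, 32) in Table 6.
hardware = [8.4, 8.5, 9.4]
software = [121.0, 170.1, 315.3]
improvement = [round(s / h, 1) for h, s in zip(hardware, software)]
```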
This behavior in the hardware implementation is accounted for by the fact that the
FPGA-based designs used to implement taxa counts of 75 and above were adversely
affected by the memory addressing scheme (discussed in Chapter 4) adopted in the final
architecture modifications of the design resident on the WILDCARDTM. The negative
impact on the performance of the design is also attributed to the four-cycle latency of an
SRAM memory read. This latency induces wait states in the control structure, causing the
design to run slower. The address modification would not have been necessary had the
WILDCARDTM system had larger memory banks, so that the original addressing scheme
of concatenating nodal values could still be used.
The improvement over the PHYLIP software implementation goes as high as 50
times for a 64-taxa data set. Beyond this size, the on-board memory address
modification becomes necessary, causing the design's performance to deteriorate: the
speedup for the 75-taxa data set falls to 40 times, and far less for larger data set sizes,
up to the maximum. This deterioration is attributed to the limited memory capacity of
the WILDCARDTM board; larger memory banks, if available, should help scale the
design more effectively. The size of the Virtex XCV300E chip also prevents us from
implementing parallel or pipelined design architectures that might help reduce the
latency of the memory addressing to a certain extent.
Figure 24. Plotting the Performance Improvement over PHYLIP as Taxa Count Grows.
Figure 24 provides a plotted view of how the algorithm scales, as N grows large.
Given the maximum taxa data set at 256, we see that the performance deteriorates once
we encounter the increased overhead of memory address computation on data sets for
more than 64 taxa.
Thus, due to inherent limitations of the WILDCARDTM hardware board on which
our design executes, we could not obtain performance improvements that might
otherwise be obtained by applying custom logic/custom computing methodologies. A
larger board with a larger memory size would allow scaling for larger taxa counts and
certainly provide better performance over the software implementation in PHYLIP.
Figure 25. Plotting the Performance Difference as Taxa Count Grows.
However, if we look at the plot of the performance data itself, and compare the
two curves, we see that the performance improvement does indeed seem to scale, as the
performance curve for the PHYLIP implementation of UPGMA grows at a faster rate
than that of the implementation of UPGMA as a custom computing machine architecture.
Figure 26. Plotting the Performance Difference as Taxa Count Grows (Log Plot).
If we observe the trend on a logarithmic plot, we see that, at the peak
performance point for the custom computing implementation on the WILDCARDTM
(between 64 and 75 taxa), we are operating close to two orders of magnitude faster than
the software PHYLIP implementation. Furthermore, we see that this improvement
decreases to a single order of magnitude--with an apparently decreasing trend in order-
of-magnitude performance difference as we grow to the limit of 256 taxa. This
corroborates the earlier plot showing a final 5X difference in performance between the
two implementations.
CHAPTER 7
SUMMARY AND CONCLUSIONS
7.1 Summary of Research Contributions
Custom computing systems built using reconfigurable logic devices provide
several orders of magnitude speed-up in the execution of algorithms over their
execution on conventional microprocessor-based systems. In addition, such
systems have the flexibility to program--and reprogram via reconfiguration--the actual
logic functions of the VLSI circuit with different applications in time and space. Custom
computing systems are implemented using FPGA custom-logic devices that are easily
and quickly programmed by an end-user. This research conducted design and analysis of
a custom computing application architecture for the UPGMA Bioinformatics algorithm
implemented on an FPGA-based custom-computing platform. We examined different
architectures of the design for the purpose of achieving better resource usage and of
conforming to the constraints of the hardware resources--most notably memory--on the
WILDCARDTM. We discussed the final architecture created and presented results of the
system performance, as measured and compared against that of the UPGMA algorithm
written in C, running on a single-processor Pentium® PC.
7.2 Conclusions
The results presented in Chapter 6 provided us with an insight into the
performance of both the hardware and software implementations. The hardware results
showed little variance for different permutations of a dataset for a given number of taxa.
The timing results also converge towards a mean value showing very little variance over
different datasets for a given number of taxa. The hardware results showed significant
improvement over the software implementation, with performance peaking at the 64-taxa
datasets. For datasets of 75 taxa and above, the performance began to degrade
considerably compared to that of the PHYLIP software implementation—although the
custom computing implementation was still between a half- and a full order of magnitude
faster. The hardware implementation was 50 times faster than the software
implementation for the 64-taxa datasets, indicating a substantial performance
improvement, given the architectural limitations of memory addressing cited earlier.
We have also shown, using predictive analysis in Excel, that both
implementations are bounded by functions with time complexity O(N³). The
polynomial equations generated for both the hardware and software performance curves
were of order N³, with a large difference in the coefficients and
constants of each time function. The predictive polynomial equation generated for the
software performance curve shown in Chapter 6 had large constant and coefficient values
compared to those of the polynomial equation for the hardware performance curve. These
predictive polynomial equations do not represent actual values, but they do give us a
reasonable estimate of the behavior of both implementations had we been able to
scale beyond the limit of 256 taxa.
The large values of the coefficients indicate that the software
implementation has underlying compiler-related and operating-system overhead that
affects its performance. The hardware implementation of the UPGMA algorithm avoids these
sources of overhead by its implementation of the computational units directly onto
FPGA-based hardware storage and functional units. This provides a considerable speed-
up, facilitating a higher-performance solution as is evidenced through the results
obtained.
However, we see that the hardware performance degrades rapidly for datasets of
75 taxa and above. The performance degradation is attributed to the linear addressing
scheme used for the final architecture and to the latency of a single read from the
WILDCARDTM on-board SRAM memory banks. A read from the memory banks takes
4 clock cycles, which adds wait states to the control structure and negatively
impacts the performance of the design. The linear addressing scheme was
employed to facilitate the implementation of taxa counts of 75 and above. The WILDCARDTM
memory provides a maximum of 65,536 words on each bank, and this limitation forced
us to convert the addresses generated by the original addressing scheme into a linear
addressing scheme. The original addressing scheme would generate address values
greater than 65,536, limiting the number of taxa that could be implemented, even though
datasets of 256 taxa could be stored within the 65,536 memory locations. The linear
addressing scheme enables the design to handle larger datasets up to the 256-taxa
limit, the maximum taxa size defined as a goal of this research. However, the address
modifications necessary for this purpose induce additional states in the control structure,
adversely impacting the performance of the design.
The original addressing scheme would require larger memory capabilities on the
hardware that the WILDCARDTM platform lacks. We discuss in the sections below how
larger memory banks--as well as certain architecture modifications--might improve the
performance.
7.3 Future Work
In the earlier sections we looked at certain issues that hampered the
performance of the UPGMA design implemented on the WILDCARDTM system. We list
these issues below:
Memory size limitation and memory addressing schemes
Latency of the memory read
Device size (FPGA resources)
We discuss how alleviating each of these bottlenecks might increase performance.
7.3.1 Memory size and Memory address schemes
The WILDCARDTM system provides two memory banks with 65,536 words
each. The left memory bank was used for storing the distance matrix data and the right
memory bank for storing the tree output data. The limit of 65,536 words on the left memory
necessitates address modification, which degrades the design's performance. We could
overcome this problem in two ways:
Generating a better addressing scheme
Moving to platforms with larger memory banks
The first option would give us a solution that could be implemented on the
currently available WILDCARDTM board, but it would be very difficult because the
address modification is complex, as described in Chapter 4. The second option is easier
and would require us to explore other custom computing platforms that offer larger
memory capacities. Larger memory would eliminate the need for address
modifications and alternative addressing schemes.
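To illustrate why the addressing scheme matters, the following sketch contrasts a concatenated nodal address with a linear row-major address. The 10-bit index fields and the 16-bit (65,536-word) bank width are assumptions for illustration, not necessarily the exact widths used in the design.

```python
# Illustrative contrast between the two addressing schemes discussed above.

def concat_address(i, j, bits=10):
    """Original-style scheme: concatenate two nodal indices into one address.
    Needs 2*bits address bits regardless of how full the matrix is."""
    return (i << bits) | j

def linear_address(i, j, n):
    """Linear scheme: row-major index into an n x n distance matrix."""
    return i * n + j

n = 256
i, j = 200, 180
# Concatenating two 10-bit indices needs 20 address bits, which exceeds a
# 16-bit (65,536-word) bank, even though the 256 x 256 matrix itself fits
# exactly within 65,536 locations under linear addressing.
assert concat_address(i, j) >= 2 ** 16
assert linear_address(i, j, n) < 2 ** 16
```

The cost of the linear scheme is the multiply-and-add per access, which is where the extra control states come from in the hardware design.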
The time taken to write to and read from the memories on the WILDCARDTM board
from the host is also a significant factor in the performance of the design. We
have seen that writing an entire memory bank takes 20 clock ticks on an 800 MHz Intel
Pentium host system, while reading an entire memory bank takes 100 clock ticks, and
that the time taken increases linearly with the number of writes or reads.
Therefore, to read the memory banks three times the host would take 300 clock ticks, and to
write three times it would take 60 clock ticks. This linear increase would certainly affect
the performance, and we would need to look at other architectures that might provide
better performance in terms of reading from and writing to the host.
7.3.2 Latency for a Memory Read
We have seen in the earlier sections that a memory read on the WILDCARDTM
system takes 4 cycles, which hurts performance by slowing the operation of the design.
To overcome this, we would have to look at other custom computing platforms that offer
better read- and write-cycle latencies. This would remove the additional wait states
induced in the control structure and speed up the design.
7.3.3 Device size
The WILDCARDTM carries a Xilinx Virtex XCV300E chip. The Virtex-E chip
has a total of 3,072 slices. This is small compared to the Virtex-II device, which has a
total of 33,732 slices, offering much more space to implement larger designs; it would
also enable us to examine different architectures for the algorithm under consideration,
namely parallel or pipelined architectures. We have seen in the literature that, in general,
parallel architectures offer very good performance improvements [22, 23].
The current implementation of the design takes up 60 percent of the
WILDCARDTM Virtex-E chip. A parallel implementation would likely have multiple
copies of the design components, such as the datapath and control path, running in
parallel. These multiple units would work on sub-parts of the distance matrix. This
parallel operation would speed up the design to a large extent, but the multiple parallel
units would increase the design size, and there would be some penalty in the
communication overhead between the interacting subparts of the problem.
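The row-partitioning idea behind such a parallel decomposition might be sketched as follows; the chunking policy and unit count are illustrative assumptions, not a committed design.

```python
# Sketch of the parallel decomposition suggested above: split the rows of
# the distance matrix across P processing units as near-equal chunks.

def partition_rows(n, p):
    """Assign n matrix rows to p units as contiguous, near-equal chunks."""
    base, extra = divmod(n, p)
    chunks, start = [], 0
    for u in range(p):
        size = base + (1 if u < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

chunks = partition_rows(256, 4)
# Each unit would scan only its own rows for the local minimum distance;
# a small reduction step combines the P local minima, which is where the
# communication overhead mentioned above appears.
```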
Thus, to implement a parallel architecture we would need a larger device or
multiple devices to ensure that we do not run out of resources. However, the speed-up
that can be obtained is attractive enough that it warrants an exploration of the trade-off
between increased resources, parallelism versus communication overhead, and the
impact on computation speed. Therefore, future work should investigate custom
computing architectures offering the requisite resources to implement a parallel
architecture for the UPGMA algorithm on a custom computing fabric.
We have examined the issues that caused problems in implementing the
UPGMA algorithm on the WILDCARDTM system and have discussed how these
problems might be resolved. The options suggested are presented as future work that
might enable a performance improvement for the UPGMA algorithm--one that could
conceivably alter the upper bound of the time complexity of the algorithm itself.
BIBLIOGRAPHY
[1] Andre DeHon, John Wawrzynek, The case for reconfigurable processors. Berkeley
Reconfigurable Architectures, Systems, and Software, University of California,
Berkeley. http://citeseer.nj.nec.com/dehon97case.html.
[2] Nick Tredennick, The case for reconfigurable computing. Micro Design
Resources, Microprocessor Report, Vol. 10, No. 10, Aug 1996.
[3] Stephen Brown and Jonathan Rose, Architecture of FPGAs and CPLDs: A
Tutorial, IEEE Design and Test of Computers, Vol. 13, No. 2, pp. 42-57, 1996.
[4] John V. Oldfield, Richard C. Dorf, System Implementation Strategies, Chapter 1,
Field Programmable Gate Arrays, Reconfigurable logic for Rapid Prototyping and
Implementation of Digital Systems, pg 1-26, Wiley-Interscience Publishing, 1995.
[5] Paul Graham, Brent Nelson, FPGA based Sonar processing. ACM/SIGDA
International Symposium for Field Programmable Gate Arrays. Pg 201-208.
February 1998. http://www.dynamicsilicon.com/Articles/Reconfigurable.pdf
[6] Jeffrey Arnold, Kenneth L. Pocek, Genetic Algorithms In Software and In
Hardware - A Performance Analysis of Workstation and Custom Computing
Machine Implementations, Proceedings of IEEE symposium of Field
Programmable Custom Computing Machines, pg 216-225, April 1996. IEEE
Computer Society.
[7] Jason R. Hess, David C. Lee, Scott J. Harper, Mark T. Jones, and Peter M.
Athanas, Implementation and Evaluation of a Prototype Reconfigurable Router,
Proceedings of IEEE symposium of Field Programmable Custom Computing
Machines, pg 44-50, April 1999. IEEE Computer Society.
[8] R. Petersen, B. L. Hutchings, An Assessment of the Suitability of FPGA-Based
Systems for Use in Digital Signal Processing, In 5th International Workshop on
Field Programmable Logic and Applications, pp 293-302, August 1995, Oxford,
England.
[9] P.W. Dowd, J.T. McHenry, F.A. Pellegrino, T.M. Carrozzi and W.B. Cocks, An
FPGA-Based Coprocessor for ATM Firewalls, Proceedings of the IEEE
Symposium on FPGA's for Custom Computing Machines (FCCM97), pg 30-39,
April 1997.
[10] Joe Felsenstein, PHYLIP source code, Department of Genome Sciences,
University of Washington, http://evolution.genetics.washington.edu/phylip.html
[11] R. Shamir, UPGMA, Tel Aviv University,
http://www.math.tau.ac.il/~rshamir/algmb/00/scribe00/html/lec08/node21.html
[12] Peter H. Weston, Michael D. Crisp, Introduction to Phylogenetic Systematics,
Invited Contributions of the Society of Australian Systematic Biologists, SASB,
http://www.science.uts.edu.au/sasb/WestonCrsip.html.
[13] James P. Davis, Peter J. Waddell, Sreesa Akella, Methods and Architectures for
Realizing Fast Phylogenetic Computation Engines Using VLSI Array Based
Logic, Submitted to IEEE Bioinformatics Conference, Aug, 2002.
[14] R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Building phylogenetic trees, Chapter
7, Biological Sequence Analysis, pg 160-190. Cambridge University Press, 1998.
[15] D.L. Swofford, G.J. Olsen, P.J. Waddell, and D.M. Hillis, Phylogenetic Inference,
Chapter 11, Molecular Systematics, pg 45-572, second edition, (ed. D.M. Hillis,
and C. Mortiz), Sinauer Association, Sunderland, MA, 1996.
[16] Xilinx Inc, Virtex-E 1.8V FPGA Complete Datasheet, March 2003
[17] Duncan A. Buell, Jeffrey M. Arnold, Walter J. Kleinfelder, SPLASH2 FPGAs in a
Custom Computing Machine, IEEE Computer Society Press, 1996.
[18] Annapolis Microsystems Inc, Annapolis WILDCARDTM System Reference Manual,
Revision 2.6, 2003. www.annapmicro.com
[19] SRC Computers Inc., www.srccomputers.com.
[20] StarBridge Systems, www.starbridgesystems.com
[21] Yutana Jawchinda, Hideaki Kobayashi, Quantifying Design Reuse: An HDL-
Based Design Experiment, International HDL Conference, April, 1999.
[22] H. J. Whitehouse, J. M. Speiser, K. Bromley, Signal Processing Applications of
Concurrent Array Processor Technology, Chapter 2, VLSI and Modern Signal
Processing, Prentice-Hall, Inc., 1985.
[23] Axelrod, R., The Complexity of Cooperation: Agent-Based Models of Competition
and Cooperation, Princeton University Press, 1997.
[24] Billsus, D., C. A. Brunk, C. Evans, B. Gladish and M. Pazzani, “Adaptive
Interfaces for Ubiquitous Web Access”, Communications of the ACM, Vol. 45,
No. 5, May 2002, pp. 34-38.
[25] Stanat, D. F. and D. F. McAllister, Discrete Mathematics in Computer Science,
Prentice Hall, Inc., 1977.
APPENDIX A
VHDL SOURCE CODE
--------------------------------------------------------
-- Add, Subtract, decrement modules needed for
-- Address modification
--------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : add1.vhd
-- Entity       : add_1, sub_1, dec_1
-- Architecture : add_1_beh, sub_1_beh, dec_1_beh
--------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;
use ieee.std_logic_arith.all;

entity add_1 is
  port(
    in1 : in  std_logic_vector(9 downto 0);
    in2 : in  std_logic_vector(9 downto 0);
    opt : out std_logic_vector(9 downto 0)
  );
end add_1;

architecture add_1_beh of add_1 is
begin
  opt <= in1 + in2;
end add_1_beh;
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;
use ieee.std_logic_arith.all;

entity sub_1 is
  port(
    in1 : in  std_logic_vector(9 downto 0);
    in2 : in  std_logic_vector(9 downto 0);
    opt : out std_logic_vector(15 downto 0)
  );
end sub_1;

architecture sub_1_beh of sub_1 is
begin
  -- Zero-extend the 10-bit inputs to the 16-bit output width.
  opt <= ("000000" & in1) - ("000000" & in2);
end sub_1_beh;
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;
use ieee.std_logic_arith.all;

entity dec_1 is
  port(
    inp : in  std_logic_vector(15 downto 0);
    opt : out std_logic_vector(15 downto 0)
  );
end dec_1;

architecture dec_1_beh of dec_1 is
begin
  opt <= inp - '1';
end dec_1_beh;
--------------------------------------------------------
-- Height Adder
--------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : adder_h.vhd
-- Entity       : adder_h
-- Architecture : adderh_beh
--------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;
use ieee.std_logic_arith.all;

entity adder_h is
  port(
    Datainp1 : in  std_logic_vector(15 downto 0);
    Datainp2 : in  std_logic_vector(15 downto 0);
    Data_out : out std_logic_vector(15 downto 0)
  );
end adder_h;

architecture adderh_beh of adder_h is
begin
  Data_out <= Datainp1 + Datainp2;
end adderh_beh;
--------------------------------------------------------
-- Adder module
--------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : addernew.vhd
-- Entity       : adder
-- Architecture : adder_beh
--------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;
use ieee.std_logic_arith.all;

entity adder is
  port(
    Datainp1 : in  std_logic_vector(31 downto 0);
    Datainp2 : in  std_logic_vector(31 downto 0);
    Data_out : out std_logic_vector(31 downto 0)
  );
end adder;

architecture adder_beh of adder is
begin
  Data_out <= Datainp1 + Datainp2;
end adder_beh;
--------------------------------------------------------
-- Address register entity architecture
--------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : addr_dd.vhd
-- Entity       : addr_dd
-- Architecture : addr_dd_beh
--------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;

entity addr_dd is
  port(
    addr      : in  std_logic_vector(19 downto 0);
    reset     : in  std_logic;
    clk       : in  std_logic;
    addr_dd_s : out std_logic_vector(19 downto 0)
  );
end addr_dd;

architecture addr_dd_beh of addr_dd is
  signal temp, temp1 : std_logic_vector(19 downto 0);
begin
  -- Two-stage delay register: addr_dd_s lags addr by two clock cycles.
  process(clk, reset)
  begin
    if reset = '1' then
      temp      <= (others => '0');
      temp1     <= (others => '0');
      addr_dd_s <= (others => '0');
    elsif clk = '1' and clk'event then
      temp      <= addr;
      temp1     <= temp;
      addr_dd_s <= temp1;
    end if;
  end process;
end addr_dd_beh;
------------------------------------------------------
-- Height Adder Register entity architecture pair
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : haddregisterh.vhd
-- Entity       : adderhreg
-- Architecture : addhreg_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity adderhreg is
  port(
    addout : in  std_logic_vector(15 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(15 downto 0)
  );
end adderhreg;

architecture addhreg_beh of adderhreg is
begin
  process(clk, reset)
  begin
    if reset = '1' then
      regval <= (others => '0');
    elsif clk = '1' and clk'event then
      if regclr = '1' then
        regval <= (others => '0');
      elsif regen = '1' then
        regval <= addout;
      end if;
    end if;
  end process;
end addhreg_beh;
------------------------------------------------------
-- Distance comparison unit entity architecture pair
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : comparedst.vhd
-- Entity       : comparedst
-- Architecture : comparedst_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity comparedst is
  port(
    Datainp1     : in  std_logic_vector(31 downto 0);
    valid_dst    : in  std_logic;
    distreg_val  : in  std_logic_vector(31 downto 0);
    first_val    : in  std_logic;
    addr         : in  std_logic_vector(19 downto 0);
    distreginp   : out std_logic_vector(31 downto 0);
    distreg_en   : out std_logic;
    rowreginp    : out std_logic_vector(9 downto 0);
    rowreg_en    : out std_logic;
    colreginp    : out std_logic_vector(9 downto 0);
    colreg_en    : out std_logic;
    addr1_reg_en : out std_logic;
    addr2_reg_en : out std_logic
  );
end comparedst;

architecture comparedst_beh of comparedst is
begin
  -- Latch the incoming distance (and the row/column fields of its address)
  -- when it is the first valid value, or when it is smaller than the
  -- currently stored minimum.
  process(addr, first_val, Datainp1, distreg_val, valid_dst)
  begin
    if first_val = '1' and valid_dst = '1' then
      rowreginp  <= addr(19 downto 10);
      colreginp  <= addr(9 downto 0);
      distreginp <= Datainp1;
      rowreg_en <= '1'; colreg_en <= '1'; distreg_en <= '1';
      addr1_reg_en <= '1'; addr2_reg_en <= '1';
    elsif Datainp1 < distreg_val and valid_dst = '1' then
      rowreginp  <= addr(19 downto 10);
      colreginp  <= addr(9 downto 0);
      distreginp <= Datainp1;
      rowreg_en <= '1'; colreg_en <= '1'; distreg_en <= '1';
      addr1_reg_en <= '1'; addr2_reg_en <= '1';
    else
      rowreginp  <= (others => '0');
      colreginp  <= (others => '0');
      distreginp <= (others => '0');
      rowreg_en <= '0'; colreg_en <= '0'; distreg_en <= '0';
      addr1_reg_en <= '0'; addr2_reg_en <= '0';
    end if;
  end process;
end comparedst_beh;
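For reference, the selection rule implemented by comparedst_beh can be modeled in software. The sketch below (Python, not part of the thesis sources; names are illustrative) keeps the smallest valid distance seen so far, together with the row and column fields packed into the 20-bit address, mirroring the first_val and less-than cases above.

```python
# Software model (illustrative) of the comparedst minimum tracker:
# keep the smallest valid distance with its (row, col) address fields.
def compare_step(dist, addr, valid, first, best):
    """best is (min_dist, row, col) or None; returns the updated best."""
    if not valid:
        return best  # invalid sample: hold the current minimum
    row = (addr >> 10) & 0x3FF  # addr(19 downto 10)
    col = addr & 0x3FF          # addr(9 downto 0)
    if first or best is None or dist < best[0]:
        return (dist, row, col)
    return best
```

Scanning the distance matrix with this step yields the globally closest pair, which is the pair UPGMA merges next.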
------------------------------------------------------
-- Controller entity architecture
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : controllerverfull9.vhd
-- Entity       : ctrl_blk
-- Architecture : ctrl_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity ctrl_blk is
  port(
    clk, reset                          : in  std_logic;
    valid_numsp, addr_grt, child_cnt_gr : in  std_logic;
    count_gr, all_nodes_done            : in  std_logic;
    initialized, div_valid, a_grt       : in  std_logic;
    ext_node                            : in  std_logic;
    r_clr, r_inc, a_clr, a_inc          : out std_logic;
    c2_read, mem_update                 : out std_logic;
    R_dec, Rp_dec                       : out std_logic;
    c1_incr, c2_incr, c1p_incr, ch_incr : out std_logic;
    c1_load1, c1_load2                  : out std_logic;
    c2_load1, c2_load2                  : out std_logic;
    c1p_load, ch_load                   : out std_logic;
    c1_clr, c2_clr, c1p_clr, ch_clr     : out std_logic;
    row_col_sel                         : out std_logic_vector(1 downto 0);
    addr2_reg_dec, node_write           : out std_logic;
    read_mem, write_mem                 : out std_logic;
    read_wmem, write_wmem               : out std_logic;
    rowreg_clr, colreg_clr, distreg_clr : out std_logic;
    mulreg_en, mulreg_clr               : out std_logic;
    addregwclr, addregwen               : out std_logic;
    addregclr, addregen                 : out std_logic;
    divreg1clr, divreg1en               : out std_logic;
    initial_run, store_cur_addr         : out std_logic;
    node_mem_initialize, mem_initialize : out std_logic;
    addr_gen1_en, addr_gen2_en          : out std_logic;
    rmem_read, rmem_write               : out std_logic;
    ad_reg_en, ad_reg_clr               : out std_logic;
    row_zero, numsp_val, valid_td       : out std_logic;
    nodeid_sel                          : out std_logic_vector(1 downto 0);
    n_type_sel, incnt_inc, done         : out std_logic
  );
end ctrl_blk;
architecture ctrl_beh of ctrl_blk is

  type state is (idle, wait_init, rmem_init, rmem_init1, node_mem_init,
                 mem_init, wait_st, rmem_read_st, addr_mod1, addr_mod2,
                 fetch_dst2, compare_dst, c2_inc_ld_st, wait_st1,
                 addr2_gen_st, fetch_dst, wait_st2, add_dst, mul_dst,
                 div_dst, wait_rmem, write_dist_to_mem, write_mem_wait,
                 c2_incr_st, c2_read_st1, c2_read_st2, c2_read_st22,
                 br_update1, br_update2, rmem_read_st2, addr_mod1_st,
                 addr_mod2_st, tree_map_init1, tree_map_init2, tree_map_int,
                 tree_map1, tree_map2, done_st);

  signal cur_st      : state;
  signal count_cycle : integer range 0 to 3;
begin

  -- Single-process Moore machine: asynchronous reset; state and all
  -- control outputs are registered on the rising clock edge.
  process(clk, reset)
  begin
    if reset = '1' then
      store_cur_addr <= '0'; r_clr <= '0'; r_inc <= '0'; a_clr <= '0'; a_inc <= '0';
      rmem_read <= '0'; rmem_write <= '0'; ad_reg_en <= '0'; ad_reg_clr <= '0';
      row_zero <= '0'; mem_initialize <= '0'; node_mem_initialize <= '0';
      count_cycle <= 0;
      read_mem <= '0'; write_mem <= '0'; read_wmem <= '0'; write_wmem <= '0';
      addr_gen1_en <= '0'; addr_gen2_en <= '0'; c2_read <= '0'; mem_update <= '0';
      R_dec <= '0'; Rp_dec <= '0';
      c1_incr <= '0'; c2_incr <= '0'; c1p_incr <= '0'; ch_incr <= '0';
      c1_load1 <= '0'; c1_load2 <= '0'; c2_load1 <= '0'; c2_load2 <= '0';
      c1p_load <= '0'; ch_load <= '0';
      c1_clr <= '0'; c2_clr <= '0'; c1p_clr <= '0'; ch_clr <= '0';
      row_col_sel <= "00"; addr2_reg_dec <= '0'; node_write <= '0';
      mulreg_en <= '0'; mulreg_clr <= '0';
      addregwen <= '0'; addregwclr <= '0'; addregen <= '0'; addregclr <= '0';
      distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0';
      divreg1clr <= '0'; divreg1en <= '0';
      numsp_val <= '0'; valid_td <= '0'; incnt_inc <= '0'; initial_run <= '0';
      nodeid_sel <= "00"; n_type_sel <= '0'; done <= '0';
      cur_st <= idle;
    elsif clk = '1' and clk'event then
      case cur_st is

        when idle =>
          done <= '0';
          node_mem_initialize <= '0';
          if valid_numsp = '1' then
            cur_st <= wait_init;
            c1p_incr <= '1'; R_dec <= '1'; Rp_dec <= '1';
            row_zero <= '1'; rmem_write <= '1';
          else
            cur_st <= idle;
            c1p_incr <= '0'; row_zero <= '0'; rmem_write <= '0';
          end if;

        when wait_init =>
          R_dec <= '0'; c1p_incr <= '0'; row_zero <= '0'; rmem_write <= '0';
          a_inc <= '1';
          cur_st <= rmem_init;

        when rmem_init =>
          Rp_dec <= '0'; a_inc <= '1'; rmem_write <= '1';
          if ext_node = '1' then
            cur_st <= rmem_init1;
            row_zero <= '1'; ad_reg_en <= '0'; r_inc <= '1';
          else
            row_zero <= '0'; ad_reg_en <= '1'; r_inc <= '0';
            cur_st <= rmem_init;
          end if;

        when rmem_init1 =>
          row_zero <= '0';
          if a_grt = '1' then
            rmem_write <= '0'; ad_reg_en <= '0'; ad_reg_clr <= '1';
            a_inc <= '0'; r_inc <= '0'; a_clr <= '1'; r_clr <= '1';
            write_wmem <= '1';
            cur_st <= node_mem_init;
          else
            rmem_write <= '1'; ad_reg_en <= '1'; ad_reg_clr <= '0';
            a_inc <= '1'; a_clr <= '0'; r_clr <= '0'; r_inc <= '1';
            write_wmem <= '0';
            cur_st <= rmem_init1;
          end if;

        when node_mem_init =>
          Rp_dec <= '0'; c2_incr <= '0'; c1p_incr <= '0';
          if initialized = '1' then
            node_mem_initialize <= '0'; write_wmem <= '0';
            mem_initialize <= '1'; c1_incr <= '0';
            cur_st <= mem_init;
          else
            node_mem_initialize <= '1'; write_wmem <= '1';
            mem_initialize <= '0'; c1_incr <= '1';
            cur_st <= node_mem_init;
          end if;

        when mem_init =>
          Rp_dec <= '0'; mem_initialize <= '0'; read_mem <= '1'; node_write <= '0';
          c1_incr <= '0'; c1p_incr <= '0'; c2_incr <= '1'; c2_clr <= '0';
          R_dec <= '0'; Rp_dec <= '0';
          cur_st <= wait_st;

        when wait_st =>
          read_mem <= '1'; store_cur_addr <= '1'; initial_run <= '1';
          addr_gen1_en <= '1'; c2_incr <= '0';
          cur_st <= rmem_read_st;

        when rmem_read_st =>
          rmem_read <= '1'; store_cur_addr <= '0';
          c2_incr <= '0'; c2_load1 <= '0'; c2_load2 <= '0'; addr_gen1_en <= '0';
          cur_st <= addr_mod1;

        when addr_mod1 =>
          rmem_read <= '0'; c2_incr <= '0'; c2_load1 <= '0'; c2_load2 <= '0';
          cur_st <= addr_mod2;

        when addr_mod2 =>
          cur_st <= fetch_dst2;

        when fetch_dst2 =>
          read_mem <= '1'; node_write <= '0';
          if count_gr = '1' then
            c2_load1 <= '1'; c2_load2 <= '1'; c2_incr <= '0';
          else
            c2_load1 <= '0'; c2_load2 <= '0'; c2_incr <= '1';
          end if;
          cur_st <= compare_dst;

        when compare_dst =>
          read_mem <= '1';
          c1_load1 <= '0'; c1_load2 <= '0'; c2_load1 <= '0'; c2_load2 <= '0';
          c2_incr <= '0'; initial_run <= '0';
          if addr_grt = '1' then
            cur_st <= wait_st1;
            addr_gen1_en <= '0'; addr_gen2_en <= '0'; store_cur_addr <= '0';
          else
            cur_st <= rmem_read_st;
            addr_gen1_en <= '1'; addr_gen2_en <= '0'; store_cur_addr <= '1';
          end if;

        when wait_st1 =>
          -- wait for four clock cycles so that all data is operated upon
          if count_cycle < 2 then
            cur_st <= wait_st1;
            count_cycle <= count_cycle + 1;
            c1_load1 <= '0'; c2_load1 <= '0';
            c1_clr <= '1'; c2_clr <= '1'; c1p_clr <= '1';
          else
            cur_st <= tree_map_init1;
            count_cycle <= 0;
            c1_load1 <= '1'; c2_load1 <= '1';
            c1_clr <= '0'; c2_clr <= '0';
          end if;

        when br_update1 =>
          mulreg_clr <= '0'; mulreg_en <= '0';
          addregwen <= '0'; addregwclr <= '0'; addregen <= '0'; addregclr <= '0';
          distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0'; divreg1clr <= '0';
          incnt_inc <= '0'; valid_td <= '0'; c2_read <= '0'; c1_incr <= '0';
          if count_gr = '1' then
            cur_st <= c2_incr_st;
            c1_load2 <= '1'; c2_load2 <= '1'; mem_update <= '1'; c2_incr <= '0';
          else
            cur_st <= c2_read_st1;
            c1_load2 <= '0'; c2_load2 <= '0'; mem_update <= '1'; c2_incr <= '1';
          end if;

        when c2_read_st1 =>
          c2_incr <= '0'; c1_incr <= '1'; c2_read <= '1'; mem_update <= '0';
          cur_st <= br_update1;

        when c2_incr_st =>
          mem_update <= '0'; c1_load2 <= '0'; c2_load2 <= '0'; c2_incr <= '1';
          cur_st <= c2_read_st2;

        when c2_read_st2 =>
          c2_incr <= '0'; c1_incr <= '0'; c2_read <= '1'; mem_update <= '0';
          cur_st <= br_update2;

        when br_update2 =>
          c2_read <= '0'; c1_incr <= '0';
          if count_gr = '1' then
            cur_st <= addr2_gen_st;
            addr_gen2_en <= '0'; mem_update <= '0'; c2_incr <= '0';
            c2_clr <= '1'; c1_clr <= '1'; row_col_sel <= "01";
          else
            cur_st <= c2_read_st22;
            addr_gen2_en <= '0'; c2_incr <= '1'; mem_update <= '1';
            row_col_sel <= "00";
          end if;

        when c2_read_st22 =>
          c2_incr <= '0'; c1_incr <= '1'; c2_read <= '1'; mem_update <= '0';
          cur_st <= br_update2;

        when addr2_gen_st =>
          addr_gen2_en <= '1'; c2_clr <= '0'; c1_clr <= '0';
          c2_incr <= '0'; mem_update <= '0';
          cur_st <= rmem_read_st2;

        when rmem_read_st2 =>
          rmem_read <= '1';
          addregwen <= '0'; addregwclr <= '0'; addregen <= '0'; addregclr <= '0';
          distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0'; divreg1clr <= '0';
          ch_incr <= '0'; c2_incr <= '0';
          write_mem <= '0'; write_wmem <= '0'; mem_update <= '0';
          cur_st <= addr_mod1_st;

        when addr_mod1_st =>
          rmem_read <= '0';
          cur_st <= addr_mod2_st;

        when addr_mod2_st =>
          cur_st <= fetch_dst;

        when fetch_dst =>
          addregwen <= '0'; addregwclr <= '0'; addregen <= '0'; addregclr <= '0';
          distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0'; divreg1clr <= '0';
          read_mem <= '1'; c2_incr <= '0'; write_mem <= '0'; write_wmem <= '0';
          mem_update <= '0'; read_wmem <= '1'; addr_gen2_en <= '1'; rmem_read <= '0';
          ch_incr <= '0';
          cur_st <= wait_st2;

        when wait_st2 =>
          -- wait for four clock cycles for the data to arrive
          read_mem <= '0'; write_mem <= '0'; read_wmem <= '0'; write_wmem <= '0';
          addr_gen1_en <= '0';
          if count_cycle < 2 then
            cur_st <= wait_st2;
            count_cycle <= count_cycle + 1;
          else
            cur_st <= mul_dst;
            count_cycle <= 0;
          end if;
          ch_incr <= '0';

        when mul_dst =>
          mulreg_en <= '1'; valid_td <= '0'; incnt_inc <= '0'; done <= '0';
          addr_gen2_en <= '0'; ch_incr <= '0';
          cur_st <= add_dst;  -- transition absent in the source listing;
                              -- restored so add_dst is reachable and the
                              -- machine does not stall here

        when add_dst =>
          read_mem <= '0'; write_mem <= '0'; read_wmem <= '0'; write_wmem <= '0';
          mulreg_en <= '0';
          addregwen <= '1'; addregwclr <= '0'; addregen <= '1'; addregclr <= '0';
          distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0';
          divreg1clr <= '0'; divreg1en <= '0';
          numsp_val <= '0'; valid_td <= '0'; done <= '0'; incnt_inc <= '0';
          if child_cnt_gr = '0' then
            cur_st <= rmem_read_st2;
            row_col_sel <= "10"; ch_incr <= '1';
          else
            cur_st <= div_dst;
            row_col_sel <= "11"; ch_incr <= '1';
          end if;
          addr_gen1_en <= '0'; addr_gen2_en <= '1';

        when div_dst =>
          ch_incr <= '0';
          read_mem <= '0'; write_mem <= '0'; read_wmem <= '0'; write_wmem <= '0';
          addregwen <= '0'; addregwclr <= '0'; addregen <= '0'; addregclr <= '0';
          distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0'; divreg1clr <= '0';
          numsp_val <= '0'; valid_td <= '0'; done <= '0'; incnt_inc <= '0';
          if div_valid = '1' then
            cur_st <= wait_rmem;
            divreg1en <= '1'; rmem_read <= '1';
          else
            cur_st <= div_dst;
            divreg1en <= '0'; rmem_read <= '0';
          end if;

        when wait_rmem =>
          rmem_read <= '0'; read_mem <= '0'; read_wmem <= '0'; divreg1en <= '0';
          c2_incr <= '1';
          cur_st <= write_dist_to_mem;

        when write_dist_to_mem =>
          rmem_read <= '0'; read_mem <= '0'; write_mem <= '1';
          read_wmem <= '0'; write_wmem <= '1';
          addregwen <= '0'; addregwclr <= '1'; addregen <= '0'; addregclr <= '1';
          mulreg_en <= '0'; mulreg_clr <= '1';
          distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0';
          divreg1en <= '0'; divreg1clr <= '0';
          numsp_val <= '0'; valid_td <= '0'; done <= '0'; incnt_inc <= '0';
          c2_incr <= '0';
          cur_st <= write_mem_wait;

        when write_mem_wait =>
          write_wmem <= '0'; write_mem <= '0';
          addregwclr <= '0'; addregclr <= '0'; mulreg_clr <= '0'; c2_incr <= '0';
          if count_gr = '1' then
            cur_st <= mem_init;
            addr_gen2_en <= '0'; node_write <= '1'; c1p_incr <= '1';
            c1p_clr <= '0'; c2_clr <= '1'; R_dec <= '1'; Rp_dec <= '1';
            row_col_sel <= "00";
          else
            addr_gen2_en <= '1';
            cur_st <= rmem_read_st2;
            node_write <= '0'; c2_clr <= '0'; c1p_incr <= '0';
            R_dec <= '0'; Rp_dec <= '0'; row_col_sel <= "01";
          end if;

        when tree_map_init1 =>
          read_mem <= '0'; initial_run <= '0'; addr2_reg_dec <= '1';
          write_mem <= '0'; read_wmem <= '0'; write_wmem <= '0';
          mulreg_en <= '0'; mulreg_clr <= '1';
          addregwen <= '0'; addregwclr <= '1'; addregen <= '0'; addregclr <= '1';
          distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0';
          divreg1en <= '0'; divreg1clr <= '1';
          numsp_val <= '0'; nodeid_sel <= "00"; incnt_inc <= '0'; n_type_sel <= '0';
          valid_td <= '1'; c1_load1 <= '0'; c2_load1 <= '0'; c2_incr <= '1';
          done <= '0';
          cur_st <= tree_map_init2;

        when tree_map_init2 =>
          c2_incr <= '0'; addr2_reg_dec <= '0';
          read_mem <= '0'; write_mem <= '0'; initial_run <= '0';
          read_wmem <= '0'; write_wmem <= '0';
          addregwen <= '0'; addregwclr <= '1'; addregen <= '0'; addregclr <= '1';
          distreg_clr <= '0'; rowreg_clr <= '0'; colreg_clr <= '0';
          divreg1en <= '0'; divreg1clr <= '1';
          numsp_val <= '0'; nodeid_sel <= "01"; n_type_sel <= '0';
          valid_td <= '1'; done <= '0';
          if all_nodes_done = '1' then
            cur_st <= tree_map2;
            incnt_inc <= '0'; c2_read <= '0';
          else
            incnt_inc <= '1'; c2_read <= '1';
            cur_st <= br_update1;
          end if;

        when tree_map2 =>
          read_mem <= '0'; initial_run <= '0'; write_mem <= '0';
          read_wmem <= '0'; write_wmem <= '0';
          addregwen <= '0'; addregwclr <= '1'; addregen <= '0'; addregclr <= '1';
          divreg1clr <= '1'; divreg1en <= '0';
          distreg_clr <= '1'; rowreg_clr <= '1'; colreg_clr <= '1';
          numsp_val <= '0'; nodeid_sel <= "10"; n_type_sel <= '1';
          valid_td <= '1'; done <= '0'; incnt_inc <= '0';
          cur_st <= done_st;

        when others =>
          read_mem <= '0'; initial_run <= '0'; write_mem <= '0';
          read_wmem <= '0'; write_wmem <= '0';
          addregwen <= '0'; addregwclr <= '0'; addregclr <= '0';
          divreg1clr <= '0'; divreg1en <= '0';
          numsp_val <= '0'; valid_td <= '0'; incnt_inc <= '0';
          done <= '1';
          cur_st <= idle;

      end case;
    end if;
  end process;

end ctrl_beh;
------------------------------------------------------
-- Counter Entity Architecture pairs
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : counter.vhd
-- Entity       : counter
-- Architecture : counter_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity counter is
  port(
    Clk : in  std_logic;
    Res : in  std_logic;
    ld  : in  std_logic;
    clr : in  std_logic;
    inp : in  std_logic_vector(9 downto 0);
    cv  : in  std_logic_vector(9 downto 0);
    inc : in  std_logic;
    cnt : out std_logic_vector(9 downto 0);
    grt : out std_logic
  );
end counter;

architecture counter_beh of counter is
  signal count : std_logic_vector(9 downto 0);
begin
  -- Loadable, clearable counter that wraps to zero once it reaches cv
  process(clk, res)
  begin
    if res = '1' then
      count <= (others => '0');
    elsif clk = '1' and clk'event then
      if clr = '1' then
        count <= (others => '0');
      elsif ld = '1' then
        count <= inp;
      elsif inc = '1' then
        if count < cv then
          count <= unsigned(count) + '1';
        else
          count <= (others => '0');
        end if;
      end if;
    end if;
  end process;

  -- grt flags when the count has reached the comparison value
  process(res, count, cv)
  begin
    if res = '1' then
      grt <= '0';
    elsif count < cv then
      grt <= '0';
    else
      grt <= '1';
    end if;
  end process;

  cnt <= count;
end counter_beh;
------------------------------------------------
-- child node counter
------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity counterch is
  port(
    Clk : in  std_logic;
    Res : in  std_logic;
    ld  : in  std_logic;
    clr : in  std_logic;
    inp : in  std_logic_vector(1 downto 0);
    inc : in  std_logic;
    cnt : out std_logic_vector(1 downto 0);
    grt : out std_logic
  );
end counterch;

architecture counterch_beh of counterch is
  constant cv    : std_logic_vector(1 downto 0) := "01";
  signal   count : std_logic_vector(1 downto 0);
begin
  process(clk, res)
  begin
    if res = '1' then
      count <= (others => '0');
    elsif clk = '1' and clk'event then
      if clr = '1' then
        count <= (others => '0');
      elsif ld = '1' then
        count <= inp;
      elsif inc = '1' then
        if count < cv then
          count <= unsigned(count) + '1';
        else
          count <= (others => '0');
        end if;
      end if;
    end if;
  end process;

  process(res, count)
  begin
    if res = '1' then
      grt <= '0';
    elsif count < cv then
      grt <= '0';
    else
      grt <= '1';
    end if;
  end process;

  cnt <= count;
end counterch_beh;
------------------------------------------------------
-- Divider Entity - Architecture Pair
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : divider32new1.vhd
-- Entity       : divider
-- Architecture : divider_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity divider is
  port(
    datainp1 : in  std_logic_vector(31 downto 0);
    divider  : in  std_logic_vector(15 downto 0);
    output   : out std_logic_vector(31 downto 0);
    valid    : out std_logic
  );
end divider;

architecture divider_beh of divider is

  -- Restoring shift-subtract division: on return, r(31 downto 0) holds the
  -- quotient, r(63 downto 32) the remainder; ov flags division by zero.
  procedure divide_proc(variable a, b : in  unsigned(31 downto 0);
                        variable r    : out unsigned(63 downto 0);
                        variable ov   : out std_logic) is
    variable temp1        : unsigned(63 downto 0);
    variable temp2, temp3 : unsigned(32 downto 0);
    constant C0 : unsigned := "00000000000000000000000000000000"; -- constant zero
    constant C1 : unsigned := "00000000000000000000000000000001"; -- constant one
  begin
    if b = C0 then
      r(31 downto 0)  := C0;
      r(63 downto 32) := C0;
      ov := '1';
    elsif a = b then
      r(31 downto 0)  := C1;
      r(63 downto 32) := C0;
      ov := '0';
    elsif a < b then
      r(31 downto 0)  := C0;
      r(63 downto 32) := a;
      ov := '0';
    else
      temp1(31 downto 0)  := a;
      temp1(63 downto 32) := C0;
      temp3 := "0" & b;
      for i in 0 to 31 loop
        -- shift the next dividend bit into the partial remainder
        temp1(63 downto 1) := temp1(62 downto 0);
        temp1(0) := '0';
        -- trial subtraction; temp2(32) = '1' means the divisor fits
        temp2 := "1" & temp1(63 downto 32);
        temp2 := temp2 - temp3;
        if temp2(32) = '1' then
          temp1(0) := '1';
          temp1(63 downto 32) := temp2(31 downto 0);
        end if;
      end loop;
      r  := temp1;
      ov := '0';
    end if;
  end divide_proc;

begin

  process(datainp1, divider)
    variable a_inp, b_inp : unsigned(31 downto 0);
    variable r_sig        : unsigned(63 downto 0);
    variable ov_sig       : std_logic;
  begin
    a_inp := unsigned(datainp1);
    b_inp := unsigned("0000000000000000" & divider);
    divide_proc(a_inp, b_inp, r_sig, ov_sig);
    output <= std_logic_vector(r_sig(31 downto 0));
    valid  <= not ov_sig;
  end process;

end divider_beh;
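The divide_proc procedure above is a standard restoring shift-subtract divider. The following Python model (illustrative, not part of the thesis sources) reproduces the same loop and can serve as a golden reference when checking simulation results.

```python
def divide_model(a, b, width=32):
    """Restoring shift-subtract division, mirroring divide_proc.

    Returns (quotient, remainder, overflow); overflow corresponds to the
    ov flag the hardware raises on division by zero."""
    if b == 0:
        return 0, 0, True
    rem, quo = 0, 0
    for i in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder, then
        # subtract the divisor when it fits (the temp2(32) = '1' test).
        rem = (rem << 1) | ((a >> i) & 1)
        if rem >= b:
            rem -= b
            quo |= 1 << i
    return quo, rem, False
```

In the hardware, the quotient lands in r(31 downto 0) and the remainder in r(63 downto 32); the a = b and a < b shortcut branches produce the same (1, 0) and (0, a) results this model computes through the loop.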
------------------------------------------------------
-- Register for first_val signal
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : first_val_reg.vhd
-- Entity       : first_val_ddd
-- Architecture : first_val_ddd_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity first_val_ddd is
  port(
    first_val       : in  std_logic;
    reset           : in  std_logic;
    clk             : in  std_logic;
    first_val_ddd_s : out std_logic
  );
end first_val_ddd;

architecture first_val_ddd_beh of first_val_ddd is
  signal temp  : std_logic;
  signal temp1 : std_logic;
begin
  -- Three-stage delay line for first_val
  process(clk, reset)
  begin
    if reset = '1' then
      temp            <= '0';
      temp1           <= '0';
      first_val_ddd_s <= '0';
    elsif Rising_Edge(clk) then
      temp            <= first_val;
      temp1           <= temp;
      first_val_ddd_s <= temp1;
    end if;
  end process;
end first_val_ddd_beh;
library ieee;
use ieee.std_logic_1164.all;

entity d_f is
  port(
    inp   : in  std_logic;
    reset : in  std_logic;
    clk   : in  std_logic;
    opt   : out std_logic
  );
end d_f;

architecture d_f_beh of d_f is
begin
  -- D flip-flop with asynchronous reset
  process(Clk, Reset)
  begin
    if Reset = '1' then
      opt <= '0';
    elsif Rising_Edge(clk) then
      opt <= inp;
    end if;
  end process;
end d_f_beh;
------------------------------------------------------
-- Height Memory entity - architecture
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : hmemory.vhd
-- Entity       : hmemory
-- Architecture : hmem_behave
------------------------------------------------------
library ieee, xilinx_lib;
use ieee.std_logic_1164.all;
use xilinx_lib.VIRTEX.all;

entity hmemory is
  port(
    Clk      : in  std_logic;
    Reset    : in  std_logic;
    Read     : in  std_logic;
    Write    : in  std_logic;
    numsp    : in  std_logic_vector(9 downto 0);
    Addr     : in  std_logic_vector(9 downto 0);
    Data     : in  std_logic_vector(15 downto 0);
    Data_out : out std_logic_vector(15 downto 0)
  );
end hmemory;

architecture hmem_behave of hmemory is

  signal memAddr   : integer;
  signal Write_En0 : std_logic;
  signal Write_En1 : std_logic;
  signal Write_En2 : std_logic;
  signal Write_En3 : std_logic;
  signal DataOut0  : std_logic_vector(15 downto 0);
  signal DataOut1  : std_logic_vector(15 downto 0);
  signal DataOut2  : std_logic_vector(15 downto 0);
  signal DataOut3  : std_logic_vector(15 downto 0);
  signal enable    : std_logic;

begin

  --**************************************
  -- Write logic: the two high address bits select one of four banks
  Write_En0 <= Write when (Addr(9 downto 8) = "00") else '0';
  Write_En1 <= Write when (Addr(9 downto 8) = "01") else '0';
  Write_En2 <= Write when (Addr(9 downto 8) = "10") else '0';
  Write_En3 <= Write when (Addr(9 downto 8) = "11") else '0';

  --**************************************
  -- Enable logic -- simply the OR of Read and Write
  enable <= Read or Write;

  --**************************************
  -- Read logic -- combinational bank select on the high address bits
  Data_out <= DataOut0 when Addr(9 downto 8) = "00" else
              DataOut1 when Addr(9 downto 8) = "01" else
              DataOut2 when Addr(9 downto 8) = "10" else
              DataOut3 when Addr(9 downto 8) = "11" else
              (others => '0');

  --**************************************
  -- Instantiate 4 256x16 BlockRAMs
  U_BR0 : RAMB4_S16 port map(DI => Data, ADDR => Addr(7 downto 0), CLK => Clk,
                             RST => Reset, WE => Write_En0, EN => enable, DO => DataOut0);
  U_BR1 : RAMB4_S16 port map(DI => Data, ADDR => Addr(7 downto 0), CLK => Clk,
                             RST => Reset, WE => Write_En1, EN => enable, DO => DataOut1);
  U_BR2 : RAMB4_S16 port map(DI => Data, ADDR => Addr(7 downto 0), CLK => Clk,
                             RST => Reset, WE => Write_En2, EN => enable, DO => DataOut2);
  U_BR3 : RAMB4_S16 port map(DI => Data, ADDR => Addr(7 downto 0), CLK => Clk,
                             RST => Reset, WE => Write_En3, EN => enable, DO => DataOut3);

end hmem_behave;
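The banking scheme above builds a 1024x16 memory from four 256-word BlockRAMs, with Addr(9 downto 8) selecting the bank and Addr(7 downto 0) indexing within it. A small software sketch of that address decode (Python, illustrative only, not part of the thesis sources):

```python
# Software model (illustrative) of hmemory's four-bank address decode.
class BankedMemory:
    def __init__(self, banks=4, words=256):
        self.banks = [[0] * words for _ in range(banks)]

    def _decode(self, addr):
        # (bank select = Addr(9:8), bank offset = Addr(7:0))
        return (addr >> 8) & 0x3, addr & 0xFF

    def write(self, addr, data):
        bank, off = self._decode(addr)
        self.banks[bank][off] = data & 0xFFFF  # 16-bit word

    def read(self, addr):
        bank, off = self._decode(addr)
        return self.banks[bank][off]
```

Splitting a flat address space over several small RAM primitives this way is the standard method of composing larger memories on Virtex parts, since each RAMB4 block holds only 4 Kbits.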
------------------------------------------------------
-- Multiplier entity-architecture
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : mult.vhd
-- Entity       : mult
-- Architecture : mult_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use work.std_logic_prims.all;

entity mult is
  port(
    Datainp1 : in  std_logic_vector(31 downto 0);
    Datainp2 : in  std_logic_vector(15 downto 0);
    Data_out : out std_logic_vector(31 downto 0)
  );
end mult;

architecture mult_beh of mult is
begin
  process(Datainp1, Datainp2)
    variable inp1, inp2, outp : integer range 0 to 1000;
  begin
    inp1 := std_logic_vector_to_integer(Datainp1);
    inp2 := std_logic_vector_to_integer(Datainp2);
    outp := inp1 * inp2;
    Data_out <= integer_to_std_logic_vector(outp, 31);
  end process;
end mult_beh;
------------------------------------------------------
-- Multiplexer entity architecture pairs
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : mux.vhd
-- Entity       : mux2_1, mux3_1, mux2_1_16
-- Architecture : mux2_1_behave, mux3_1_behave,
--                mux2_1_16_behave
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity mux2_1 is
  port(
    inp1 : in  std_logic_vector(9 downto 0);
    inp2 : in  std_logic_vector(9 downto 0);
    sel  : in  std_logic;
    outp : out std_logic_vector(9 downto 0)
  );
end mux2_1;

architecture mux2_1_behave of mux2_1 is
begin
  outp <= inp1 when sel = '0' else inp2;
end mux2_1_behave;

library ieee;
use ieee.std_logic_1164.all;

entity mux2_1_16 is
  port(
    inp1 : in  std_logic_vector(15 downto 0);
    inp2 : in  std_logic_vector(15 downto 0);
    sel  : in  std_logic;
    outp : out std_logic_vector(15 downto 0)
  );
end mux2_1_16;

architecture mux2_1_16behave of mux2_1_16 is
begin
  outp <= inp1 when sel = '0' else inp2;
end mux2_1_16behave;

library ieee;
use ieee.std_logic_1164.all;

entity mux3_1 is
  port(
    inp1 : in  std_logic_vector(9 downto 0);
    inp2 : in  std_logic_vector(9 downto 0);
    inp3 : in  std_logic_vector(9 downto 0);
    sel  : in  std_logic_vector(1 downto 0);
    outp : out std_logic_vector(9 downto 0)
  );
end mux3_1;

architecture mux3_1_behave of mux3_1 is
begin
  outp <= inp1 when sel = "01" else
          inp2 when sel = "10" else
          inp3;
end mux3_1_behave;
------------------------------------------------------
-- Register that stores number of species value
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : numspreg.vhd
-- Entity       : numofspreg
-- Architecture : numofspreg_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity numofspreg is
  port(
    numsp    : in  std_logic_vector(9 downto 0);
    reset    : in  std_logic;
    clk      : in  std_logic;
    valid_in : in  std_logic;
    valid    : out std_logic;
    regval   : out std_logic_vector(9 downto 0)
  );
end numofspreg;

architecture numofspreg_beh of numofspreg is
begin
  process(clk, reset)
  begin
    if reset = '1' then
      valid  <= '0';
      regval <= (others => '0');
    elsif Rising_Edge(clk) then
      if valid_in = '1' then
        regval <= numsp;
        valid  <= '1';
      else
        valid  <= '0';
        regval <= (others => '0');
      end if;
    end if;
  end process;
end numofspreg_beh;
------------------------------------------------------
-- Output Selector and Address Generator
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : opt_sel_10.vhd
-- Entity       : opt_sel
-- Architecture : opt_sel_beh
------------------------------------------------------
library ieee, xilinx_lib;
use ieee.std_logic_1164.all;
use xilinx_lib.VIRTEX.all;
use ieee.std_logic_arith.all;
use work.std_logic_prims.all;

entity opt_sel is
  port(
    clk, reset                          : in  std_logic;
    valid_numsp                         : in  std_logic;
    numsp                               : in  std_logic_vector(9 downto 0);
    row_reg, col_reg                    : in  std_logic_vector(9 downto 0);
    store_cur_addr                      : in  std_logic;
    node_mem_initialize, mem_initialize : in  std_logic;
    addr_gen1_en, addr_gen2_en          : in  std_logic;
    c2_read, mem_update                 : in  std_logic;
    addr1_reg_en, addr2_reg_en          : in  std_logic;
    R_dec, Rp_dec                       : in  std_logic;
    c1_incr, c2_incr, c1p_incr, ch_incr : in  std_logic;
    c1_load1, c1_load2                  : in  std_logic;
    c2_load1, c2_load2                  : in  std_logic;
    c1p_load, ch_load                   : in  std_logic;
    c1_clr, c2_clr, c1p_clr, ch_clr     : in  std_logic;
    row_col_sel                         : in  std_logic_vector(1 downto 0);
    node_write, addr2_reg_dec           : in  std_logic;
    distreg_val                         : in  std_logic_vector(31 downto 0);
    nodeid_sel                          : in  std_logic_vector(1 downto 0);
    n_type_sel, incnt_inc, initial_run  : in  std_logic;
    r_clr, r_inc, a_clr, a_inc          : in  std_logic;
    a_grt, ext_node                     : out std_logic;
    numsp_1                             : out std_logic_vector(9 downto 0);
    addr_cnt, row_cnt                   : out std_logic_vector(9 downto 0);
    first_val, initialized, addr_grt    : out std_logic;
    child_cnt_gr, count_gr              : out std_logic;
    all_nodes_done                      : out std_logic;
    addr                                : out std_logic_vector(19 downto 0);
    cnt1                                : out std_logic_vector(9 downto 0);
    nodeid                              : out std_logic_vector(9 downto 0);
    n_type                              : out std_logic;
    par                                 : out std_logic_vector(9 downto 0);
    br_len                              : out std_logic_vector(15 downto 0)
  );
end opt_sel;

architecture opt_sel_beh of opt_sel is
  signal temp_next_int   : integer range 0 to 512;
  signal incnt           : integer range 0 to 511;
  signal numsp1, numsp2  : std_logic_vector(9 downto 0);
  signal numsp_int       : integer range 0 to 511;
  signal valid_numsp_int : std_logic;
  signal child_cnt       : integer range 0 to 2;
  signal max_node_cnt    : integer range 0 to 511;
  signal all_nd_done     : std_logic;

  ----------------------------------------------------
  -- Address generator signal declarations
  ----------------------------------------------------
  signal count1_in, c1_comp_val, count1      : std_logic_vector(9 downto 0);
  signal c1_grt, c1_inc, c1_load             : std_logic;
  signal count1_16                           : std_logic_vector(15 downto 0);
  signal countp_in, c1p_comp_val, count1p    : std_logic_vector(9 downto 0);
  signal c1p_grt, c1p_inc                    : std_logic;
  signal count2_in, c2_comp_val, count2      : std_logic_vector(9 downto 0);
  signal c2_grt, c2_grt_d, c2_grt_p, c2_load : std_logic;
  signal chcount_in, ch_cnt                  : std_logic_vector(1 downto 0);
  signal ch_grt                              : std_logic;
  signal addr1_reg, addr2_reg                : std_logic_vector(9 downto 0);
  signal addr1_reg_d, addr2_reg_d            : std_logic_vector(9 downto 0);
  signal addr1_reg_dd, addr2_reg_dd          : std_logic_vector(9 downto 0);
  signal addr1_reg_ddd, addr2_reg_ddd        : std_logic_vector(9 downto 0);
  signal addr1_reg_dddd, addr2_reg_dddd      : std_logic_vector(9 downto 0);
  signal R, Rp                               : std_logic_vector(9 downto 0);
  signal count2_in_sel                       : std_logic_vector(1 downto 0);

  ----------------------------------------------------
  -- Block ram signals
  ----------------------------------------------------
  signal din_a, din_b, din_b1                  : std_logic_vector(15 downto 0);
  signal ena, wea, enb, web                    : std_logic;
  signal addra, addrb                          : std_logic_vector(15 downto 0);
  signal addra0, addra1                        : std_logic_vector(15 downto 0);
  signal addrb0, addrb1                        : std_logic_vector(15 downto 0);
  signal addr_temp                             : std_logic_vector(9 downto 0);
  signal addr_grt_t, addr_grt_tt, addr_grt_ttt : std_logic;
  signal addr_t                                : std_logic_vector(19 downto 0);
  signal wea0, wea1, web0, web1                : std_logic;
  signal ena0, ena1, enb0, enb1                : std_logic;

  ----------------------------------------------------
  -- Row memory map counters signals
  ----------------------------------------------------
  signal r_cnt   : std_logic_vector(9 downto 0);
  signal a_cval2 : integer range 0 to 511;
  signal a_cnt   : std_logic_vector(9 downto 0);
  signal e_n     : std_logic;

  ----------------------------------------------------
  -- Address generator component declarations
  ----------------------------------------------------
  component counter
    port(Clk : in  std_logic;
         Res : in  std_logic;
         ld  : in  std_logic;
         clr : in  std_logic;
         inp : in  std_logic_vector(9 downto 0);
         cv  : in  std_logic_vector(9 downto 0);
         inc : in  std_logic;
         cnt : out std_logic_vector(9 downto 0);
         grt : out std_logic);
  end component;

  component counterch
    port(Clk : in  std_logic;
         Res : in  std_logic;
         ld  : in  std_logic;
         clr : in  std_logic;
         inp : in  std_logic_vector(1 downto 0);
         inc : in  std_logic;
         cnt : out std_logic_vector(1 downto 0);
         grt : out std_logic);
  end component;
component mux2_1 port(inp1 : in std_logic_vector(9 downto 0);
inp2 : in std_logic_vector(9 downto 0); sel : in std_logic; outp : out std_logic_vector(9 downto 0));
end component;
component mux2_1_16 port(inp1 : in std_logic_vector(15 downto 0);
inp2 : in std_logic_vector(15 downto 0); sel : in std_logic; outp : out std_logic_vector(15 downto 0));end component; component mux3_1 port(inp1 : in std_logic_vector(9 downto 0);
inp2 : in std_logic_vector(9 downto 0); inp3 : in std_logic_vector(9 downto 0); sel : in std_logic_vector(1 downto 0); outp : out std_logic_vector(9 downto 0));
end component;
begin
  process(incnt)
  begin
    temp_next_int <= incnt + 1;
  end process;
  process(clk, reset)
  begin
    if (reset = '1') then
      all_nodes_done <= '0';
      all_nd_done <= '0';
    elsif (clk = '1' and clk'event) then
      if temp_next_int > max_node_cnt then
        all_nodes_done <= '1';
        all_nd_done <= '1';
      else
        all_nodes_done <= '0';
        all_nd_done <= '0';
      end if;
    end if;
  end process;
  process(clk, reset)
  begin
    if (reset = '1') then
      max_node_cnt <= 0;
      a_cval2 <= 0;
    elsif (clk = '1' and clk'event) then
      if valid_numsp_int = '1' then
        max_node_cnt <= (2 * numsp_int);
        a_cval2 <= 2 * numsp_int - 1;
      end if;
    end if;
  end process;
  process(clk, reset)
  begin
    if (reset = '1') then
      numsp_int <= 0;
      valid_numsp_int <= '0';
      numsp1 <= (others => '0');
    elsif (clk = '1' and clk'event) then
      if valid_numsp = '1' then
        numsp1 <= integer_to_std_logic_vector(
                    (std_logic_vector_to_integer(numsp) - 1), 9);
        numsp_int <= std_logic_vector_to_integer(numsp) - 1;
        valid_numsp_int <= '1';
      end if;
    end if;
  end process;
numsp_1 <= numsp1;
  process(clk, reset)
  begin
    if (reset = '1') then
      incnt <= 0;
    elsif (clk = '1' and clk'event) then
      if valid_numsp = '1' then
        incnt <= std_logic_vector_to_integer(numsp);
      elsif incnt_inc = '1' then
        incnt <= incnt + 1;
      end if;
    end if;
  end process;
--opt sel processes
--n_type output select
  process(n_type_sel)
  begin
    n_type <= n_type_sel;
  end process;

  --nodeid output select
  process(nodeid_sel, row_reg, col_reg, incnt)
  begin
    case nodeid_sel is
      when "00" => nodeid <= row_reg;
      when "01" => nodeid <= col_reg;
      when others => nodeid <= integer_to_std_logic_vector(incnt, 9);
    end case;
  end process;

  --par select
  process(incnt)
  begin
    par <= integer_to_std_logic_vector(incnt, 9);
  end process;

  --br_len select
  process(distreg_val)
  begin
    br_len <= '0' & distreg_val(15 downto 1);
  end process;

  -- Signal indicating first value to be stored in least distance register
  process(reset, clk)
  begin
    if reset = '1' then
      first_val <= '0';
    elsif clk = '1' and clk'event then
      if (initial_run = '1') then
        first_val <= '1';
      else
        first_val <= '0';
      end if;
    end if;
  end process;
  ----------------------------------------------------------------------------------
  -- Address Generator
  ----------------------------------------------------------------------------------
  U0 : counter port map(Clk, Reset, c1_load, c1_clr, count1_in,
                        c1_comp_val, c1_inc, count1, c1_grt);
  U1 : counter port map(Clk, Reset, c1p_load, c1p_clr, countp_in,
                        c1p_comp_val, c1p_inc, count1p, c1p_grt);
  U2 : counter port map(Clk, Reset, c2_load, c2_clr, count2_in,
                        c2_comp_val, c2_incr, count2, c2_grt);
  u3 : counterch port map(Clk, Reset, ch_load, ch_clr, chcount_in,
                          ch_incr, ch_cnt, ch_grt);
  U4 : mux2_1 port map(addr2_reg, addr1_reg, c1_load1, count1_in);
  u5 : mux2_1 port map(Rp, R, node_mem_initialize, c1_comp_val);
  u6 : mux3_1 port map(addr2_reg, addr1_reg, count1p, count2_in_sel,
                       count2_in);
  u7 : mux2_1 port map(R, Rp, addr_gen2_en, c2_comp_val);
  u8 : mux2_1_16 port map(addrb, count1_16, node_mem_initialize, din_a);

  count1_16 <= "000000" & count1;
  count2_in_sel <= c2_load1 & c2_load2;
  chcount_in <= "00";
  c1p_comp_val <= R;
  countp_in <= (others => '0');
  c1_inc <= c1_incr or (c2_grt_p and not addr_gen2_en);
  c1p_inc <= c1p_incr or (c2_grt_p and not addr_gen2_en);
  c1_load <= c1_load1 or c1_load2;
  c2_load <= c2_load1 or c2_load2;
  ---------------------------------------------------------------
  -- Ena, Wea, Enb, and Web signals
  ---------------------------------------------------------------
  ena <= node_mem_initialize or mem_initialize or addr_gen1_en or mem_update;
  wea <= node_mem_initialize or mem_update;
  enb <= mem_initialize or addr_gen1_en or addr_gen2_en or c2_read or node_write;
  web <= node_write; --c2_grt and addr_gen2_en;
  ---------------------------------------------------------------
  -- Dia and Dib input data
  ---------------------------------------------------------------
  process (Clk, Reset)
  begin
    if Reset = '1' then
      din_b <= (others => '0');
    elsif Clk = '1' and Clk'event then
      if incnt_inc = '1' then
        din_b <= integer_to_std_logic_vector(incnt, 15);
      end if;
    end if;
  end process;
  --**************************************
  -- Write logic
  --**************************************
  wea0 <= wea when (count1(9 downto 8) = "00") else '0';
  wea1 <= wea when (count1(9 downto 8) = "01") else '0';
  web0 <= web when (count2(9 downto 8) = "00") else '0';
  web1 <= web when (count2(9 downto 8) = "01") else '0';
  --**************************************
  -- Enable logic
  --**************************************
  ena0 <= ena when (count1(9 downto 8) = "00") else '0';
  ena1 <= ena when (count1(9 downto 8) = "01") else '0';
  enb0 <= enb when (count1(9 downto 8) = "00") else '0';
  enb1 <= enb when (count1(9 downto 8) = "01") else '0';
  --**************************************
  -- Read logic
  --**************************************
  addra <= addra0 when count1(9 downto 8) = "00" else
           addra1 when count1(9 downto 8) = "01" else
           (others => '0');
  addrb <= addrb0 when count2(9 downto 8) = "00" else
           addrb1 when count2(9 downto 8) = "01" else
           (others => '0');
  u9 : RAMB4_S16_S16 port map (
    ADDRA => count1(7 downto 0), DIA => din_a, WEA => wea0,
    CLKA => Clk, RSTA => Reset, ENA => ena0, DOA => addra0,
    ADDRB => count2(7 downto 0), DIB => din_b, WEB => web0,
    CLKB => Clk, RSTB => Reset, ENB => enb0, DOB => addrb0);

  u10 : RAMB4_S16_S16 port map (
    ADDRA => count1(7 downto 0), DIA => din_a, WEA => wea1,
    CLKA => Clk, RSTA => Reset, ENA => ena1, DOA => addra1,
    ADDRB => count2(7 downto 0), DIB => din_b, WEB => web1,
    CLKB => Clk, RSTB => Reset, ENB => enb1, DOB => addrb1);
  -----------------------------------------
  -- addr1_reg and addr2_reg
  -----------------------------------------
  process(clk, reset)
  begin
    if Reset = '1' then
      addr1_reg <= (others => '0');
    elsif Clk'event and Clk = '1' then
      if addr1_reg_en = '1' then
        addr1_reg <= addr1_reg_d;
      end if;
    end if;
  end process;

  process (Clk, Reset)
  begin
    if Reset = '1' then
      addr1_reg_d <= (others => '0');
      addr2_reg_d <= (others => '0');
    elsif Clk = '1' and Clk'event then
      if store_cur_addr = '1' then
        addr1_reg_d <= count1;  -- corrected: was a duplicate "addr2_reg_d <= count1",
                                -- which left addr1_reg_d undriven after reset
        addr2_reg_d <= count2;
      end if;
    end if;
  end process;

  process(clk, reset)
  begin
    if Reset = '1' then
      addr2_reg <= (others => '0');
    elsif Clk'event and Clk = '1' then
      if addr2_reg_en = '1' then
        addr2_reg <= addr2_reg_d;
      elsif addr2_reg_dec = '1' then
        addr2_reg <= unsigned(addr2_reg) - 1;
      end if;
    end if;
  end process;
  ------------------------------------------
  -- R and Rp registers
  ------------------------------------------
  process(Clk, Reset)
  begin
    if Reset = '1' then
      R <= (others => '0');
    elsif Clk = '1' and Clk'event then
      if valid_numsp = '1' then
        R <= numsp; --(9 downto 0);
      elsif R_dec = '1' then
        R <= unsigned(R) - '1';
      end if;
    end if;
  end process;

  process(Clk, Reset)
  begin
    if Reset = '1' then
      Rp <= (others => '0');
    elsif Clk = '1' and Clk'event then
      if valid_numsp = '1' then
        Rp <= numsp; --(9 downto 0);
      elsif Rp_dec = '1' then
        Rp <= unsigned(Rp) - '1';
      end if;
    end if;
  end process;
  process(clk, reset)
  begin
    if reset = '1' then
      c2_grt_d <= '0';
    elsif clk'event and clk = '1' then
      c2_grt_d <= c2_grt;
    end if;
  end process;
c2_grt_p <= c2_grt and not(c2_grt_d);
  addr <= addr_temp & addrb(9 downto 0) when addr_gen2_en = '1' else
          addra(9 downto 0) & addrb(9 downto 0);
  -- addr_temp should be set
  addr_temp <= row_reg when row_col_sel = "01" else
               col_reg when row_col_sel = "10" else
               din_b(9 downto 0);

  count_gr <= c2_grt or all_nd_done;
child_cnt_gr <= ch_grt;
cnt1 <= count1;
  process(clk, reset)
  begin
    if reset = '1' then
      addr_grt <= '0';
      addr_grt_t <= '0';
      addr_grt_tt <= '0';
      addr_grt_ttt <= '0';
    elsif Clk'event and Clk = '1' then
      addr_grt_t <= c1_grt;
      addr_grt_tt <= addr_grt_t;
      addr_grt_ttt <= addr_grt_tt;
      addr_grt <= addr_grt_ttt;
    end if;
  end process;
initialized <= c1_grt;
  -----------------------------------------------
  -- counters for row map memory initialization
  -----------------------------------------------
  process(Clk, Reset)
  begin
    if Reset = '1' then
      r_cnt <= (others => '0');
    elsif Clk'event and Clk = '1' then
      if r_clr = '1' then
        r_cnt <= (others => '0');
      elsif r_inc = '1' then
        if r_cnt < R then
          r_cnt <= unsigned(r_cnt) + '1';
        else
          r_cnt <= (others => '0');
        end if;
      end if;
    end if;
  end process;
  row_cnt <= r_cnt;

  process(Clk, Reset)
  begin
    if Reset = '1' then
      a_cnt <= (others => '0');
    elsif Clk'event and Clk = '1' then
      if a_clr = '1' then
        a_cnt <= (others => '0');
      elsif a_inc = '1' then
        if a_cnt < integer_to_std_logic_vector(a_cval2, 9) then
          a_cnt <= unsigned(a_cnt) + '1';
        else
          a_cnt <= (others => '0');
        end if;
      end if;
    end if;
  end process;
addr_cnt <= a_cnt;
  process(Reset, a_cnt, a_cval2, numsp1)
  begin
    if Reset = '1' then
      e_n <= '0';
      a_grt <= '0';
    elsif a_cnt < numsp1 then
      e_n <= '0';
      a_grt <= '0';
    elsif a_cnt < integer_to_std_logic_vector(a_cval2, 9) then
      e_n <= '1';
      a_grt <= '0';
    else
      e_n <= '1';
      a_grt <= '1';
    end if;
  end process;
ext_node <= e_n;
end opt_sel_beh;
------------------------------------------------------
-- Package for defining ieee std_logic primitives
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : pack.vhd
-- Entity       : NA
-- Architecture : NA
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
--
-- package for defining ieee std_logic primitives
--
package std_logic_prims is

  -- constant definitions
  constant bit_width   : integer := 63;
  constant bit_widthx2 : integer := 127;

  --
  -- function std_logic_vector_to_integer converts its
  -- std_logic_vector argument ibus, assumed to consist of '0'
  -- and '1' elements only, into an integer.
  --
  function std_logic_vector_to_integer(ibus: in std_logic_vector)
    return integer;

  --
  -- function integer_to_std_logic_vector converts its integer
  -- argument val into a std_logic_vector of range (n downto 0).
  --
  function integer_to_std_logic_vector(val, n: in integer)
    return std_logic_vector;
end std_logic_prims;
package body std_logic_prims is

  --
  -- function std_logic_vector_to_integer converts its
  -- std_logic_vector argument ibus, assumed to consist of '0'
  -- and '1' elements only, into an integer.
  --
  function std_logic_vector_to_integer(ibus: in std_logic_vector)
    return integer is
    variable result: integer := 0;
  begin
    for i in ibus'high downto 0 loop
      result := result * 2;
      if ibus(i) = '1' then
        result := result + 1;
      end if;
    end loop;
    return result;
  end std_logic_vector_to_integer;

  --
  -- function integer_to_std_logic_vector converts its integer
  -- argument val into a std_logic_vector of range (n downto 0).
  --
  function integer_to_std_logic_vector(val, n: in integer)
    return std_logic_vector is
    variable result: std_logic_vector(n downto 0);
    variable ival: integer := val;
  begin
    for i in 0 to n loop
      if (ival mod 2) = 1 then
        result(i) := '1';
      else
        result(i) := '0';
      end if;
      ival := ival / 2;
    end loop;
    return result;
  end integer_to_std_logic_vector;
end std_logic_prims;
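For reference, the bit ordering assumed by the two conversion functions in std_logic_prims (index 0 is the least-significant bit, matching the VHDL result range (n downto 0), so a call with n = 9 yields a 10-bit vector) can be checked with a small Python model. This is an illustrative sketch for the reader, not part of the thesis design:

```python
# Python model of the std_logic_prims conversion functions.
# A vector is modeled as a list of '0'/'1' characters with
# index 0 = LSB, mirroring a VHDL (n downto 0) std_logic_vector.

def std_logic_vector_to_integer(ibus):
    """Convert a bit-character list (LSB at index 0) to an integer."""
    result = 0
    # Walk from the MSB down to index 0, as the VHDL loop does.
    for i in range(len(ibus) - 1, -1, -1):
        result = result * 2
        if ibus[i] == '1':
            result = result + 1
    return result

def integer_to_std_logic_vector(val, n):
    """Convert val to a list of n+1 bits (indices 0..n), LSB first."""
    result = []
    for _ in range(n + 1):
        result.append('1' if val % 2 == 1 else '0')
        val = val // 2
    return result
```

Round-tripping a value, e.g. `std_logic_vector_to_integer(integer_to_std_logic_vector(300, 9))`, returns 300, which confirms that the two functions use a consistent LSB-at-index-0 convention.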
--------------------------------------------------------------------------
-- Entity       : PE
--
-- Architecture : pe_upgma_arch
--
-- Author       : Sreesa Akella
--
-- Filename     : pe_upgma_arch.vhd
--
-- Description  : PE architecture that implements the UPGMA design
--------------------------------------------------------------------------
------------------------------- Glossary ----------------------------------
--
-- Name Key:
-- =========
-- _AS : Address Strobe
-- _CE : Clock Enable
-- _CS : Chip Select
-- _DS : Data Strobe
-- _EN : Enable
-- _OE : Output Enable
-- _RD : Read Select
-- _WE : Write Enable
-- _WR : Write Select
-- _d[d...] : Delayed (registered) signal (each 'd' denotes one
--            level of delay)
-- _n : Active low signals (must be last part of name)
--
-- Port Name                        Dir  Description
-- ==============================   ===  ================================
-- Pads.Clocks.F_Clk                 I   Frequency synthesizer clock
-- Pads.Clocks.M_Clk                 I   Memory clock
-- Pads.Clocks.P_Clk                 I   Processor clock
-- Pads.Clocks.K_Clk                 I   LAD-bus clock
-- Pads.Clocks.IO_Clk                I   External I/O connector clock
-- Pads.Clocks.M_Clk_Out_Pe          O   M_Clk to the PE
-- Pads.Clocks.M_Clk_Out_CB_Ctrl     O   M_Clk to the CardBus controller
-- Pads.Clocks.M_Clk_Out_Right_Mem   O   M_Clk to the right memory bank
-- Pads.Clocks.M_Clk_Out_Left_Mem    O   M_Clk to the left memory bank
-- Pads.Clocks.P_Clk_Out_Pe          O   P_Clk to the PE
-- Pads.Clocks.P_Clk_Out_CB_Ctrl     O   P_Clk to the CardBus controller
-- Pads.Reset                        I   Global PE reset
-- Pads.Audio                        O   Pulse-width modulated audio pad
-- Pads.LAD_Bus.Addr_Data            B   LAD-bus shared address/data bus
-- Pads.LAD_Bus.AS_n                 I   LAD-bus address strobe
-- Pads.LAD_Bus.DS_n                 I   LAD-bus data strobe
-- Pads.LAD_Bus.Ack_n                O   LAD-bus acknowledge strobe
-- Pads.LAD_Bus.Reg_n                I   LAD-bus register select
-- Pads.LAD_Bus.WR_n                 I   LAD-bus write select
-- Pads.LAD_Bus.CS_n                 I   LAD-bus chip select
-- Pads.LAD_Bus.Int_Req_n            O   LAD-bus interrupt request
-- Pads.LAD_Bus.DMA_0_Data_OK_n      O   LAD-bus DMA chan 0 data OK flag
-- Pads.LAD_Bus.DMA_0_Burst_OK_n     O   LAD-bus DMA chan 0 burst OK flag
-- Pads.LAD_Bus.DMA_1_Data_OK_n      O   LAD-bus DMA chan 1 data OK flag
-- Pads.LAD_Bus.DMA_1_Burst_OK_n     O   LAD-bus DMA chan 1 burst OK flag
-- Pads.LAD_Bus.Reg_Data_OK_n        O   LAD-bus reg space data OK flag
-- Pads.LAD_Bus.Reg_Burst_OK_n       O   LAD-bus reg space burst OK flag
-- Pads.LAD_Bus.Force_K_Clk_n        O   LAD-bus K_Clk forced-run select
-- Pads.LAD_Bus.Reserved             -   Reserved for future use
-- Pads.Left_Mem.Addr                O   Left memory address bus
-- Pads.Left_Mem.Data                B   Left memory data bus
-- Pads.Left_Mem.Byte_WR_n           O   Left memory byte write select
-- Pads.Left_Mem.CS_n                O   Left memory chip select
-- Pads.Left_Mem.CE_n                O   Left memory clock enable
-- Pads.Left_Mem.WE_n                O   Left memory write enable
-- Pads.Left_Mem.OE_n                O   Left memory output enable
-- Pads.Left_Mem.Sleep_EN            O   Left memory sleep enable
-- Pads.Left_Mem.Load_EN_n           O   Left memory load enable
-- Pads.Left_Mem.Burst_Mode          O   Left memory burst mode select
-- Pads.Right_Mem.Addr               O   Right memory address bus
-- Pads.Right_Mem.Data               B   Right memory data bus
-- Pads.Right_Mem.Byte_WR_n          O   Right memory byte write select
-- Pads.Right_Mem.CS_n               O   Right memory chip select
-- Pads.Right_Mem.CE_n               O   Right memory clock enable
-- Pads.Right_Mem.WE_n               O   Right memory write enable
-- Pads.Right_Mem.OE_n               O   Right memory output enable
-- Pads.Right_Mem.Sleep_EN           O   Right memory sleep enable
-- Pads.Right_Mem.Load_EN_n          O   Right memory load enable
-- Pads.Right_Mem.Burst_Mode         O   Right memory burst mode select
-- Pads.Left_IO                      B   Left external I/O connector
-- Pads.Right_IO                     B   Right external I/O connector
--------------------------------------------------------------------------
-------------------------- Library Declarations ------------------------
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use work.std_logic_prims.all;

library PE_Lib;
use PE_Lib.PE_Package.all;

library LAD_Mux_Lib;
use LAD_Mux_Lib.LAD_Mux_Pkg.all;
use LAD_Mux_Lib.LAD_Mem32_Mux_Pkg.all;

library Mem_Mux_Lib;
use Mem_Mux_Lib.Mem32_Mux_Pkg.all;

library DMA_Mux_Lib;
use DMA_Mux_Lib.DMA_Mux_Pkg.all;
use DMA_Mux_Lib.DMA_LAD_Mem32_Mux_Pkg.all;
------------------------ Architecture Declaration ----------------------
architecture pe_upgma_arch of PE is
  ------------------------------- Glossary -----------------------------
  --
  -- Name Key:
  -- =========
  -- _AS : Address Strobe
  -- _CB : CardBus
  -- _CE : Clock Enable
  -- _CS : Chip Select
  -- _DS : Data Strobe
  -- _EN : Enable
  -- _OE : Output Enable
  -- _PE : Processing Element
  -- _RD : Read Select
  -- _WE : Write Enable
  -- _WR : Write Select
  -- _d[d...] : Delayed (registered) signal (each 'd' denotes one
  --            level of delay)
  -- _n : Active low signals (must be last part of name)
  --
  -- Name                         Width  Dir  Description
  -- =========================    =====  ===  ============================
  -- Clocks_In.F_Clk                1     I   Frequency synthesizer clock
  -- Clocks_In.M_Clk                1     I   Memory clock
  -- Clocks_In.P_Clk                1     I   Processing element clock
  -- Clocks_In.K_Clk                1     I   LAD-bus clock
  -- Clocks_In.F_Clk_Locked         1     I   U_Clk CLKDLL locked flag
  -- Clocks_In.M_Clk_Locked         1     I   M_Clk CLKDLL locked flag
  -- Clocks_In.P_Clk_Locked         1     I   P_Clk CLKDLL locked flag
  -- Global_Reset                   1     I   Global reset (or set) signal
  -- Audio_Out                      1     O   Pulse-width modulated audio
  --                                          output
  -- LAD_Mux_Bus(x).Addr           20     I   LAD bus DWORD address bus
  --                                          input
  -- LAD_Mux_Bus(x).Write           1     I   LAD bus write select
  -- LAD_Mux_Bus(x).Strobe          1     I   LAD bus register access strobe
  -- LAD_Mux_Bus(x).Mem_Strobe      1     I   LAD bus memory access strobe
  -- LAD_Mux_Bus(x).DMA_0_Strobe    1     I   LAD bus DMA channel 0 access
  --                                          strobe
  -- LAD_Mux_Bus(x).DMA_1_Strobe    1     I   LAD bus DMA channel 1 access
  --                                          strobe
  -- LAD_Mux_Bus(x).DMA_0_Done      1     I   DMA CH0 Completed signal
  -- LAD_Mux_Bus(x).DMA_1_Done      1     I   DMA CH1 Completed signal
  -- LAD_Mux_Bus(x).Reset           1     I   LAD bus reset signal
  -- LAD_Mux_Bus(x).Data_In        32     I   LAD bus data bus input
  -- LAD_Mux_Bus(x).Data_Out       32     O   LAD bus data bus output
  -- LAD_Mux_Bus(x).Akk             1     O   LAD bus transaction
  --                                          acknowledge
  -- LAD_Mux_Bus(x).Int_Req         1     O   LAD bus interrupt request
  -- LAD_Mux_Bus(x).DMA_0_Stat      2     O   LAD bus DMA Channel 0 status
  --                                          flags
  -- LAD_Mux_Bus(x).DMA_1_Stat      2     O   LAD bus DMA Channel 1 status
  --                                          flags
  --
  -- Left_Mem_Mux(x).Addr          32     O   Left on-board memory
  --                                          address bus
  -- Left_Mem_Mux(x).Write          1     O   Left on-board memory write
  --                                          select
  -- Left_Mem_Mux(x).Data_Out      32     O   Left on-board memory output
  --                                          data bus
  -- Left_Mem_Mux(x).Req            1     O   Left on-board memory access
  --                                          request
  -- Left_Mem_Mux(x).Akk            1     O   Left on-board memory access
  --                                          acknowledge
  -- Left_Mem_Mux(x).Data_In       32     I   Left on-board memory input
  --                                          data bus
  -- Left_Mem_Mux(x).Data_Valid     1     I   Left on-board memory valid
  --                                          read flag
  --
  -- Right_Mem_Mux(x).Addr         32     O   Right on-board memory
  --                                          address bus
  -- Right_Mem_Mux(x).Write         1     O   Right on-board memory write
  --                                          select
  -- Right_Mem_Mux(x).Data_Out     32     O   Right on-board memory output
  --                                          data bus
  -- Right_Mem_Mux(x).Req           1     O   Right on-board memory access
  --                                          request
  -- Right_Mem_Mux(x).Akk           1     O   Right on-board memory access
  --                                          acknowledge
  -- Right_Mem_Mux(x).Data_In      32     I   Right on-board memory input
  --                                          data bus
  -- Right_Mem_Mux(x).Data_Valid    1     I   Right on-board memory valid
  --                                          read flag
  --
  -- Left_IO_In.Data_In            13     I   Left I/O connector data
  --                                          input
  -- Left_IO_Out.Data_Out          13     O   Left I/O connector data
  --                                          output
  -- Left_IO_Out.Data_OE_n         13     O   Left I/O connector data
  --                                          output enable
  -- Right_IO_In.Data_In           13     I   Right I/O connector data
  --                                          input
  -- Right_IO_Out.Data_Out         13     O   Right I/O connector data
  --                                          output
  -- Right_IO_Out.Data_OE_n        13     O   Right I/O connector data
  --                                          output enable
  --
  ----------------------------------------------------------------------
  ----------------------------------------------------------------------
  --
  -- Below are all of the standard PE pad interface signals. Simply
  -- uncomment the signal(s) that are needed by the PE design. All
  -- other unused signals may remain commented out. Be sure to
  -- uncomment any component instances used by the interface.
  --
  ----------------------------------------------------------------------
  signal Clocks_In    : Clock_Std_IF_In_Type;
  signal Global_Reset : Reset_Std_IF_In_Type := '0';
  -- signal Audio_Out    : Audio_Std_IF_Out_Type;
  -- signal Left_IO_In   : IO_Conn_Std_IF_In_Type;
  -- signal Left_IO_Out  : IO_Conn_Std_IF_Out_Type;
  -- signal Right_IO_In  : IO_Conn_Std_IF_In_Type;
  -- signal Right_IO_Out : IO_Conn_Std_IF_Out_Type;

  ----------------------------------------------------------------------
  --
  -- Below are all of the multiplexing PE pad interface signals. Simply
  -- uncomment the signal(s) that are needed by the PE design and
  -- increase the vector sizes as needed. All other unused signals may
  -- remain commented out. Be sure to uncomment any component
  -- instances used by the interface.
  --
  ----------------------------------------------------------------------
  signal LAD_Mux_Bus   : LAD_Mux_vector(0 to 2);
  signal Left_Mem_Mux  : Mem32_Mux_vector(0 to 1);
  signal Right_Mem_Mux : Mem32_Mux_vector(0 to 1);
  signal LAD_Regs      : LAD_Mux_register_vector(0 to 1);
  ----------------------------------------------------------------------
  -- Component declaration of UPGMA top component
  ----------------------------------------------------------------------
  component upgma_top is
    port(clk         : in std_logic;
         Reset       : in std_logic;
         Data_in     : in std_logic_vector(31 downto 0);
         valid_numsp : in std_logic;
         valid_dst   : in std_logic;
         addr_cnt    : out std_logic_vector(9 downto 0);
         row_cnt     : out std_logic_vector(9 downto 0);
         ext_node    : out std_logic;
         rmem_read   : out std_logic;
         rmem_write  : out std_logic;
         ad_reg_en   : out std_logic;
         ad_reg_clr  : out std_logic;
         row_zero    : out std_logic;
         mem_addr    : out std_logic_vector(19 downto 0);
         read_mem    : out std_logic;
         write_mem   : out std_logic;
         numsp_1     : out std_logic_vector(9 downto 0);
         numsp       : in std_logic_vector(9 downto 0);
         avg_dst     : out std_logic_vector(31 downto 0);
         trout       : out std_logic_vector(36 downto 0);
         valid_td    : out std_logic;
         done        : out std_logic);
  end component;
  ----------------------------------------------------------------------
  -- UPGMA top component signal declarations
  ----------------------------------------------------------------------
  signal Data_in     : std_logic_vector(31 downto 0);
  signal valid_numsp : std_logic;
  signal valid_dst   : std_logic;
  signal mem_addr, mem_addr_t : std_logic_vector(19 downto 0);
  signal read_mem    : std_logic;
  signal write_mem   : std_logic;
  signal numsp       : std_logic_vector(9 downto 0);
  signal avg_dst     : std_logic_vector(31 downto 0);
  signal trout       : std_logic_vector(36 downto 0);
  signal valid_td    : std_logic;
  signal done        : std_logic;
  signal val_nsp_d   : std_logic;
  signal val_nsp_dd  : std_logic;
  signal done_d      : std_logic;
  signal rmem_read   : std_logic;
  signal rmem_write  : std_logic;
  signal ad_reg_en   : std_logic;
  signal ad_reg_clr  : std_logic;
  signal row_zero    : std_logic;
  signal ext_node    : std_logic;
  signal row_cnt, addr_cnt : std_logic_vector(9 downto 0);
  signal numsp_1     : std_logic_vector(9 downto 0);

  ----------------------------------------------------------------------
  -- Address modification component
  ----------------------------------------------------------------------
  component addr_mod
    port(Clk        : in std_logic;
         Reset      : in std_logic;
         addr_cnt   : in std_logic_vector(9 downto 0);
         row_cnt    : in std_logic_vector(9 downto 0);
         mem_addr   : in std_logic_vector(19 downto 0);
         numsp      : in std_logic_vector(9 downto 0);
         numsp_1    : in std_logic_vector(9 downto 0);
         ext_node   : in std_logic;
         rmem_read  : in std_logic;
         rmem_write : in std_logic;
         ad_reg_en  : in std_logic;
         ad_reg_clr : in std_logic;
         row_zero   : in std_logic;
         addr_modif : out std_logic_vector(15 downto 0));
  end component;

  ----------------------------------------------------------------------
  -- Address modification component signals
  ----------------------------------------------------------------------
  signal addr_modif : std_logic_vector(15 downto 0);
  ----------------------------------------------------------------------
  -- Memory input signals
  -- Left memory and right memory input signals
  ----------------------------------------------------------------------
  signal left_mem_data_Addr      : std_logic_vector(31 downto 0);
  signal left_mem_data_Write     : std_logic;
  signal left_mem_data_Data_Out  : std_logic_vector(31 downto 0);
  signal left_mem_data_Req       : std_logic;
  signal right_mem_data_Addr     : std_logic_vector(31 downto 0);
  signal right_mem_data_Write    : std_logic;
  signal right_mem_data_Data_Out : std_logic_vector(31 downto 0);
  signal right_mem_data_Req      : std_logic;
  signal en : std_logic;

begin

  ----------------------------------------------------------------------------
  --
  -- The following two components create a block RAM bridge from the LAD
  -- bus to the onboard left and right memories. Use the
  -- LAD_Mem_Bridge.c/.h source files to write data to these components
  -- from the host.
  --
  -- Each component needs a unique LAD_Mux_Bus, and either a unique
  -- Left_Mem_Mux or Right_Mem_Mux.
  --
  -- The Left Memory is located at address 0x1000 and the Right Memory
  -- at address 0x1200.
  --
  -- Physically these addresses come into the PE as 0x5000 and 0x5200;
  -- however, the 0x4000 REGISTER base is subtracted from the address
  -- if the USE_OLD_ADDRESSES generic is FALSE. This new address scheme
  -- was added to make the WC_PeRegRead and WC_PeRegWrite addresses match
  -- the addresses in the VHDL code.
  --
  ----------------------------------------------------------------------------
  U_Left_Bridge : LAD_Mem32_Bridge
    generic map ( BASE => x"1000" )
    port map
    (
      Kclk => Clocks_In.K_Clk,
      LAD  => LAD_Mux_Bus(1),
      Mclk => Clocks_In.M_Clk,
      Mem  => Left_Mem_Mux(0)
    );

  U_Right_Bridge : LAD_Mem32_Bridge
    generic map ( BASE => x"1200" )
    port map
    (
      Kclk => Clocks_In.K_Clk,
      LAD  => LAD_Mux_Bus(0),
      Mclk => Clocks_In.M_Clk,
      Mem  => Right_Mem_Mux(0)
    );

  ----------------------------------------------------------------------------
  --
  -- Instantiate a LAD_Mux_Register file of size 1.
  -- This single 32-bit register stores the value of numsp.
  --
  ----------------------------------------------------------------------------
  U_LAD_Mux_Reg : LAD_Mux_RegFile
    generic map
    (
      BASE  => x"2000",
      L2NUM => 1
    )
    port map
    (
      Kclk => Clocks_In.K_Clk,
      LAD  => LAD_Mux_Bus(2),
      Regs => LAD_Regs
    );
-- Tie the register output to the input so it can be read back
LAD_Regs(0).Data_Out <= LAD_Regs(0).Data_In;
  ----------------------------------------------------------------------
  --
  -- Below are all of the standard PE pad interface components. Simply
  -- uncomment the interface(s) that are needed by the PE design. All
  -- other unused interfaces may remain commented out. Be sure to
  -- uncomment any signal declarations used by the interface.
  --
  ----------------------------------------------------------------------

  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  --@@
  --@@ CLOCK STANDARD Interface. Uncomment this component
  --@@ to use K, M, P and F clocks. (This should almost
  --@@ always be uncommented.)
  --@@
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  U_Clocks : Clock_Std_IF
    generic map
    (
      USE_EXT_P_CLK_SOURCE => FALSE,
      REVISION             => REVD
    )
    port map
    (
      Global_Reset => Global_Reset,
      Pads         => Pads.Clocks,
      User_In      => Clocks_In
    );

  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  --@@
  --@@ LAD MUX INTERFACE
  --@@
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  U_LAD_MUX : LAD_Mux_IF
    generic map ( USE_OLD_ADDRESSES => FALSE )
    port map
    (
      Kclk    => Clocks_In.K_Clk,
      Reset   => Global_Reset,
      Pads    => Pads.LAD_Bus,
      Clients => LAD_Mux_Bus
    );

  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  --@@
  --@@ LEFT MEMORY MUX INTERFACE : The two interfaces below
  --@@ are mutually exclusive. Uncomment either the
  --@@ Mem32_Mux_Priority_IF or the Mem32_Mux_Fair_IF
  --@@ to use the left memory bank.
  --@@
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  U_Left_Mem_Mux : Mem32_Mux_Priority_IF
    generic map
    (
      AVOID_OVERFLOW  => TRUE,
      NUM_AKK_FIFOS   => 0,
      ROTATE_PRIORITY => FALSE,
      STICKY_PRIORITY => FALSE,
      REGISTERED_AKKS => FALSE,
      REGISTERED_REQS => FALSE
    )
    port map
    (
      Mclk    => Clocks_In.M_Clk,
      Reset   => Global_Reset,
      Pads    => Pads.Left_Mem,
      Clients => Left_Mem_Mux
    );
  -- U_Left_Mem_Mux : Mem32_Mux_Fair_IF
  --   generic map
  --   (
  --     AVOID_OVERFLOW => TRUE,
  --     REGISTER_DATA  => FALSE
  --   )
  --   port map
  --   (
  --     Mclk    => Clocks_In.M_Clk,
  --     Reset   => Global_Reset,
  --     Pads    => Pads.Left_Mem,
  --     Clients => Left_Mem_Mux
  --   );

  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  --@@
  --@@ RIGHT MEMORY MUX INTERFACE : The two interfaces below
  --@@ are mutually exclusive. Uncomment either the
  --@@ Mem32_Mux_Priority_IF or the Mem32_Mux_Fair_IF
  --@@ to use the right memory bank.
  --@@
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  U_Right_Mem_Mux_IF : Mem32_Mux_Priority_IF
    generic map
    (
      AVOID_OVERFLOW  => TRUE,
      NUM_AKK_FIFOS   => 0,
      ROTATE_PRIORITY => FALSE,
      STICKY_PRIORITY => FALSE,
      REGISTERED_AKKS => FALSE,
      REGISTERED_REQS => FALSE
    )
    port map
    (
      Mclk    => Clocks_In.M_Clk,
      Reset   => Global_Reset,
      Pads    => Pads.Right_Mem,
      Clients => Right_Mem_Mux
    );

  -- U_Right_Mem_Mux : Mem32_Mux_Fair_IF
  --   generic map
  --   (
  --     AVOID_OVERFLOW => TRUE,
  --     REGISTER_DATA  => FALSE
  --   )
  --   port map
  --   (
  --     Mclk    => Clocks_In.M_Clk,
  --     Reset   => Global_Reset,
  --     Pads    => Pads.Right_Mem,
  --     Clients => Right_Mem_Mux
  --   );

  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  --@@
  --@@ RESET INTERFACE : The following component provides
  --@@ a global reset to the entire PE. The Global_Reset
  --@@ signal is also tied to the GSR port of the
  --@@ STARTUP_VIRTEX. This component should almost
  --@@ always be uncommented.
  --@@
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  U_Reset : Reset_Std_IF
    port map
    (
      Clk     => Clocks_In.K_Clk,
      Pads    => Pads.Reset,
      User_In => Global_Reset
    );
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  --@@
  --@@ UPGMA top component: The following component
  --@@ reads the distance data from the left and
  --@@ right memories and reconstructs the phyloge-
  --@@ netic tree using the UPGMA algorithm
  --@@
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  U_UPGMA_TOP : upgma_top
    port map
    (
      clk         => Clocks_In.P_Clk,
      Reset       => Global_Reset,
      Data_in     => Data_in,
      valid_numsp => valid_numsp,
      valid_dst   => valid_dst,
      addr_cnt    => addr_cnt,
      row_cnt     => row_cnt,
      ext_node    => ext_node,
      rmem_read   => rmem_read,
      rmem_write  => rmem_write,
      ad_reg_en   => ad_reg_en,
      ad_reg_clr  => ad_reg_clr,
      row_zero    => row_zero,
      mem_addr    => mem_addr,
      read_mem    => read_mem,
      write_mem   => write_mem,
      numsp_1     => numsp_1,
      numsp       => numsp,
      avg_dst     => avg_dst,
      trout       => trout,
      valid_td    => valid_td,
      done        => done
    );

  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  --@@
  --@@ Data_in and valid_numsp assignment
  --@@
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  valid_numsp <= val_nsp_d and not val_nsp_dd;

  process ( Global_Reset, Clocks_In.P_Clk )
  begin
    if ( Global_Reset = '1' ) then
      numsp <= (others => '0');
      val_nsp_d <= '0';
      val_nsp_dd <= '0';
    elsif ( rising_edge ( Clocks_In.P_Clk ) ) then
      val_nsp_d <= '0';
      if (LAD_Regs(0).Strobe = '1') then
        numsp <= LAD_Regs(0).Data_In(9 downto 0);
        val_nsp_d <= '1';
      end if;
      val_nsp_dd <= val_nsp_d;
    end if;
  end process;
  process ( Global_Reset, Clocks_In.K_Clk )
  begin
    if ( Global_Reset = '1' ) then
      done_d <= '0';
    elsif ( rising_edge ( Clocks_In.K_Clk ) ) then
      done_d <= done;
    end if;
  end process;

  ---------------------------------------------------------
  --
  -- Data_in
  --
  ---------------------------------------------------------
  process ( Global_Reset, Clocks_In.P_Clk )
  begin
    if ( Global_Reset = '1' ) then
      valid_dst <= '0';
      Data_in <= (others => '0');
    elsif ( rising_edge ( Clocks_In.P_Clk ) ) then
      if (Left_Mem_Mux(1).Data_Valid = '1') then
        valid_dst <= Left_Mem_Mux(1).Data_Valid;
        Data_in <= Left_Mem_Mux(1).Data_In;
      end if;
    end if;
  end process;

  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  --@@
  --@@ Store the done signal in PE registers
  --@@ for the host to poll and read
  --@@
  --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
process ( Global_Reset, Clocks_In.K_Clk )
begin
  if ( Global_Reset = '1' ) then
    LAD_Regs(1).Data_out <= (others => '0');
  elsif ( rising_edge ( Clocks_In.K_Clk ) ) then
    if (done_d = '1') then
      LAD_Regs(1).Data_out <= "00000000000000000000000000000001";
    end if;
  end if;
end process;

--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--@@
--@@ Address modification component instantiation
--@@
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
U_addr_mod : addr_mod port map (Clocks_In.P_Clk, Global_Reset,
addr_cnt, row_cnt, mem_addr,
numsp, numsp_1, ext_node,
rmem_read, rmem_write, ad_reg_en, ad_reg_clr,
row_zero, addr_modif );

--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--@@
--@@ Assign left_mem_data and right_mem_data
--@@ with data from UPGMA design - avg_dst and
--@@ tree_data respectively
--@@
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
process ( Global_Reset, Clocks_In.M_Clk )
  variable temp : integer;
begin
  if ( Global_Reset = '1' ) then
    left_mem_data_Addr     <= (others => '0');
    left_mem_data_Write    <= '0';
    left_mem_data_Data_Out <= (others => '0');
    left_mem_data_Req      <= '0';
    temp := 0;
  elsif ( rising_edge ( Clocks_In.M_Clk ) ) then
    temp := std_logic_vector_to_integer(addr_modif);
    left_mem_data_Addr     <= integer_to_std_logic_vector(temp, 31);
    left_mem_data_Write    <= write_mem;
    left_mem_data_Data_Out <= avg_dst;
    left_mem_data_Req      <= (write_mem or read_mem);
  end if;
end process;

process ( Global_Reset, Clocks_In.M_Clk )
begin
  if ( Global_Reset = '1' ) then
    right_mem_data_Addr     <= (others => '0');
    right_mem_data_Write    <= '0';
    right_mem_data_Data_Out <= (others => '0');
    right_mem_data_Req      <= '0';
  elsif ( rising_edge ( Clocks_In.M_Clk ) ) then
    right_mem_data_Addr     <= "0000000000000000000000" & trout(35 downto 26);
    right_mem_data_Write    <= valid_td;
    right_mem_data_Data_Out <= trout(35 downto 16) & trout(11 downto 0);
    right_mem_data_Req      <= valid_td;
  end if;
end process;

--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--@@
--@@ Assign left_mem_mux(1) and right_mem_mux(1)
--@@ with left_mem_data and right_mem_data
--@@ respectively
--@@
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Left_Mem_Mux(1).Addr      <= left_mem_data_Addr;
Left_Mem_Mux(1).Write     <= left_mem_data_Write;
Left_Mem_Mux(1).Data_Out  <= left_mem_data_Data_Out;  -- when Left_Mem_Mux(1).Akk = '0' else Left_Mem_Mux(1).Data_out;
Left_Mem_Mux(1).Req       <= left_mem_data_Req;
Right_Mem_Mux(1).Addr     <= Right_mem_data_Addr;
Right_Mem_Mux(1).Write    <= Right_mem_data_Write;
Right_Mem_Mux(1).Data_Out <= Right_mem_data_Data_Out;  -- when Right_Mem_Mux(1).Akk = '0' else Right_Mem_Mux(1).Data_out;
Right_Mem_Mux(1).Req      <= Right_mem_data_Req;

-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
-- --@@
-- --@@ LEFT I/O CONNECTOR INTERFACE : The following
-- --@@ component provides an interface to the left I/O
-- --@@ connector on the WILDCARD(tm). Uncomment the
-- --@@ interface below to use the left I/O connector.
-- --@@
-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--
-- U_Left_IO : IO_Conn_Std_IF
--   port map
--   (
--     Pads     => Pads.Left_IO,
--     User_In  => Left_IO_In,
--     User_Out => Left_IO_Out
--   );
--
-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
-- --@@
-- --@@ RIGHT I/O CONNECTOR INTERFACE : The following
-- --@@ component provides an interface to the right I/O
-- --@@ connector on the WILDCARD(tm). Uncomment the
-- --@@ interface below to use the right I/O connector.
-- --@@
-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--
-- U_Right_IO : IO_Conn_Std_IF
--   port map
--   (
--     Pads     => Pads.Right_IO,
--     User_In  => Right_IO_In,
--     User_Out => Right_IO_Out
--   );
--
-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
-- --@@
-- --@@ AUDIO INTERFACE : Uncomment the following
-- --@@ interface to use the audio port.
-- --@@
-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--
-- U_Audio : Audio_Std_IF
--   port map
--   (
--     Clk          => Clocks_In.K_Clk,
--     Global_Reset => Global_Reset,
--     Pads         => Pads.Audio,
--     User_Out     => Audio_Out
--   );
----------------------------------------------------------------------
-- NOTE : The following line must remain in all designs
--        to ensure that all of the PE pads are driven.
----------------------------------------------------------------------
Init_PE_Pads ( Pads );
end architecture;
------------------------------------------------------
-- 32-bit Register entity - architecture
------------------------------------------------------
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : reg.vhd
-- Entity       : reg
-- Architecture : reg_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity reg is
  port(
    data   : in  std_logic_vector(31 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(31 downto 0)
  );
end reg;

architecture reg_beh of reg is
begin
  process(clk, reset)
  begin
    if ( reset = '1' ) then
      regval <= (others => '0');
    elsif ( rising_edge( clk ) ) then
      if regclr = '1' then
        regval <= (others => '0');
      elsif regen = '1' then
        regval <= data;
      end if;
    end if;
  end process;
end reg_beh;
------------------------------------------------------
-- 10-bit Register entity - architecture
------------------------------------------------------
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : reg2.vhd
-- Entity       : reg_2
-- Architecture : reg2_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity reg_2 is
  port(
    data   : in  std_logic_vector(9 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(9 downto 0)
  );
end reg_2;

architecture reg2_beh of reg_2 is
begin
  process(clk, reset)
  begin
    if ( reset = '1' ) then
      regval <= (others => '0');
    elsif ( rising_edge( clk ) ) then
      if regclr = '1' then
        regval <= (others => '0');
      elsif regen = '1' then
        regval <= data;
      end if;
    end if;
  end process;
end reg2_beh;
------------------------------------------------------
-- 16-bit Register entity - architecture
------------------------------------------------------
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : reg3.vhd
-- Entity       : reg_3
-- Architecture : reg3_beh
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity reg_3 is
  port(
    data   : in  std_logic_vector(15 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(15 downto 0)
  );
end reg_3;

architecture reg3_beh of reg_3 is
begin
  process(clk, reset)
  begin
    if ( reset = '1' ) then
      regval <= (others => '0');
    elsif ( rising_edge( clk ) ) then
      if regclr = '1' then
        regval <= (others => '0');
      elsif regen = '1' then
        regval <= data;
      end if;
    end if;
  end process;
end reg3_beh;
------------------------------------------------------
-- Top Component UPGMA entity - architecture
------------------------------------------------------
------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : upgma_top.vhd
-- Entity       : upgma_top
-- Architecture : upgma_struct
------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use work.std_logic_prims.all;

entity upgma_top is
  port(
    clk         : in  std_logic;
    Reset       : in  std_logic;
    Data_in     : in  std_logic_vector(31 downto 0);
    valid_numsp : in  std_logic;
    valid_dst   : in  std_logic;
    addr_cnt    : out std_logic_vector(9 downto 0);
    row_cnt     : out std_logic_vector(9 downto 0);
    ext_node    : out std_logic;
    rmem_read   : out std_logic;
    rmem_write  : out std_logic;
    ad_reg_en   : out std_logic;
    ad_reg_clr  : out std_logic;
    row_zero    : out std_logic;
    mem_addr    : out std_logic_vector(19 downto 0);
    read_mem    : out std_logic;
    write_mem   : out std_logic;
    numsp_1     : out std_logic_vector(9 downto 0);
    numsp       : in  std_logic_vector(9 downto 0);
    avg_dst     : out std_logic_vector(31 downto 0);
    trout       : out std_logic_vector(36 downto 0);
    valid_td    : out std_logic;
    done        : out std_logic
  );
end upgma_top;
architecture upgma_struct of upgma_top is
component adder
  port(
    Datainp1 : in  std_logic_vector(31 downto 0);
    Datainp2 : in  std_logic_vector(31 downto 0);
    Data_out : out std_logic_vector(31 downto 0)
  );
end component;

component adder_w
  port(
    Datainp1 : in  std_logic_vector(15 downto 0);
    Datainp2 : in  std_logic_vector(15 downto 0);
    Data_out : out std_logic_vector(15 downto 0)
  );
end component;

component adderwreg
  port(
    addout : in  std_logic_vector(15 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(15 downto 0)
  );
end component;
component ctrl_blk
  port(
    clk, reset, valid_numsp, addr_grt          : in  std_logic;
    child_cnt_gr, count_gr, all_nodes_done     : in  std_logic;
    initialized, div_valid, a_grt, ext_node    : in  std_logic;
    r_clr, r_inc, a_clr, a_inc                 : out std_logic;
    c2_read, mem_update, R_dec, Rp_dec         : out std_logic;
    c1_incr, c2_incr, c1p_incr, ch_incr        : out std_logic;
    c1_load1, c1_load2, c2_load1, c2_load2     : out std_logic;
    c1p_load, ch_load                          : out std_logic;
    c1_clr, c2_clr, c1p_clr, ch_clr            : out std_logic;
    row_col_sel                                : out std_logic_vector(1 downto 0);
    addr2_reg_dec, node_write                  : out std_logic;
    read_mem, write_mem, read_wmem, write_wmem : out std_logic;
    rowreg_clr, colreg_clr, distreg_clr        : out std_logic;
    mulreg_en, mulreg_clr                      : out std_logic;
    addregwclr, addregwen, addregclr, addregen : out std_logic;
    divreg1clr, divreg1en                      : out std_logic;
    initial_run, store_cur_addr                : out std_logic;
    node_mem_initialize, mem_initialize        : out std_logic;
    addr_gen1_en, addr_gen2_en                 : out std_logic;
    rmem_read, rmem_write                      : out std_logic;
    ad_reg_en, ad_reg_clr, row_zero            : out std_logic;
    numsp_val, valid_td                        : out std_logic;
    nodeid_sel                                 : out std_logic_vector(1 downto 0);
    n_type_sel, incnt_inc, done                : out std_logic
  );
end component;
component opt_sel
  port(
    clk, reset, valid_numsp                    : in  std_logic;
    numsp, row_reg, col_reg                    : in  std_logic_vector(9 downto 0);
    store_cur_addr, node_mem_initialize        : in  std_logic;
    mem_initialize, addr_gen1_en, addr_gen2_en : in  std_logic;
    c2_read, mem_update                        : in  std_logic;
    addr1_reg_en, addr2_reg_en                 : in  std_logic;
    R_dec, Rp_dec                              : in  std_logic;
    c1_incr, c2_incr, c1p_incr, ch_incr        : in  std_logic;
    c1_load1, c1_load2, c2_load1, c2_load2     : in  std_logic;
    c1p_load, ch_load                          : in  std_logic;
    c1_clr, c2_clr, c1p_clr, ch_clr            : in  std_logic;
    row_col_sel                                : in  std_logic_vector(1 downto 0);
    addr2_reg_dec, node_write                  : in  std_logic;
    distreg_val                                : in  std_logic_vector(31 downto 0);
    nodeid_sel                                 : in  std_logic_vector(1 downto 0);
    n_type_sel, incnt_inc, initial_run         : in  std_logic;
    r_clr, r_inc, a_clr, a_inc                 : in  std_logic;
    a_grt, ext_node                            : out std_logic;
    numsp_1, addr_cnt, row_cnt                 : out std_logic_vector(9 downto 0);
    first_val, initialized, addr_grt           : out std_logic;
    child_cnt_gr, count_gr, all_nodes_done     : out std_logic;
    addr                                       : out std_logic_vector(19 downto 0);
    cnt1, nodeid                               : out std_logic_vector(9 downto 0);
    n_type                                     : out std_logic;
    par                                        : out std_logic_vector(9 downto 0);
    br_len                                     : out std_logic_vector(15 downto 0)
  );
end component;
component mult
  port(
    Datainp1 : in  std_logic_vector(31 downto 0);
    Datainp2 : in  std_logic_vector(15 downto 0);
    Data_out : out std_logic_vector(31 downto 0)
  );
end component;

component divider
  port(
    datainp1 : in  std_logic_vector(31 downto 0);
    divider  : in  std_logic_vector(15 downto 0);
    output   : out std_logic_vector(31 downto 0);
    valid    : out std_logic
  );
end component;

component comparedst
  port(
    Datainp1     : in  std_logic_vector(31 downto 0);
    valid_dst    : in  std_logic;
    distreg_val  : in  std_logic_vector(31 downto 0);
    first_val    : in  std_logic;
    addr         : in  std_logic_vector(19 downto 0);
    distreginp   : out std_logic_vector(31 downto 0);
    distreg_en   : out std_logic;
    rowreginp    : out std_logic_vector(9 downto 0);
    rowreg_en    : out std_logic;
    colreginp    : out std_logic_vector(9 downto 0);
    colreg_en    : out std_logic;
    addr1_reg_en : out std_logic;
    addr2_reg_en : out std_logic
  );
end component;

component numofspreg
  port(
    numsp    : in  std_logic_vector(9 downto 0);
    reset    : in  std_logic;
    clk      : in  std_logic;
    valid_in : in  std_logic;
    valid    : out std_logic;
    regval   : out std_logic_vector(9 downto 0)
  );
end component;

component wmemory
  port(
    Clk      : in  std_logic;
    Reset    : in  std_logic;
    Read     : in  std_logic;
    Write    : in  std_logic;
    numsp    : in  std_logic_vector(9 downto 0);
    Addr     : in  std_logic_vector(9 downto 0);
    Data     : in  std_logic_vector(15 downto 0);
    Data_out : out std_logic_vector(15 downto 0)
  );
end component;

component reg
  port(
    data   : in  std_logic_vector(31 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(31 downto 0)
  );
end component;

component reg_2
  port(
    data   : in  std_logic_vector(9 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(9 downto 0)
  );
end component;

component addr_dd
  port(
    addr      : in  std_logic_vector(19 downto 0);
    reset     : in  std_logic;
    clk       : in  std_logic;
    addr_dd_s : out std_logic_vector(19 downto 0)
  );
end component;

component first_val_ddd
  port(
    first_val       : in  std_logic;
    reset           : in  std_logic;
    clk             : in  std_logic;
    first_val_ddd_s : out std_logic
  );
end component;

component d_f
  port(
    inp   : in  std_logic;
    reset : in  std_logic;
    clk   : in  std_logic;
    opt   : out std_logic
  );
end component;
-- Multiplier and multiplier reg component signals
signal mult_out   : std_logic_vector(31 downto 0);
signal mulreg_val : std_logic_vector(31 downto 0);

-- Adder and adderreg component signals
signal adderout    : std_logic_vector(31 downto 0);
signal adderregval : std_logic_vector(31 downto 0);

-- Adderw and adderwreg component signals
signal adderw_out    : std_logic_vector(15 downto 0);
signal adderw_regval : std_logic_vector(15 downto 0);

-- Colreg component signals
signal Colregval : std_logic_vector(9 downto 0);

-- Controller component signals
signal store_cur_addr                      : std_logic;
signal a_clr, a_inc, r_clr, r_inc          : std_logic;
signal read_wmem, write_wmem               : std_logic;
signal rowreg_en, rowreg_clr               : std_logic;
signal colreg_en, colreg_clr               : std_logic;
signal distreg_clr, distreg_en             : std_logic;
signal mulreg_en, mulreg_clr               : std_logic;
signal adderw_valid                        : std_logic;
signal addregwclr, addregwen               : std_logic;
signal addregclr, addregen                 : std_logic;
signal divreg1en, divreg1clr               : std_logic;
signal initial_run, numsp_val              : std_logic;
signal node_mem_initialize, mem_initialize : std_logic;
signal addr_gen1_en, addr_gen2_en          : std_logic;
signal nodeid_sel                          : std_logic_vector(1 downto 0);
signal n_type_sel                          : std_logic;
signal spcnt_inc, incnt_inc                : std_logic;
signal node_sel_mem_en, parent_mem_en      : std_logic;
signal mem_update, c2_read                 : std_logic;
signal R_dec, Rp_dec                       : std_logic;
signal c1_incr, c2_incr, c1p_incr, ch_incr : std_logic;
signal c1_load1, c1_load2                  : std_logic;
signal c2_load1, c2_load2                  : std_logic;
signal c1p_load, ch_load                   : std_logic;
signal c1_clr, c2_clr, c1p_clr, ch_clr     : std_logic;
signal row_col_sel                         : std_logic_vector(1 downto 0);
signal addr2_reg_dec, node_write           : std_logic;

-- Optsel signals
signal a_grt          : std_logic;
signal ex_n           : std_logic;
signal initialized    : std_logic;
signal addr_grt       : std_logic;
signal child_cnt_gr   : std_logic;
signal count_gr       : std_logic;
signal all_nodes_done : std_logic;
signal addr           : std_logic_vector(19 downto 0);
signal cnt1           : std_logic_vector(9 downto 0);
signal first_val      : std_logic;
signal nodeid         : std_logic_vector(9 downto 0);
signal n_type         : std_logic;
signal par            : std_logic_vector(9 downto 0);
signal br_len         : std_logic_vector(15 downto 0);

-- Divider component signals
signal div_out   : std_logic_vector(31 downto 0);
signal div_valid : std_logic;

-- Divider reg 1 component signals
signal div_reg_val1 : std_logic_vector(31 downto 0);

-- Least Distance reg component signals
signal distreg_val : std_logic_vector(31 downto 0);

-- Comparedst component signals
signal distreginp   : std_logic_vector(31 downto 0);
signal rowreginp    : std_logic_vector(9 downto 0);
signal colreginp    : std_logic_vector(9 downto 0);
signal addr1_reg_en : std_logic;
signal addr2_reg_en : std_logic;

-- numspreg component signals
signal numspreg_val   : std_logic_vector(9 downto 0);
signal numspreg_valid : std_logic;

-- Rowreg component signals
signal Rowregval : std_logic_vector(9 downto 0);

-- Weight memory component signals
signal WData    : std_logic_vector(15 downto 0);
signal WAddr    : std_logic_vector(9 downto 0);
signal vone     : std_logic_vector(15 downto 0);
signal WData_in : std_logic_vector(15 downto 0);

-- Addr twice register
signal addr_dd_s : std_logic_vector(19 downto 0);

-- first_val registered thrice
signal first_val_ddd_s : std_logic;
begin
Mul : mult
  port map(Data_in, WData_in, mult_out);

mulreg : reg
  port map(mult_out, Reset, Clk, mulreg_en, mulreg_clr, mulreg_val);

U0 : adder
  port map(mulreg_val, adderregval, adderout);

adderreg : reg
  port map(adderout, Reset, Clk, addregen, addregclr, adderregval);

U011 : adder_w
  port map(WData_in, adderw_regval, adderw_out);

U012 : adderwreg
  port map(adderw_out, reset, clk, addregwen, addregwclr, adderw_regval);

Colreg : reg_2
  port map(Colreginp, Reset, Clk, colreg_en, colreg_clr, Colregval);
U3 : ctrl_blk
  port map(
    clk => clk, reset => reset, valid_numsp => valid_numsp,
    addr_grt => addr_grt, child_cnt_gr => child_cnt_gr, count_gr => count_gr,
    all_nodes_done => all_nodes_done, initialized => initialized,
    div_valid => div_valid, a_grt => a_grt, ext_node => ex_n,
    r_clr => r_clr, r_inc => r_inc, a_clr => a_clr, a_inc => a_inc,
    c2_read => c2_read, mem_update => mem_update,
    R_dec => R_dec, Rp_dec => Rp_dec,
    c1_incr => c1_incr, c2_incr => c2_incr,
    c1p_incr => c1p_incr, ch_incr => ch_incr,
    c1_load1 => c1_load1, c1_load2 => c1_load2,
    c2_load1 => c2_load1, c2_load2 => c2_load2,
    c1p_load => c1p_load, ch_load => ch_load,
    c1_clr => c1_clr, c2_clr => c2_clr, c1p_clr => c1p_clr, ch_clr => ch_clr,
    row_col_sel => row_col_sel, addr2_reg_dec => addr2_reg_dec,
    node_write => node_write,
    read_mem => read_mem, write_mem => write_mem,
    read_wmem => read_wmem, write_wmem => write_wmem,
    rowreg_clr => rowreg_clr, colreg_clr => colreg_clr,
    distreg_clr => distreg_clr,
    mulreg_en => mulreg_en, mulreg_clr => mulreg_clr,
    addregwclr => addregwclr, addregwen => addregwen,
    addregclr => addregclr, addregen => addregen,
    divreg1clr => divreg1clr, divreg1en => divreg1en,
    initial_run => initial_run, store_cur_addr => store_cur_addr,
    node_mem_initialize => node_mem_initialize,
    mem_initialize => mem_initialize,
    addr_gen1_en => addr_gen1_en, addr_gen2_en => addr_gen2_en,
    rmem_read => rmem_read, rmem_write => rmem_write,
    ad_reg_en => ad_reg_en, ad_reg_clr => ad_reg_clr,
    row_zero => row_zero, numsp_val => numsp_val, valid_td => valid_td,
    nodeid_sel => nodeid_sel, n_type_sel => n_type_sel,
    incnt_inc => incnt_inc, done => done
  );
U_Addr_dd : addr_dd
  port map(addr, reset, clk, addr_dd_s);

U_fv_ddd : first_val_ddd
  port map(first_val, reset, clk, first_val_ddd_s);

Opt_gen : opt_sel
  port map(clk, reset, valid_numsp, numsp, rowregval, colregval,
           store_cur_addr, node_mem_initialize, mem_initialize,
           addr_gen1_en, addr_gen2_en, c2_read, mem_update,
           addr1_reg_en, addr2_reg_en, R_dec, Rp_dec,
           c1_incr, c2_incr, c1p_incr, ch_incr,
           c1_load1, c1_load2, c2_load1, c2_load2, c1p_load, ch_load,
           c1_clr, c2_clr, c1p_clr, ch_clr,
           row_col_sel, addr2_reg_dec, node_write, distreg_val,
           nodeid_sel, n_type_sel, incnt_inc, initial_run,
           r_clr, r_inc, a_clr, a_inc,
           a_grt, ex_n, numsp_1, addr_cnt, row_cnt, first_val,
           initialized, addr_grt, child_cnt_gr, count_gr, all_nodes_done,
           addr, cnt1, nodeid, n_type, par, br_len);

U4 : divider
  port map(adderregval, adderw_regval, div_out, div_valid);

divregister1 : reg
  port map(div_out, Reset, Clk, divreg1en, divreg1clr, avg_dst);

leastdistreg : reg
  port map(distreginp, Reset, Clk, distreg_en, distreg_clr, distreg_val);

U15 : comparedst
  port map(Data_in, valid_dst, distreg_val, first_val_ddd_s, addr_dd_s,
           distreginp, distreg_en, rowreginp, rowreg_en,
           colreginp, colreg_en, addr1_reg_en, addr2_reg_en);

U10 : numofspreg
  port map(numsp, Reset, Clk, numsp_val, numspreg_valid, numspreg_val);

Rowreg : reg_2
  port map(Rowreginp, Reset, Clk, rowreg_en, rowreg_clr, Rowregval);

vone  <= "0000000000000001";
WData <= vone when node_mem_initialize = '1' else adderw_regval;
WAddr <= cnt1 when node_mem_initialize = '1' else addr(19 downto 10);

U12 : wmemory
  port map(Clk, Reset, read_wmem, write_wmem, numsp, WAddr, WData, WData_in);

trout    <= n_type & nodeid & par & br_len;
mem_addr <= addr;

U20 : d_f
  port map(ex_n, Reset, Clk, ext_node);

end upgma_struct;
APPENDIX B
CUSTOM COMPUTING MACHINE HOST PROGRAM SOURCE CODE
UPGMA_ex.h
#ifndef __UPGMATEST_H__
#define __UPGMATEST_H__

/****************************************************
 *
 * Constants and Macros
 *
 ****************************************************/

#define DEFAULT_VERBOSITY   ( FALSE )
#define DEFAULT_SLOT_NUMBER ( 0 )
#define DEFAULT_ITERATIONS  ( 1 )
#define DEFAULT_FREQUENCY   ( 100.0 )

#define IMAGE_FILENAME      ( "pe_addr_mod" )
#define IMAGE_FILENAME_REVD ( "pe_addr_mod" )

#define MEM_BASE         ( 0x0 )
#define LEFT_MEM_OFFSET  ( 0x1000 )
#define RIGHT_MEM_OFFSET ( 0x1200 )

#define MAX_ERR_COUNT ( 32 )

#define NUM_REGISTERS   ( 2 )
#define REGISTER_OFFSET ( 0x2000 )

typedef struct _TestInfo_
{
    WC_DeviceNum DeviceNum;
    WC_DevConfig DeviceCfg;
    WC_Version   Version;
    DWORD        dIterations;
    float        fClkFreq;
    BOOLEAN      bVerbose;
} WC_TestInfo;

/****************************************************
 *
 * Prototypes
 *
 ****************************************************/

WC_RetCode WC_UPGMATest_Main( WC_TestInfo *TestInfo );
WC_RetCode WC_UPGMATest_Init( WC_TestInfo *TestInfo );
WC_RetCode WC_UPGMATest_Run( WC_TestInfo *TestInfo );
WC_RetCode WC_UPGMATest_Shutdown( WC_TestInfo *TestInfo );
WC_RetCode VerifyData( DWORD ref[], DWORD test[], DWORD size );

#endif
UPGMA_ex.c
/****************************************************************************
 *
 * File      : UPGMAtest.c
 *
 * Project   : UPGMA on Wildcard
 *
 * Copyright : Sreesa Akella, Reconfigurable Computing Research Lab 2003
 *
 ****************************************************************************/
#include <stdio.h>
#include <time.h>
#include <math.h>
#if defined(WIN32)
#include <windows.h>
#endif
#include "wcdefs.h"
#include "wc_shared.h"
#include "UPGMA_ex.h"
#include "LAD_Mem_Bridge_WC.h"
/****************************************************************************
 *
 * Function    : main
 *
 * Description : Entry point for the WILDCARD test.
 *               This function is a basic entry point into the test.
 *               It is responsible for
 *               1) Parsing the command line parameters and filling the
 *                  TestInfo struct with those parameters
 *               2) Opening the WILDCARD(tm) board
 *               3) Calling the main example procedure
 *               4) Closing the board when the example completes
 *
 ****************************************************************************/
WC_RetCode
main( int argc, char *argv [] )
{
    WC_RetCode  rc = WC_SUCCESS;
    int         argi;
    WC_TestInfo TestInfo;
    WC_CardType CardType;
    char      **TestLoc = NULL;

    const char *help_string =
        "Usage: memtest <list of options>\n"
        "  Options:\n"
        "   -v        Sets verbose mode. Show progress messages.\n"
        "   -s <num>  Set WILDCARD(tm) device \"slot\" number (default = 0).\n"
        "   -i <num>  Sets the number of times to perform the example.\n"
        "             (default = 1)\n"
        "   -f <num>  Set the memory clock frequency in MHz (default = 100.0)\n"
        "   -h        Show this help.\n";
fprintf( stdout, "WILDCARD(tm) UPGMA_Test Example\n");
    TestInfo.bVerbose    = DEFAULT_VERBOSITY;
    TestInfo.DeviceNum   = DEFAULT_SLOT_NUMBER;
    TestInfo.dIterations = DEFAULT_ITERATIONS;
    TestInfo.fClkFreq    = DEFAULT_FREQUENCY;
    /* Parse the command line parameters */
    for ( argi = 1; argi < argc; argi++ )
    {
        if ( argv[argi][0] == '-' )
        {
            switch ( toupper(argv[argi][1]) )
            {
            case 'H': /* Print the help message */
                fprintf( stdout, "%s\n\n", help_string );
                return (WC_SUCCESS);
                break;

            case 'I': /* Set the number of iterations */
                argi++;
                TestInfo.dIterations = strtoul( argv[argi], TestLoc, 0 );
                /* Error check the result. The following test will be true
                 * only if there was an error in the string conversion above. */
                if (TestInfo.dIterations == 0)
                {
                    fprintf( stdout, "\nWARNING: An invalid or missing iteration value\n");
                    fprintf( stdout, "         was found after the -i option.\n\n");
                    fprintf( stdout, "%s\n\n", help_string );
                    return (ERROR_UNKNOWN_SWITCH);
                }
                fprintf( stdout, "Setting the iteration value to %d\n",
                         TestInfo.dIterations );
                break;

            case 'S': /* Set the device number */
                argi++;
                TestInfo.DeviceNum = strtoul( argv[argi], TestLoc, 0 );
                /* The following tests for a valid slot number */
                if (TestInfo.DeviceNum > WC_MAX_DEVICES)
                {
                    fprintf( stdout, "\n WARNING: Invalid device number!\n");
                    return (ERROR_UNKNOWN_SWITCH);
                }
                else
                {
                    fprintf( stdout, " Setting the device number to %d.\n",
                             TestInfo.DeviceNum );
                }
                break;

            case 'F': /* Set Frequency */
                argi++;
                if (argi < argc)
                {
                    TestInfo.fClkFreq = (float) atof( argv[argi] );
                }
                else
                {
                    printf( "\n WARNING: Invalid Frequency option\n" );
                    printf( "%s\n\n", help_string );
                    return (ERROR_UNKNOWN_SWITCH);
                }
                if (( TestInfo.fClkFreq < WC_MIN_FCLK_MHZ ) ||
                    ( TestInfo.fClkFreq > WC_MAX_FCLK_MHZ ))
                {
                    printf( "\n WARNING: %3.2f is an invalid Frequency option\n",
                            TestInfo.fClkFreq );
                    printf( "%s\n\n", help_string );
                    return (ERROR_UNKNOWN_SWITCH);
                }
                break;

            case 'V': /* Show all errors & set maximum verbosity */
                TestInfo.bVerbose = TRUE;
                fprintf( stdout, " Setting Maximum Verbosity.\n");
                break;

            default: /* Unknown switch option */
                fprintf( stderr, "\n WARNING: Unknown option: \"%s\"\n", argv[argi] );
                fprintf( stderr, "%s\n\n", help_string );
                return (ERROR_UNKNOWN_SWITCH);
            }
        }
        else /* Missing the '-' */
        {
            fprintf( stderr, "\n WARNING: Unknown option: \"%s\"\n", argv[argi] );
            fprintf( stderr, "%s\n\n", help_string );
            return (ERROR_UNKNOWN_SWITCH);
        }
    }
    /* The WILDCARD(tm) MUST be opened before doing any type of
     * access to the card. */
    if (TestInfo.bVerbose)
    {
        fprintf( stdout, "\n Opening Device %d...\n", TestInfo.DeviceNum );
    }

    rc = WC_Open( TestInfo.DeviceNum, 0 );
    DISPLAY_ERROR(rc);

    /* If you are using both the WILDCARD(tm) and the WILDCARD(tm)-II
     * it is a good idea to check the board type before executing any
     * calls. For this example we must have a WILDCARD(tm). */
    rc = WC_GetCardType( TestInfo.DeviceNum, &CardType );
    if (rc != WC_SUCCESS)
    {
        DisplayError(rc);
        return 0;
    }
    else if (CardType != WILDCARD)
    {
        printf("\nERROR : This example requires a WILDCARD(tm).\n"
               "        It will not run on WILDCARD(tm)-II!\n\n");
        return 0;
    }

    /* Once the board is successfully opened, the test may be run */
    rc = WC_UPGMATest_Main( &TestInfo );
    if (rc != WC_SUCCESS)
    {
        DisplayError(rc);
    }

    /* The WILDCARD(tm) should be closed when the program finishes to
     * free driver resources */
    rc = WC_Close( TestInfo.DeviceNum );
    DISPLAY_ERROR(rc);
    return (rc);
}
/****************************************************************************
 *
 * Function    : WC_UPGMATest_Main
 *
 * Parameters  : TestInfo - Test Parameters
 *
 * Description : Initializes the WILDCARD(tm) hardware, and runs the example
 *               TestInfo->dIterations times.
 *
 ****************************************************************************/
WC_RetCode
WC_UPGMATest_Main( WC_TestInfo *TestInfo )
{
    DWORD      dIteration, dErrorCount;
    WC_RetCode rc = WC_SUCCESS;

    /* Print out a few parameters so we know what we are running */
    if (TestInfo->bVerbose)
    {
        fprintf(stdout, "\n TEST PARAMETERS:\n");
        fprintf(stdout, "   Clock Frequency = %f\n", TestInfo->fClkFreq);
        fprintf(stdout, "   # of Iterations = %d\n", TestInfo->dIterations);
        fprintf(stdout, "   Device Number   = %d\n", TestInfo->DeviceNum);
        fprintf(stdout, "   Verbose Mode    = %s\n",
                TestInfo->bVerbose ? "TRUE" : "FALSE");
    }

    /* This routine will put the WILDCARD in a known state.
     * We only need to do this before the first iteration.
     * Each additional iteration only needs to reset the
     * PE to initialize the WILDCARD to a known state because
     * all initialization parameters are kept between resets. */
    rc = WC_UPGMATest_Init( TestInfo );
    //CHECK_RC(rc)

    /* Now that the PE is initialized, we run the test
     * TestInfo->dIterations times, counting the number of
     * failures as we go. */
    for (dIteration = 0, dErrorCount = 0;
         dIteration < TestInfo->dIterations; dIteration++)
    {
        fprintf(stdout, "\n **** Memory Example Iteration [%d] of [%d] ****\n",
                dIteration, TestInfo->dIterations);
        rc = WC_UPGMATest_Run(TestInfo);
        if (rc != WC_SUCCESS)
        {
            DisplayError(rc);
            dErrorCount++;
        }
    }

    /* Let the user know if the example was a success */
    fprintf(stdout, "\n Example Complete! [%d] of [%d] Successful",
            TestInfo->dIterations - dErrorCount, TestInfo->dIterations);

    if (dErrorCount)
    {
        fprintf(stdout, " ERRORS In Example!\n\n");
    }
    else
    {
        fprintf(stdout, " Example SUCCESSFUL!\n\n");
    }

    /* Return SUCCESS if we have made it this far without
     * returning. This means that no fatal errors have
     * occurred. If any test errors occurred, they have
     * already been printed above after each iteration. */
    return (WC_SUCCESS);
}
/****************************************************************************
 *
 * Function    : WC_UPGMATest_Run
 *
 * Parameters  : TestInfo - Test Parameters
 *
 * Description : Runs the Memory Example. The hardware for this example
 *               contains an image with a LAD_Mem_Bridge component for both
 *               the left and the right onboard memories. This gives the
 *               host indirect access to the onboard WILDCARD(tm) memories.
 *
 *               This example will write a random pattern to each of the
 *               memories, read it back, and verify that the read and
 *               write contents are equal.
 *
 ****************************************************************************/
WC_RetCode
WC_UPGMATest_Run( WC_TestInfo *TestInfo )
{
    WC_Mem_Object *Left_Memory, *Right_Memory;
    DWORD          dNumDwords, *pReadBuffer, *pWriteBuffer, index,
                   no_of_values, no_of_species, temp1, temp2,
                   *darray, *dataArray, addr, rem, *bin, value, j;
    BOOLEAN        bIntStatus, done;
    FILE          *f, *f1;
    //time_t tStart, tEnd;
    //double diffTime = 0.0;
    clock_t        tStartClk, tEndClk, tend_wr, tend_done;
    WC_RetCode     rc;
    /* The first step in an application is almost always to reset the   *
     * PE.  Although this is not needed for the first iteration of this *
     * example because WC_UPGMATest_Init has already reset the PE, we   *
     * need to do it again here because subsequent iterations need a    *
     * fresh PE reset.                                                  *
     *                                                                  *
     * One assumption made below is that the time between the two       *
     * WC_PeReset calls is sufficient to reset the PE.  In this         *
     * example, and in general, this is true.  The reset line need      *
     * only be high for at most one clock cycle of the longest period   *
     * clock.                                                           */
    fprintf(stdout, "\n  Resetting PE ... ");
    rc = WC_PeReset( TestInfo->DeviceNum, TRUE );
    CHECK_RC(rc);

    rc = WC_PeReset( TestInfo->DeviceNum, FALSE );
    CHECK_RC(rc);
    fprintf(stdout, "DONE\n");
    rc = WC_IntEnable( TestInfo->DeviceNum, TRUE );
    CHECK_RC(rc);

    /* Reset the interrupts */
    rc = WC_IntReset( TestInfo->DeviceNum );
    CHECK_RC(rc);
    /* Check to make sure that the interrupts are cleared */
    fprintf(stdout, "  Verifying Interrupts are cleared ... ");
    WC_IntQueryStatus( TestInfo->DeviceNum, &bIntStatus );
    if (bIntStatus)
    {
        fprintf(stdout, "ERROR\n\n  Interrupt ERROR : Interrupts were NOT Cleared\n");
        return (WC_ERR_INTERRUPT_TIMEOUT);  /* Not a good error code, but will work for now */
    }
    fprintf(stdout, "DONE\n");
    /* The simplest method of buffer verification is to have different  *
     * buffers for reading and writing.  Below we allocate and          *
     * initialize these buffers.                                        *
     *                                                                  *
     * First, however, we need to find the memory size so we know how   *
     * large a buffer to get.  This information is stored in the        *
     * device information structure filled by WC_UPGMATest_Init.        */

    /* The memory port sizes should be equal, but in case they aren't   *
     * we allocate buffers assuming the largest memory size.            */
    fprintf(stdout, "  Allocating Buffers ... ");

    if (TestInfo->DeviceCfg.MemoryDwords[0] >= TestInfo->DeviceCfg.MemoryDwords[1])
    {
        dNumDwords = TestInfo->DeviceCfg.MemoryDwords[0];
    }
    else
    {
        dNumDwords = TestInfo->DeviceCfg.MemoryDwords[1];
    }

    if (TestInfo->bVerbose)
        fprintf(stdout, "\n   * Allocating Read Buffer ... ");
    pReadBuffer = malloc(dNumDwords * sizeof(DWORD));
    if (!pReadBuffer)
        return (ERROR_MEMORY_ALLOC);
    memset(pReadBuffer, 0, dNumDwords * sizeof(DWORD));  /* clear the full buffer, in bytes */

    if (TestInfo->bVerbose)
        fprintf(stdout, "DONE\n   * Allocating Write Buffer ... ");
    pWriteBuffer = malloc(dNumDwords * sizeof(DWORD));
    if (!pWriteBuffer)
    {
        free(pReadBuffer);
        return (ERROR_MEMORY_ALLOC);
    }
    /* NOTE: The code below reads the input file and writes the test    *
     * data into the buffers.  dNumDwords is then set to the number of  *
     * words to be written, based on the number of taxa.  (For 4-taxon  *
     * data this is 6 words.)                                           */
    darray = malloc(dNumDwords * sizeof(DWORD));
    if (!darray)
        return (ERROR_MEMORY_ALLOC);
    memset(darray, 0, dNumDwords * sizeof(DWORD));

    dataArray = malloc(dNumDwords * sizeof(DWORD));
    if (!dataArray)
        return (ERROR_MEMORY_ALLOC);
    memset(dataArray, 0, dNumDwords * sizeof(DWORD));

    bin = malloc(10 * sizeof(DWORD));
    if (!bin)
        return (ERROR_MEMORY_ALLOC);
    memset(bin, 0, 10 * sizeof(DWORD));
    /* For 4-taxon data: no_of_values = 6; no_of_species = 4.           *
     * Read the distance data from the input file and put it in darray. */
    f = fopen("C:/akella/thesis/testdatagen/testdata/taxon16/WCprogram/testdata_16D2.txt", "r");
    if (f == NULL)
    {
        fprintf(stdout, "DONE\n  Open of test file failed.");
        return 1;
    }

    fscanf(f, "%d", &value);
    no_of_species = value;
    no_of_values  = (no_of_species * (no_of_species - 1)) / 2;
    /* printf("\nnoOfSpecies - %d, noOfValues - %d", no_of_species, no_of_values); */

    for (index = 0; index <= no_of_values; index++)
        darray[index] = 0;

    index = 0;
    while (fscanf(f, "%d", &value) != EOF)
    {
        darray[index] = value;
        index++;
    }
    fclose(f);

    /* Reading from darray and placing the data, in matrix format, into *
     * dataArray.  The pair (temp1, temp2) indexes the distance between *
     * taxa temp1 and temp2; the address is temp1 * 2^10 + temp2.       */
    /*
    temp1 = 0;
    temp2 = 0;
    for (index = 0; index < no_of_values; index++)
    {
        // read the file here and get one value
        // for 4 taxon data
        value = darray[index];

        if (temp2 < no_of_species - 1)
            temp2 = temp2 + 1;
        else
        {
            temp1 = temp1 + 1;
            temp2 = temp1 + 1;
        }

        for (j = 0; j < 10; j++)
        {
            bin[j] = 0;
        }

        addr = 0;
        rem  = temp1;
        for (j = 0; j < 10 && rem >= 1; j++)
        {
            bin[j] = rem % 2;
            rem    = rem / 2;
        }

        for (j = 0; j < 10; j++)
        {
            addr = addr + bin[j] * (pow(2, (10 + j)));
        }
        addr            = addr + temp2;
        dataArray[addr] = value;   // data read from file
    }
    */

    dNumDwords = no_of_values;
    for (index = 0; index < dNumDwords; index++)
    {
        pWriteBuffer[index] = darray[index];
    }
    /* Now we need to allocate and initialize the memory structures for *
     * the left and right memories.  The WC_Mem_Create function will    *
     * allocate the structure and fill it with the device number,       *
     * memory offset and flags.                                         *
     *                                                                  *
     * NOTE : LEFT_MEM_OFFSET and RIGHT_MEM_OFFSET refer to LAD bus     *
     * offsets, NOT memory addresses.  Specific memory addresses to     *
     * read and write are passed to the WC_Mem_Read and WC_Mem_Write    *
     * procedures.                                                      */
    if (TestInfo->bVerbose)
        fprintf(stdout, "DONE\n   * Allocating Left Memory Struct ... ");
    Left_Memory = WC_Mem_Create( TestInfo->DeviceNum, LEFT_MEM_OFFSET, 0 );
    if (!Left_Memory)
    {
        free(pReadBuffer);
        free(pWriteBuffer);
        return (ERROR_MEMORY_ALLOC);
    }

    if (TestInfo->bVerbose)
        fprintf(stdout, "DONE\n   * Allocating Right Memory Struct ... ");
    Right_Memory = WC_Mem_Create( TestInfo->DeviceNum, RIGHT_MEM_OFFSET, 0 );
    if (!Right_Memory)
    {
        free(pReadBuffer);
        free(pWriteBuffer);
        WC_Mem_Release(Left_Memory);
        return (ERROR_MEMORY_ALLOC);
    }
    printf("DONE\n");

    /* With the memory structures initialized, we can now read and      *
     * write to the memories.  First we will write to the LEFT memory.  */
    fprintf(stdout, "  Testing LEFT Memory ... ");

    /* time(&tStart); */
    tStartClk = clock();

    /* The following two calls, WC_Mem_Write and WC_Mem_Read, are       *
     * defined in LAD_Mem_Bridge.c.  They use a specific protocol to    *
     * interface with the LAD_Mem_Bridge component in the PE to read    *
     * and write data in memory.  See the documentation inside the      *
     * LAD_Mem_Bridge.c file for details of this memory protocol.       */
    if (TestInfo->bVerbose)
        fprintf(stdout, "\n   * Writing to memory ... ");
    rc = WC_Mem_Write(Left_Memory, MEM_BASE, dNumDwords, pWriteBuffer);
    if (rc != WC_SUCCESS)
    {
        free(pReadBuffer);
        free(pWriteBuffer);
        WC_Mem_Release(Left_Memory);
        WC_Mem_Release(Right_Memory);
        return (rc);
    }

    /* Write the number of taxa to the LAD register to start the UPGMA  *
     * design.                                                          */
    if (TestInfo->bVerbose)
        fprintf(stdout, "DONE\n   * Writing no of taxa to Registers ... ");
    rc = WC_PeRegWrite(TestInfo->DeviceNum, REGISTER_OFFSET, NUM_REGISTERS, &no_of_species);
    if (rc != WC_SUCCESS)
    {
        fprintf(stdout, "\n   * Writing the number of taxa failed!");
    }
    tend_wr = clock();

    value = tend_wr - tStartClk;
    fprintf(stdout, "DONE\nTime taken for Memory write is: %d", value);
    /* Read the register file until the done signal has been set. */
    /* printf("DONE\nThe value of darray[1] is %d", darray[1]); */
    done = FALSE;
    while (!done)
    {
        rc = WC_PeRegRead(TestInfo->DeviceNum, REGISTER_OFFSET, NUM_REGISTERS, darray);
        if (rc != WC_SUCCESS)
        {
            fprintf(stdout, "\n   * Could not read the register!");
        }
        /* printf("\nThe value of darray[1] is %d", darray[1]); */
        /*
        rc = WC_Mem_Read(Right_Memory, MEM_BASE, dNumDwords, pReadBuffer);
        if (rc != WC_SUCCESS)
        {
            free(pReadBuffer);
            free(pWriteBuffer);
            WC_Mem_Release(Left_Memory);
            WC_Mem_Release(Right_Memory);
            return (rc);
        }
        printf("\nThe value of rmem[0] is %d", pReadBuffer[0]);
        */
        if (darray[1] == 1)
        {
            tend_done = clock();
            done      = TRUE;
        }
        else
            done = FALSE;
    }

    /* tend_done = clock(); */
    value = (tend_done - tend_wr);
    /* fprintf(stdout, "\ntdone = %d, tendWrite= %d", tend_done, tend_wr); */
    fprintf(stdout, "\nTime taken for done signal is : %d", value);
    /*
    // Wait for the interrupt which indicates that the design has completed running.
    fprintf(stdout, "  Waiting for interrupt ... ");
    rc = WC_IntWait(TestInfo->DeviceNum, 1000 );
    CHECK_RC(rc);
    */
    /* Read the RIGHT memory after the design signals completion.       *
     * dNumDwords is set to the number of DWORDs to be read: a tree     *
     * over n taxa has 2n - 1 nodes (n leaves plus n - 1 internal).     */
    dNumDwords = (2 * no_of_species) - 1;
    if (TestInfo->bVerbose)
        fprintf(stdout, "\n   * Reading from memory ... ");
    rc = WC_Mem_Read(Right_Memory, MEM_BASE, dNumDwords, pReadBuffer);
    if (rc != WC_SUCCESS)
    {
        free(pReadBuffer);
        free(pWriteBuffer);
        WC_Mem_Release(Left_Memory);
        WC_Mem_Release(Right_Memory);
        return (rc);
    }

    /* time(&tEnd); */
    tEndClk = clock();
    /* diffTime = difftime(tEnd, tStart); */

    value = tEndClk - tend_done;
    fprintf(stdout, "DONE\nTime taken for memory read is: %d", value);

    value = (tEndClk - tStartClk);
    printf("\nTotal Time taken: %d", value);
    /* Print the tree data. */
    f1 = fopen("C:/akella/thesis/thesisOct03/OutputData/TreeData_16D1.txt", "w");
    if (f1)
    {
        for (index = 0; index < dNumDwords; index++)
            fprintf(f1, "\nTaxon - %d: %li", index, pReadBuffer[index]);
        fclose(f1);
    }

    free(pReadBuffer);
    free(pWriteBuffer);
    WC_Mem_Release(Left_Memory);
    WC_Mem_Release(Right_Memory);
    printf("\nDONE\n");
    return (rc);
}
/****************************************************************************
 *
 *  Function : WC_UPGMATest_Init
 *
 *  Notes    : This function puts the card into a known state before the
 *             test begins.  It is generally a bad idea to assume the
 *             state of the WILDCARD's hardware when a program starts.
 *             Previous programs can leave the hardware in an unknown
 *             state, and its state on power-on is undefined.  If an
 *             application requires a specific state of the hardware,
 *             explicitly set that state.
 *
 *             Before running any application the following steps
 *             should be performed in the order given.
 *
 *             1) Toggle Power
 *             2) Assert the processing element reset line
 *             3) Program the processing element
 *             4) Set the clock frequency
 *             5) Configure Interrupts
 *             6) Deassert the processing element reset line
 *
 ****************************************************************************/
WC_RetCode
WC_UPGMATest_Init( WC_TestInfo *TestInfo )
{
    WC_RetCode rc = WC_SUCCESS;
    /* A great deal of useful information is available from the ID      *
     * PROM on the WILDCARD(tm), including processing element part      *
     * type, memory size, speed grade, etc.  The two API calls,         *
     * WC_DeviceInformation and WC_GetVersion, are used to retrieve     *
     * that information.  The procedure DisplayConfiguration,           *
     * defined in wc_shared.c, displays this information to the         *
     * screen.                                                          *
     *                                                                  *
     * Below we use the API calls to store the WILDCARD(tm) device      *
     * and version information in the TestInfo struct for use later     *
     * in the example, as well as display the information if            *
     * verbosity is on.                                                 */
    rc = WC_DeviceInformation( TestInfo->DeviceNum, &(TestInfo->DeviceCfg) );
    CHECK_RC(rc);
    rc = WC_GetVersion( TestInfo->DeviceNum, &(TestInfo->Version) );
    CHECK_RC(rc);

    if (TestInfo->bVerbose)
    {
        rc = DisplayConfiguration(TestInfo->DeviceNum);
        CHECK_RC(rc);
    }
    /* It should NOT be assumed that the WILDCARD(tm) processing        *
     * element currently has power.  Below we toggle the power to       *
     * the processing element, leaving it ON for the remainder of       *
     * the example.                                                     */
    if (TestInfo->bVerbose)
    {
        fprintf(stdout, "  Toggling processing element's power...\n");
    }
    rc = WC_PeApplyPower( TestInfo->DeviceNum, FALSE );
    CHECK_RC(rc);
    rc = WC_PeApplyPower( TestInfo->DeviceNum, TRUE );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf(stdout, "  PE power turned on.\n");
    /* The WILDCARD(tm) has a dedicated reset line controlled by        *
     * the WC_PeReset API call.  In general it is advantageous          *
     * to have the PE in reset when it is being set up.  This           *
     * will prevent the design from starting execution until the        *
     * WILDCARD(tm) has been correctly initialized.                     *
     *                                                                  *
     * Below we assert the reset line and keep it asserted              *
     * until the processing element has been programmed, the            *
     * clock has been set, and interrupts have been initialized.        *
     *                                                                  *
     * If the Reset_STD_If has been instantiated in the VHDL,           *
     * this API call will set the signal 'Global_Reset' high.           */
    if (TestInfo->bVerbose)
        fprintf(stdout, "  Asserting PE Reset Line...\n");
    rc = WC_PeReset( TestInfo->DeviceNum, TRUE );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf(stdout, "  PE RESET line asserted.\n");
    /* As of the creation of this file there are 4 revisions of         *
     * the WILDCARD(tm) hardware (Revs A to D).  Below we use           *
     * the information in TestInfo->Version to determine the            *
     * revision of the card in this slot.                               */
    rc = WC_GetVersion( TestInfo->DeviceNum, &TestInfo->Version );
    CHECK_RC(rc);

    if (TestInfo->bVerbose)
        fprintf(stdout, "  Loading PE Image...\n");

    if (((TestInfo->Version.Hardware & WC_MAJOR_VER_MASK) >> WC_MAJOR_VER_SHIFT) == 4)
    {
        /* REV D WILDCARD(tm)                                           *
         *                                                              *
         * The ProgramPeFromFile procedure, found in ws_shared.c,       *
         * will append .\<PART TYPE>\<PACKAGE TYPE>\ to the             *
         * filename, and load that file into the processing element.    *
         * For a REV D this path will be                                *
         * .\XCV300E\PKG_BG352\<IMAGE_FILENAME_REVD>                    */
        rc = ProgramPeFromFile( TestInfo->DeviceNum, IMAGE_FILENAME_REVD );
        CHECK_RC(rc);
    }
    else
    {
        /* REV A-C WILDCARD(tm)                                         *
         *                                                              *
         * The ProgramPeFromFile procedure, found in ws_shared.c,       *
         * will append .\<PART TYPE>\<PACKAGE TYPE>\ to the             *
         * filename, and load that file into the processing element.    *
         *                                                              *
         * For a REV C this path will be                                *
         * .\XCV300E\PKG_BG352\<IMAGE_FILENAME>                         *
         *                                                              *
         * For REVs A or B this path will be                            *
         * .\XCV300\PKG_BG352\<IMAGE_FILENAME>                          */
        rc = ProgramPeFromFile( TestInfo->DeviceNum, IMAGE_FILENAME );
        CHECK_RC(rc);
    }

    if (TestInfo->bVerbose)
        fprintf(stdout, "  PE Image Loaded.\n");
    /* The WILDCARD(tm) has one on-board programmable oscillator.       *
     * WC_ClkSetFrequency sets the frequency of that clock.  We         *
     * always want to set the clock to the appropriate frequency        *
     * before running our application.                                  */
    if (TestInfo->bVerbose)
        fprintf(stdout, "  Initializing the clock to %f...\n", TestInfo->fClkFreq);
    rc = WC_ClkSetFrequency( TestInfo->DeviceNum, TestInfo->fClkFreq );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf(stdout, "  Clock initialized.\n");

    /* This application uses the PE interrupt line to generate an       *
     * interrupt to the host.  Interrupts must be enabled before we     *
     * can receive an interrupt from the PE.                            */
    if (TestInfo->bVerbose)
        fprintf(stdout, "  Masking PE Interrupt...\n");
    rc = WC_IntEnable( TestInfo->DeviceNum, TRUE );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf(stdout, "  PE Interrupt Masked.\n");
    /* The order of mask / reset may be important in some               *
     * circumstances.  In our case it is not.  We mask, then clear      *
     * anything that may have happened before the masking operation.    */
    if (TestInfo->bVerbose)
        fprintf(stdout, "  Resetting PE Interrupt...\n");
    rc = WC_IntReset( TestInfo->DeviceNum );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf(stdout, "  PE Interrupt Reset.\n");
    /* Lastly, we remove the PE from the RESET state.  When             *
     * the Reset_STD_IF is instantiated in the VHDL, this               *
     * will set the VHDL signal 'Global_Reset' low.                     */
    if (TestInfo->bVerbose)
        fprintf(stdout, "  De-asserting PE Reset Line...\n");
    rc = WC_PeReset( TestInfo->DeviceNum, FALSE );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf(stdout, "  PE RESET line de-asserted.\n");

    return (rc);
}