RUN-TIME CUSTOMIZATION OF
A SOFT-CORE CPU ON AN FPGA
A DISSERTATION SUBMITTED TO THE UNIVERSITY OF MANCHESTER
FOR THE DEGREE OF MASTER OF SCIENCE
IN THE FACULTY OF ENGINEERING AND PHYSICAL SCIENCES
2015
By
Rehab Abdullah Shendi
School of Computer Science
Contents
Abstract
Declaration
Copyright
Acknowledgements
Dedication
1 Introduction
1.1 Aim and Objectives
1.2 Report Outline
2 Background
2.1 Reconfigurable Computing
2.1.1 History
2.1.2 FPGA
2.1.3 Reconfiguration Hardware
2.1.4 Partial Reconfiguration
2.2 Microprocessor Architecture
2.2.1 RISC Microprocessor
2.2.2 Soft-Core Microprocessor
2.2.3 MIPS Architecture
2.3 Reconfigurable CPU Instruction Set Extensions
2.3.1 Custom Instructions in Hardware
2.3.2 Custom Instructions in Software
2.4 Design Considerations
2.5 Previous Work
2.5.1 Instruction Set Extension
2.5.2 Partial Reconfiguration
3 System Design and Methodology
3.1 System Development Methodology
3.2 Implementation Tools
3.2.1 Hardware Description Language
3.2.2 Xilinx ISE (Xilinx, 2013):
3.2.3 Cross compiler:
3.2.4 FPGA Platform:
3.2.5 GoAhead
3.3 System Design
3.3.1 System Definition and Scope
3.3.2 System Architecture and Components
4 Implementation
4.1 Baseline MIPS Soft-Core
4.2 Custom Instruction in Software
4.3 Configuration Controller Modules
4.4 Custom Instruction in Hardware
4.5 Challenges During Implementation
5 Testing, Results and Evaluation
5.1 Testing
5.2 Results
5.3 Evaluation
6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
Works Cited
Appendix A - MIPS CPU
Appendix B - Trap handler based on MUX
Appendix C - Trap handler based on ICAP
(Word count 16033)
List of Tables
Table 1 Configuration speeds with ICAP achievement (Hansen, Koch and Torresen, 2011).
Table 2 Type of MIPS instructions (Fritzell, 2013).
Table 3 Descriptions of using ICAP_SPARTAN6 Port (Xilinx Inc, 2015).
Table 4 Custom instructions’ address and ID.
Table 5 An example of bitstream for the IPROG command using ICAP (Xilinx Inc, 2015).
Table 6 Resource requirements for Configuration controller.
Table 7 Resource requirements for Custom modules.
Table 8 Comparison between Xilinx embedded processors and our soft-core, and their performance.
Table 9 Software requirements.
List of Figures
Figure 1 Classification of FPGAs (Koch, 2013).
Figure 2 Baseline model of partial reconfiguration (Koch, 2013).
Figure 3 Styles of reconfigurable modules placement. (a) Island style. (b) Slot style. (c) Grid style (Koch, 2013).
Figure 4 (a) A typical CPU. (b) A CPU extended with reconfigurable instructions (Koch, 2013).
Figure 5 Design and development tools (Minev and Kukenska, 2007).
Figure 6 The general approach of the system development stages (Soft, 2013).
Figure 7 A step-by-step design and implementation method.
Figure 8 First step, system overview.
Figure 9 Third step, system overview.
Figure 10 Fourth step, system overview.
Figure 11 Fifth step, system overview of the first approach.
Figure 12 Fifth step, system overview of the second approach (Xilinx, 2012).
Figure 13 Xilinx Spartan-6 LX16 FPGA platform (Nexys3™ Board Reference Manual, 2013).
Figure 14 The final system design.
Figure 15 The non-pipelined MIPS, showing the most important signals and logic (Fritzell, 2013).
Figure 16 ICAP Primitive (Xilinx Inc, 2015).
Figure 17 Custom Module Logic.
Figure 18 The Program Counter process overview, which consists of extra logic and flip-flops to handle branch and jump instructions (Fritzell, 2013).
Figure 19 Datapath for the multiplication, allowing two clock cycles for execution (Fritzell, 2013).
Figure 20 Adding Custom instruction in the compiler.
Figure 21 Trap Handler State Machine.
Figure 22 Custom Instruction (CI) acts as an extension of the ALU.
Figure 23 On-FPGA Communication for Custom Instructions.
Figure 24 Static implementation.
Figure 25 Partial Part: the example shows the implementation of the CRC instruction.
Figure 26 GoAhead GUI: the graphical user interface of the GoAhead tool.
Figure 27 GoAhead Script.
Figure 28 Test bench of the MIPS CPU and ROM. Panels a, b and c present one test bench showing different signals: (a) instruction encoding, decoding and ALU functionality; (b) program counter functionality; (c) branch delay and ROM functionality.
Figure 29 ModelSim simulation of CRC-32 Module.
Figure 30 ModelSim simulation of One Counter Module.
Figure 31 ModelSim simulation of Parity generation module.
Figure 32 ModelSim simulation of Leading Zero Counter Module.
Figure 33 ModelSim simulation of Mux based Trap Handler.
Figure 34 ModelSim simulation of ICAP based Trap Handler.
Abstract
RUN-TIME CUSTOMIZATION OF A SOFT-CORE CPU ON AN FPGA
Rehab Abdullah Shendi
A dissertation submitted to the University of Manchester for the degree of Master of Science, 2015
Customised soft-core processors, in which application-specific instructions are
integrated into the system as hardware, are increasingly used on Field
Programmable Gate Arrays (FPGAs). In particular, partial run-time
reconfiguration of FPGAs can be very beneficial for processors specialised for a
particular domain. This report addresses the design and implementation of a
customised soft-core MIPS processor on an FPGA, using partial reconfiguration
(PR) to achieve efficient resource use. A PR design flow helps the design fit
into a smaller device. Moreover, run-time reconfiguration could reduce the
impact of static power consumption. This is done through configurable custom
instructions implemented in hardware as an extension of the MIPS CPU. The aim of
this project is to investigate the PR of FPGAs for run-time adaptation of the
instruction set of a soft-core CPU, including the integration of custom
instructions and an exploration of the potential to use the MultiBoot feature
available in Xilinx FPGAs to carry out the PR process. The system is evaluated
and tested on a Nexys 3 development board featuring a Xilinx Spartan-6 FPGA.
With the help of a trap handler, the system loads reconfigurable custom
instructions dynamically when a user program running on the MIPS CPU calls
them. The results demonstrate that custom instructions in hardware can speed up
certain functions and save many instructions compared to a software
implementation of the same function. Implementing custom instructions in
hardware is feasible and worth exploring.
Declaration
No portion of the work referred to in this dissertation has been submitted in
support of an application for another degree or qualification of this or any other
university or other institute of learning.
Copyright
i. The author of this thesis (including any appendices and/or schedules to this
thesis) owns certain copyright or related rights in it (the “Copyright”) and s/he has
given The University of Manchester certain rights to use such Copyright, including
for administrative purposes.
ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic
copy, may be made only in accordance with the Copyright, Designs and Patents
Act 1988 (as amended) and regulations issued under it or, where appropriate, in
accordance with licensing agreements which the University has from time to time.
This page must form part of any such copies made.
iii. The ownership of certain Copyright, patents, designs, trademarks and other
intellectual property (the “Intellectual Property”) and any reproductions of copyright
works in the thesis, for example graphs and tables (“Reproductions”), which may
be described in this thesis, may not be owned by the author and may be owned by
third parties. Such Intellectual Property and Reproductions cannot and must not
be made available for use without the prior written permission of the owner(s) of
the relevant Intellectual Property and/or Reproductions.
iv. Further information on the conditions under which disclosure, publication and
commercialisation of this thesis, the Copyright and any Intellectual Property and/or
Reproductions described in it may take place is available in the University IP
Policy (see http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=487), in
any relevant Thesis restriction declarations deposited in the University Library,
The University Library’s regulations (see
http://www.manchester.ac.uk/library/aboutus/regulations) and in The University’s
policy on presentation of Theses.
Acknowledgements
I would like to thank my supervisor, Dirk Koch, for giving me the opportunity to
work in my favourite, and dream field in Computer Sciences: Computer System
Engineering. His remarkable teaching and coaching strategies enabled me to give
my best from the first day; without him this dream would not have been realised.
Special thanks also go to my parents, sisters and my small family - Fahad, Qusai
and Retal - for their help and encouragement during my studies. Thanks also to
my friends for their support.
Dedication
From my heart to my brother—you are still here in my heart and mind. I miss you
always, my best friend.
Chapter 1
1 Introduction
Field Programmable Gate Arrays (FPGAs) have become popular over the last
decade as they allow designers to create complex digital designs at a low
implementation cost. Application Specific Integrated Circuits (ASICs), in
contrast, incur a high initial cost and require a large amount of resources to
create complex designs.
Modern FPGAs now occupy central positions in industry because of their capacity
for over 1000 multipliers, megabytes of on-chip memory, hundreds of thousands of
logic cells and clock speeds of up to half a gigahertz. Moreover, the cost per
function in FPGAs decreases significantly over time (Koch, 2013).
Partial Reconfiguration (PR) is one of the most important features of modern
FPGAs provided by the FPGA vendor Xilinx. It allows modules running on an
FPGA to be dynamically reconfigured and swapped during execution while the
remaining modules continue operating. PR is an active research topic in the
Reconfigurable Computing and Adaptive Hardware field. FPGAs are less efficient
in area, power and speed than ASICs; however, a system in which all or parts of
the hardware are reconfigured at run-time can be made more efficient than a
static system.
Extending a soft-core instruction set with user-defined instructions that speed
up the execution of an application in a specific domain can benefit greatly from
PR. Such benefits include integrating reconfigurable modules of different sizes
into the system, placing them on the FPGA at run-time, and letting them
communicate efficiently with the rest of the system without additional delay.
In this project, a MIPS soft-core is extended with a user-defined instruction set
with the help of PR. The aim of this project is to explore the efficient use of
partial run-time reconfiguration with a library of CPU instruction set
extensions.
This chapter presents basic information about the project. Section 1.1 describes
the aims and objectives of the project, and section 1.2 presents the report outline.
1.1 Aim and Objectives
The aim of this project is to investigate Partial Reconfiguration (PR) of FPGAs for
run-time adaptation of the instruction set of a soft-core CPU, including the
integration of custom instructions. To this end, the report presents a practical
introduction to designing a soft-core processor with extensions, integrating the
system step by step for partial reconfiguration using the GoAhead tool flow. The
GoAhead tool supports all recent Xilinx FPGAs and includes some features that are
not available in the other PR tools provided by the FPGA vendor Xilinx (Beckhoff, et
al., 2012), as will be introduced in chapter 3.
The objective of this project is to investigate a custom instruction module library
that offers low latency, low implementation cost in terms of logic resources, and
high CPU clock cycle savings compared to software-only implementations.
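To make the comparison with software-only implementations concrete, the sketch below counts the set bits of a 32-bit word one bit at a time, the kind of routine that a single custom instruction (such as the ones-counter module listed among the figures) could replace. It is an illustration only, not code from this project: the function name is hypothetical, and the loop iteration count is merely a rough stand-in for executed instructions, not a measured cycle count.

```python
def count_ones_software(word):
    """Count set bits in a 32-bit word one bit at a time.

    Returns (ones, loop_iterations). The iteration count is only a rough
    proxy for the work a software-only routine performs; it is not a
    measurement taken on the MIPS soft-core.
    """
    word &= 0xFFFFFFFF  # restrict to 32 bits
    ones = 0
    iterations = 0
    while word:
        ones += word & 1   # test the least significant bit
        word >>= 1         # move on to the next bit
        iterations += 1
    return ones, iterations


ones, iterations = count_ones_software(0xF0)
print(ones, iterations)
```

A hardware implementation would produce the same count without iterating, which is where the clock cycle savings are expected to come from.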
• Learning Objectives
– Investigate and understand the concept of reconfiguration hardware.
– Review how custom instructions can be applied as an extension of the soft-
core.
– Investigate and understand the concept of PR.
– Investigate and understand the topic of reconfiguration MultiBoot and its
potential for use with PR.
• Deliverable Objectives
– Develop and implement custom instructions as an extension of a given soft-
core on an FPGA.
– Understand and implement reconfigurable custom instructions for a soft-core
on an FPGA.
– Analyse previous results and establish a performance concept.
1.2 Report Outline
Chapter 2: Background
This chapter will provide an overview of the relevant literature and related works
as an introduction to reconfigurable hardware and FPGA architecture. PR
concepts and details regarding the reconfiguration of FPGA devices will be
included. Finally, microprocessor architecture, with a focus on MIPS and
reconfigurable instruction set extensions, will be introduced.
Chapter 3: System design and methodology
This chapter introduces the system methodology considered for this project. The
whole system used in the project, including the MIPS CPU and the peripheral
components (memory, GPIO, ROM, and trap handler) connected by the system
bus will be presented.
Chapter 4: System implementation
This chapter discusses the implementation of the final system’s components and
all related technical issues.
Chapter 5: Testing, results and evaluation
This chapter presents the tests conducted in this study, the results of these tests
and an overall evaluation of the system.
Chapter 6: Conclusion and further work
This chapter summarises the report and presents recommendations for further
improvement of the implemented system.
Appendix
Three appendices have been included:
Appendix A contains the VHDL-code for the MIPS CPU
Appendix B contains the VHDL-code for the trap multiplexer
Appendix C contains the VHDL-code for the trap handler.
Chapter 2
2 Background
Three areas are dealt with in the background research. Firstly, the general area of
reconfigurable computing, including FPGA architecture, is discussed. Then, in the
second part, microprocessor architectures are discussed. Finally, the third part
looks at the specific area of this project.
2.1 Reconfigurable Computing
Reconfigurable computing is a computing paradigm that combines the flexibility of
software with high hardware processing performance through the use of flexible
high speed fabrics such as FPGAs. Reconfigurable computing provides the ability
to make substantial changes to the data path in addition to the control flow.
Additionally, reconfigurable computing is able to adapt the underlying hardware
during run-time by providing the option to load a new circuit on the
reconfigurable fabric (Koch, 2013).
2.1.1 History
According to Bobda (2008), the history of reconfigurable computing can be traced
back to the 1960s, when Gerald Estrin proposed a computer architecture made up of
a standard processor combined with an array of reconfigurable hardware. The core
processor was used to control the behaviour of the reconfigurable hardware. Such
a design was later adjusted to perform other tasks such as image processing
(Lysaght & Subrahmanyam, 2005). These adjustments could be performed whenever
the need arose and led to the development of a hybrid computer structure that
combined the flexibility of software with the speed of hardware.
Since then, reconfigurable computing has improved as many architectures have been
developed by industry. Designs that have been introduced to the market include
Copacobana, Elixent, Silicon Hive and PiCoGA. The first commercial computer based
on a reconfigurable architecture was released in 1991 by Algotronix. This
architecture was later adopted by Xilinx, which acquired Algotronix to improve it
for commercial purposes (Algotronix.com, 2015).
2.1.2 FPGA
Field Programmable Gate Array (FPGA) technology has recently gained a lot of
popularity for prototyping and for producing products in small and moderate
quantities. An FPGA is a special kind of Programmable Logic Device (PLD) that
allows the implementation of general digital circuits, limited only by the
circuit size. The circuit to be implemented is defined by programming the device.
The capabilities of FPGAs have grown over the years and today a whole
multiprocessor system can fit on a single device. The complex circuit designs
needed for such devices are normally specified with the help of Hardware
Description Languages (HDLs). HDLs are preferred for this type of application
because they support circuit description with high-level language constructs.
FPGAs are comprised of a chip full of digital logic which allows for programmable
connections between components. FPGA design tools are used to generate
configuration files that contain the initial values and the required connections
which can then be downloaded to the FPGA. The key feature of FPGAs lies in the
fact that their design is completely soft and that it can be reprogrammed. However,
this also means that if power is removed from them, they will lose their
configuration. As such, they will require reprogramming in order to create another
working design (Balwaik, et al., 2013).
The history of FPGAs dates from the late 1980s, with the increasing interest in
extending the functionality of large Programmable Logic Arrays (PLAs) that were
being further developed (Bobda, 2008). The early 1990s witnessed the increased
use of FPGAs in the networking and telecommunication industry due to their
increased flexibility. At that time, they were preferred because it was possible to
separate the development stage and hardware design from the logic design stage.
As such, they were seen as helping vendors to engineer solutions without
spending a lot of time designing the logic, as was the case with Application
Specific Integrated Circuits (ASICs) (Parvez and Mehrez, 2011).
Soft-core processors form part of the background to this study. A soft-core
processor is a microprocessor that is completely described in a Hardware
Description Language and synthesized for FPGAs. A soft-core processor designed
for an FPGA is considered flexible because it can be readjusted by reprogramming
the device. This is not possible with many other kinds of programmable hardware.
Traditionally, such systems were developed using ASIC technology; however, ASICs
are not designed to allow reconfiguration. FPGAs have been demonstrated to create
very powerful and highly performing systems because of their reprogramming feature
(Musoll, 2010).
One limitation of FPGAs is that very few details of the low level implementation
process are available to the end users (e.g. the encoding of the configuration
data). Sufficient information about the choices made during the development
process of FPGA technology is not often provided.
FPGA Architecture
FPGAs can implement arbitrary user logic. There are three main resources
available in FPGAs: 1) logic blocks, 2) I/O blocks and 3) programmable
interconnections.
Logic blocks
FPGA logic blocks consist of look-up tables (LUTs) and flip-flops (FFs). Each logic
block can implement small functions of several variables. The FPGA implements
Boolean logic with the help of the LUTs, which are the basic elements of the FPGA
architecture and can be programmed with any logic function that fits into them. A
Boolean function is normally represented by a truth table stored in static random
access memory (SRAM) cells. LUTs are characterised by their number of inputs;
those with n inputs are referred to as n-LUTs (Munden, 2005). As such, an n-LUT
is essentially a multiplexer whose data inputs are the 2^n bits held in the
configuration memory and whose select lines are the n input signals; the selected
bit is forwarded to the output signal line.
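The view of an n-LUT as a truth table read out by a multiplexer can be sketched in software. The model below is a minimal illustration under simplifying assumptions: the class and method names are hypothetical, the Python list merely stands in for the SRAM configuration cells, and the bit ordering (first input as least significant select bit) is an arbitrary choice, not that of any particular device.

```python
from itertools import product


class LUT:
    """Toy n-input look-up table: 2**n configuration bits plus a selector."""

    def __init__(self, n_inputs, config_bits):
        if len(config_bits) != 2 ** n_inputs:
            raise ValueError("an n-LUT needs exactly 2**n configuration bits")
        self.n_inputs = n_inputs
        self.config_bits = list(config_bits)  # stands in for the SRAM cells

    @staticmethod
    def _address(inputs):
        # inputs[0] is treated as the least significant select bit (assumption)
        return sum(bit << i for i, bit in enumerate(inputs))

    @classmethod
    def from_function(cls, n_inputs, func):
        """Fill the truth table by evaluating func on every input combination."""
        bits = [0] * (2 ** n_inputs)
        for inputs in product((0, 1), repeat=n_inputs):
            bits[cls._address(inputs)] = int(bool(func(*inputs)))
        return cls(n_inputs, bits)

    def evaluate(self, *inputs):
        """Act as a 2**n-to-1 multiplexer over the stored configuration bits."""
        if len(inputs) != self.n_inputs:
            raise ValueError("wrong number of inputs")
        return self.config_bits[self._address(inputs)]


# "Programming" two 3-LUTs with different Boolean functions
xor3 = LUT.from_function(3, lambda a, b, c: a ^ b ^ c)
majority3 = LUT.from_function(3, lambda a, b, c: (a + b + c) >= 2)
print(xor3.config_bits, xor3.evaluate(1, 1, 1), majority3.evaluate(1, 0, 0))
```

Changing the function means changing only the stored bits; the multiplexer structure stays the same, which is what makes the LUT reprogrammable.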
LUT outputs are normally linked to a state flip-flop, which stores the current
state of the synchronous circuit. Practical look-up tables provide additional
features that vary among different families of FPGAs. These features include
distributed memory modes, the option to combine adjacent LUTs into larger LUTs
that have more inputs, and fast ripple-carry chain logic (Pedroni, 2010). In other
words, LUTs are combined with configurable registers and multiplexers to produce
a logic cell. The logic cell is the basic building block of the FPGA fabric, in
that all logic not mapped to special blocks like DSPs, CPUs or BRAMs is
implemented in logic cells. Recent Xilinx FPGAs, for example, combine four logic
cells into a slice and two slices into a configurable logic block (CLB). Slices can
contain logic beyond the basic logic cells to implement fast carry chains, shift
registers and distributed RAM: dedicated signals and logic between slices in the
same column propagate signals through many slices, which removes the need for
routing through the interconnect.
I/O blocks
I/O blocks are used to connect the internal logic to the outside pins. I/O blocks are
bidirectional, meaning they can either be used as inputs or outputs depending on
the actual configuration. Different pins may be configured to different standards if
the underlying device can support more than one I/O standard (Munden, 2005).
Programmable interconnection
Programmable interconnections are used to connect different logic blocks. The
interconnections between FPGA logic blocks may be programmed in three ways:
via SRAM cells, FLASH/electrically erasable programmable read-only memory
(EEPROM) or antifuses. These hold the configuration that defines the Boolean
functions and controls the configured routing.
The majority of FPGAs use SRAM-based programmable interconnects. The SRAM
cells drive pass transistors, tri-state buffers and multiplexers. SRAM is a volatile
memory technology and needs to be programmed from an external memory each
time power is applied to the device. During reconfiguration, these SRAM cells are
overwritten with new functions. FLASH, which is based on EEPROM
technology, is non-volatile and retains configuration data when power is
removed from the device. Antifuse-based programmable interconnects create
permanent connections in the configuration cells. Like FLASH, these
interconnects are non-volatile; unlike FLASH, they may only be programmed a
single time, after which point the configuration process cannot be redone.
The above architectures indicate the complex programming capabilities of FPGAs
and may account for some of the problems involved in FPGA use. These problems
include high power consumption during programming and a large area
requirement, which adds latency in routing and functional blocks. FPGAs also
consume a significant amount of power and configuration memory during
operation. Compared to ASICs, FPGAs also exhibit longer circuit delays (Lin et
al., 2008).
Configuration details
FPGA configuration occurs when a bitstream is written to a device’s
configuration port. The bitstream contains data for the SRAM cells that hold the
device’s configuration. There are two types of configuration ports: external and
internal. They have different interfaces to accommodate specific protocols and
connections. Xilinx FPGA devices support reconfiguration of regions of the device
during run-time. The smallest reconfigurable region is a configuration frame,
whose size varies depending on the device.
2.1.3 Reconfiguration Hardware
The processors used in computing may be classified into three types (Bobda,
2007). The first, a general purpose processor (GPP), employs data, a control path
and a data path to conduct computation, and does not necessarily alter the
existing hardware. The second, a domain-specific processor (DSP), is used in
situations in which a processor is only employed in one particular computation
area. DSP data paths and operations are fitted to a set of algorithms, which
reduces flexibility but boosts performance in the underlying domain. The third
type, an application-specific processor (ASIP), achieves the best performance by
executing the algorithm directly in hardware. Moreover, it does not employ
instructions, which implies that, unlike the other processors, it is not restricted
by the need for sequential execution.
The ideal processor would combine the flexibility of a GPP with the performance
of an ASIP. Modern FPGA technology makes this possible, as FPGAs can adapt to
different problems in the form of reconfigurable hardware, in which all or part
of the hardware structure can be changed during execution.
Despite the high static power consumption of modern FPGA devices, run-time
reconfiguration can create flexible hardware by increasing device utilisation
through device reconfiguration.
The architecture of FPGAs can be seen from the perspective of their configuration
capabilities: at the highest level, FPGAs can be separated into one-time
configurable devices and reconfigurable devices. Figure 1 illustrates the major
classifications of FPGAs in regard to their configuration capabilities.
Figure 1 Classification of FPGAs (Koch, 2013).
A globally reconfigurable device allows only the complete device configuration to
be exchanged, while partially reconfigurable devices permit the exchange of only
a fraction of the FPGA resources. Partial reconfiguration (PR) can be either
active or passive (i.e. the FPGA either continues or stops operating during
configuration).
2.1.4 Partial Reconfiguration
PR is the ability of a reconfigurable device to change a portion of
the reconfigurable hardware circuitry while the other portion is still running. Such
reconfigurable designs require modular circuits built from different
subcomponents. It is possible to swap out some of these subcomponents
while the FPGA is still running (Koch, 2013).
A full reconfiguration is normally performed while the FPGA is in reset, at
which time an external controller reloads the design into the chip. PR, by
contrast, lets critical parts of the design keep operating while other parts
are exchanged. In addition, PR allows multiple modules to share the same area at
run-time: the partially reconfigurable modules expected to be exchanged are
stored and loaded on demand. Figure 2 illustrates the baseline model of PR.
Figure 2 Baseline model of partial reconfiguration (Koch, 2013).
Figure 2 shows how active modules are placed exclusively within the
reconfigurable region and how swapping between modules is accomplished by
writing a partial configuration bitstream to the configuration port, as seen
from the configuration data stream on the right-hand side of Figure 2.
PR is available in most modern FPGAs and allows a subset of the logic fabric to
be dynamically reconfigured while the remaining logic continues to operate
undisturbed. FPGA vendors Xilinx and Altera both include this feature on their
high-end FPGAs. PR is not only necessary for general purpose reconfigurable
systems but is also preferred for its extensibility and flexibility (Koch, 2013).
To undertake partial run-time reconfiguration, the device must provide hardware
support for it. Reconfiguration in one section of the device must not
stop operation in other sections. PR may be classified according to how
frequently reconfiguration can be applied relative to the operating clock cycle.
These classifications are: single-cycle reconfiguration (frequently applicable),
sub-cycle reconfiguration and multi-cycle reconfiguration (seldom applicable).
In multi-cycle reconfiguration, reconfiguration requires more than a single
system clock cycle because the reconfiguration data is transferred from memory
to the configuration cells serially. Single-cycle reconfiguration occurs when
the logic on the device is changed within a single cycle of the system clock.
Context switching may not be undertaken in run-to-completion modules, as the
module’s internal state would not be stored.
The reconfigurable system is divided into two important parts. The part of the
system that is always present is called the static region, and can include a
memory controller, a soft CPU or configuration port interface logic. The second
part, which contains run-time reconfigurable modules, is typically provided as one
or more partial regions. Different methods of conducting PR exist, including small
changes in net lists, routing and LUT functions, or even large module replacement
(Koch, 2013).
Style of module placement
There are various methods available for PR; for example, the manner in which the
area set aside for PR is used categorises PR into different styles of
configuration. One method of conducting PR is to substitute a larger portion of
logic, known as a module, at every reconfiguration. This is termed module-based
reconfiguration. Modules can be placed in the PR area in three ways: a) only one
module per reconfigurable region, b) in a one-dimensional fashion or c) in a
two-dimensional fashion. Figure 3 shows the partial region and the different
styles in which modules can be arranged in it.
Figure 3 Styles of reconfigurable modules placement. (a) Island style. (b) Slot style. (c)
Grid style (Koch, 2013).
Island styles are supported by the Xilinx PR flow. In the "island style", only one
module is present in a PR region at a time, while switching between modules is
carried out from the static part of the system. A PR region has to be large
enough to accommodate any of the modules that the system will need. The design
could use a single island or multiple islands. With the latter, the developer
should consider that the same resources will be shared by all of the islands. In
the "slot style", on the other hand, the PR region is divided into slots of equal
size, so it is not limited to one module as in the "island style". Varying slot
requirements for different modules could cause fragmentation inside the PR region. As
a result, replacing modules in the "slot style" will not be as straightforward as in
the "island style", in which there is only the matter of choosing between the islands
(Koch, 2013).
Module footprint
Interchanging modules between the various islands/slots on the device requires
the designer to consider the resources that a module needs, as well as the
structure of the FPGA fabric and the manner in which resources are laid out on
the device. A PR module has a resource footprint that has to match the resource
footprint of the FPGA at the target position. Therefore, when a module is moved
to a new group of slots, the slots have to fit the module footprint exactly.
Permitting module relocation brings challenges. One is that signal timing
changes with the position to which the module is relocated, so a timing
footprint has to be considered as well. Some sections of the FPGA could also
have longer routing delays due to hidden features, for instance the
configuration logic.
Spartan-6 configuration
Configuration frames are an integral component of the Spartan-6. The
configuration frames of Spartan-6 devices can be classified into three types
that hold data for different parts of the device (Xilinx Inc, 2013): Type 0;
Type 1, for the Block RAM; and Type 2, for the IOB. Configuration is conducted
using three kinds of operations offered by the configuration logic: "00": NOP;
"01": READ; and "10": WRITE. A configuration command is executed when a
configuration register is written with data (Xilinx Inc, 2011). Every
configuration register is described in the Spartan-6 configuration user guide
(Xilinx Inc, 2015). Configuration data is organised into two kinds of packets:
Type 1, which carries short blocks of 16-bit data words; and Type 2, which can
carry long blocks of 16-bit data words.
Spartan-6 bitstream
To configure a Xilinx device, a bitstream must be applied to one of the
configuration interfaces. The bitstream, as mentioned before, is an
encapsulation for the configuration data packets. The format of the bitstream in
Spartan-6 devices is as follows (Xilinx Inc, 2015):
• Dummy words: prepare the pipeline of the configuration interface for data.
• Synchronisation words: two 16-bit words used for synchronisation
(0xAA99 and 0x5566).
• Header.
• Configuration body.
• Header2.
• De-synchronisation word: one 16-bit word signalling the end of the
bitstream (0x000D).
During reconfiguration, the header is used to set up the configuration
registers, whereas the configuration body carries the data that is written to
the configuration frames of the device. Header2 can also be used to set
further configuration registers.
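The framing described above can be illustrated with a minimal Python sketch. It only locates the synchronisation pair and the trailing de-synchronisation word quoted in the text; the big-endian word order, the 0xFFFF dummy words and the toy payload are assumptions made for illustration and are not taken from a real bitstream.

```python
SYNC_WORDS = (0xAA99, 0x5566)  # synchronisation words quoted in the text
DESYNC_WORD = 0x000D           # de-synchronisation word quoted in the text


def to_words(data: bytes) -> list:
    """Split a byte string into 16-bit words (big-endian order is assumed)."""
    if len(data) % 2:
        raise ValueError("bitstream length must be a multiple of 2 bytes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def find_sync(words: list) -> int:
    """Return the index of the first word after the sync pair, or -1 if absent."""
    for i in range(len(words) - 1):
        if (words[i], words[i + 1]) == SYNC_WORDS:
            return i + 2
    return -1


# Hypothetical toy stream: two dummy words, the sync pair, two payload words
# and the de-synchronisation word.
toy = bytes.fromhex("FFFF" "FFFF" "AA99" "5566" "1234" "5678" "000D")
words = to_words(toy)
start = find_sync(words)
payload = words[start:]
print(start, [hex(w) for w in payload])
```

A real configuration controller would additionally interpret the header and body packets, which this sketch leaves untouched.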
Internal Configuration Access Port (ICAP)
During run-time reconfiguration, the system has to write the configuration
data into the configuration cells. On Xilinx devices, this means writing data to
the Internal Configuration Access Port (ICAP). ICAP can be considered the
internal version of the SelectMAP port, one of the external configuration ports
on Spartan-6. The following table shows the configuration speeds achieved.
Bit width   Frequency (MHz)   Configuration speed (Mb/s / MB/s)
8 bit       100               800 / 100
16 bit      100               1600 / 200
Table 1 Configuration speeds achieved with ICAP (Hansen, Koch and Torresen, 2011).
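The figures in Table 1 follow from multiplying the port width by the clock frequency. The short Python sketch below reproduces them under the assumption that one word is transferred on every clock cycle; the function name is hypothetical.

```python
def icap_throughput(bit_width: int, freq_mhz: float) -> tuple:
    """Return (Mb/s, MB/s), assuming one word is transferred per clock cycle."""
    mbit_per_s = bit_width * freq_mhz
    return mbit_per_s, mbit_per_s / 8


for width in (8, 16):
    mbit, mbyte = icap_throughput(width, 100)
    print(f"{width}-bit @ 100 MHz: {mbit:.0f} Mb/s = {mbyte:.0f} MB/s")
```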
On Spartan-6 devices (Xilinx Inc, 2015), the ICAP_SPARTAN6 primitive has an
input (I) data port that can accept 8- or 16-bit words of configuration data and an
output (O) port which is used for read-back of configuration data already present
on the device. The primitive is controlled by setting the write enable
(WRITE) and clock enable (CE) signals, and data is read or written by the
primitive on the rising edge of the clock (CLK).
Relocation of partial module bitstreams
Module relocation occurs when the system is able to move modules between
various slots, as opposed to fitting a module to one particular slot in the PR
area. The benefit of module relocation is flexibility in module placement.
Challenges such as external fragmentation can be handled with ease because
modules can be moved between slots. In addition, this flexibility makes
placement and module scheduling much easier, because every module fits more
than a single slot. There are various methods of implementing module
relocation. One is to create a separate bitstream for every slot in which a
module may be placed. A key technique for reducing storage in a system that
supports module relocation is to keep position-independent bitstream data
separate from position-dependent data, so that only the position-dependent
data must be kept for every position.
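The storage argument can be made concrete with a small worked calculation in Python. All sizes below are hypothetical and chosen only for illustration; the source does not report how the data of a real module splits into position-independent and position-dependent parts.

```python
def storage_per_slot_copies(module_bytes: int, slots: int) -> int:
    """One complete bitstream is stored for every slot the module may occupy."""
    return module_bytes * slots


def storage_split(independent_bytes: int, dependent_bytes: int, slots: int) -> int:
    """Position-independent data is stored once, position-dependent data per slot."""
    return independent_bytes + dependent_bytes * slots


# Hypothetical module of 10,000 bytes that may be placed in 4 slots, of which
# 400 bytes are assumed to depend on the position.
full = storage_per_slot_copies(module_bytes=10_000, slots=4)
split = storage_split(independent_bytes=9_600, dependent_bytes=400, slots=4)
print(full, split)
```

Under these assumed numbers the split representation needs roughly a quarter of the storage, and the saving grows with the number of slots.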
2.2 Microprocessor Architecture
2.2.1 RISC Microprocessor
RISC, or Reduced Instruction Set Computer, is a type of microprocessor
architecture whose instruction set consists of small, simple, fixed-size
instructions, with the aim of making the whole architecture faster by executing
them within one cycle. Moreover, RISC CPUs access memory less often, because
they are designed with a larger number of registers and only the dedicated load
and store instructions may access memory. By contrast, in CISC (Complex
Instruction Set Computer) architectures, many different instructions can access
memory. Well-known RISC processors that are widely used in hardware devices
around the world include DEC Alpha, AMD Am29000, ARC, ARM, Atmel
AVR, Blackfin, Intel i860 and i960, MIPS, Motorola 88000, PA-
RISC, Power (including PowerPC), RISC-V, SuperH, and SPARC.
2.2.2 Soft-Core Microprocessor
Soft-core processors are implemented entirely through logic synthesis and can be
mapped onto different semiconductor devices containing programmable logic. There
are many soft-core processors that have been targeted for FPGA implementation. A
typical soft-core CPU includes an instruction set, a register file,
arithmetic-logic units and other optional features. The performance of soft-core
CPUs implemented on FPGAs is generally lower than that of processors implemented
as ASICs, because the reprogrammable fabric adds overhead that is not found in
the ASIC architecture. However, a soft-core CPU can be improved after deployment
if a problem with the design is found. This is one of the advantages of
FPGA technology over ASIC technology. For example, a new performance
requirement of the CPU can be met by adjusting the parameters of the soft core
on the FPGA.
As mentioned above, there are many types of soft-core CPUs and corresponding
development tools. Some popular soft-core CPUs include Xilinx MicroBlaze,
Altera Nios/Nios II and LatticeMico32. These CPUs come with logic and memory
elements and several intellectual property peripherals, which are required for
the rapid development of a System-on-Programmable-Chip.
A number of the soft-core processors that have been developed using FPGA
technology are discussed below, and their functional details and performance
provided (Levy and Conte, 2009).
MicroBlaze soft-core processor
One of the most popular soft-core processors is the MicroBlaze soft-core
processor from the FPGA vendor Xilinx. It has a 32-bit Reduced Instruction Set
Computer (RISC) architecture and can be customised with a number of memory
and peripheral configurations. It has three pipeline stages with
variable-length instruction latencies. The Xilinx Platform Studio software can be
used in the design process; it provides a user-friendly environment that is able
to generate a MicroBlaze system. The architecture adopts the Harvard
memory architecture, which consists of two local memory buses: one that is used
to connect the data memories and one that is used for the instructions. The
number and size of memory peripherals can be selected by the user. The
processor is capable of operating at up to 200 MHz in Virtex-4 devices (Le Gal
and Jego, 2013).
NIOS II Soft-core processor
This soft-core CPU has a load-store RISC architecture. The processor has
many architectural parameters that may be configured easily at design time.
For example, the user can choose between a 32-bit or 16-bit datapath width and
select the cache and register file sizes. Custom instructions allow the user to
customise the hardware, which can be used to accelerate the CPU.
Off-the-shelf intellectual property is readily integrated, which reduces the
time required to set up and design an SoC
(Microelectronics International, 2012).
Mico32 Soft Processor core
This is another example of a soft-core processor, but one that differs in many
ways from the two examples discussed above. Although it employs a
RISC architecture just like them, it is completely open.
Additionally, it uses a smaller number of LUTs on the FPGA, which makes it
cheaper than the others, and it is easy to configure with the options
required by an application (Chu, 2008).
2.2.3 MIPS Architecture
MIPS Overview
In this project, the MIPS architecture, shown in Figure 15 below, will be used as a
demonstrator for the custom instruction implementation in hardware. It is used to
implement a 32-bit embedded system. Moreover, it is an example of a RISC
architecture and one of the most widely supported processors, and it has been used
in research on efficient processor organisations that deliver high
performance and high power efficiency.
The original MIPS architecture consists of the following functional blocks:
• Instruction decoder: decodes the simple MIPS instructions, all of which are
the same size and use one of only three formats.
• Programme Counter (PC): holds the address of the instruction currently being
executed and is then incremented by 4 to point to the next instruction. For a
branch or jump instruction, a delayed branch occurs: one more instruction is
executed, and the value provided by the branch or jump instruction is then used
to update the instruction address.
• Arithmetic Logic Unit (ALU): a fundamental block of the CPU that performs
arithmetic and logical operations on its operands, i.e. the data inputs of the
ALU, from register to register, memory to register or vice versa.
• Registers: the MIPS processor has 32 general purpose registers, including
register 0, which holds a constant zero. The other registers are used by the
compiler as outlined in the "MIPS32® Instruction Set Quick Reference".
• Memory: accessed only via load and store instructions.
Pipeline registers are often placed between the functional blocks in order to allow
the processor to run at high clock speeds and to minimise the delay. The
MIPS processor has been designed to use pipelining to improve throughput and
performance. It includes a 5-stage pipeline: Instruction Fetch, Instruction Decode,
Execute, Memory access and Register write back.
MIPS Instruction Set
The MIPS instruction set is divided into three core groups of instructions. Each
one of them has its own encoding, as illustrated in the following table.
Instruction type   Bits 31-26   25-21   20-16   15-11   10-6    5-0
R-type             opcode       rs      rt      rd      shamt   funct
I-type             opcode       rs      rt      immediate
J-type             opcode       address
Table 2 Type of MIPS instructions (Fritzell, 2013).
Table 2 shows that each type has a 6-bit main opcode that is used by the
decoder to determine the instruction, while the fields rs, rt and rd are
register addresses in the register file. The instruction types are used as follows:
• R-type instructions are arithmetic instructions that take two operands from
the register file, rs and rt, and return the result of the operation to
register rd. R-type instructions may share their opcode with other
instructions, in which case the funct code determines the operation.
• I-type instructions are load/store instructions that combine a register, rs,
with a constant value, coded as the immediate, and return the result to
register rt. The I-type format is also used for branches, where the
immediate is added to the current PC to perform the branch.
• J-type instructions are jump instructions that provide a new address for the
programme counter. This moves execution to a new code
block.
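The field layout of Table 2 can be sketched as a small Python decoder. It is an illustration of the bit positions only, not the decoder used in this project; the function name and the dictionary layout are assumptions.

```python
def decode(word: int) -> dict:
    """Split a 32-bit MIPS instruction word into the fields of Table 2."""
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31-26, present in all formats
        "rs": (word >> 21) & 0x1F,       # bits 25-21 (R-type and I-type)
        "rt": (word >> 16) & 0x1F,       # bits 20-16 (R-type and I-type)
        "rd": (word >> 11) & 0x1F,       # bits 15-11 (R-type)
        "shamt": (word >> 6) & 0x1F,     # bits 10-6  (R-type)
        "funct": word & 0x3F,            # bits 5-0   (R-type)
        "immediate": word & 0xFFFF,      # bits 15-0  (I-type)
        "address": word & 0x03FFFFFF,    # bits 25-0  (J-type)
    }


# 0x012A4020 is the standard encoding of "add $t0, $t1, $t2" (R-type).
fields = decode(0x012A4020)
print(fields)
```

All fields are extracted regardless of format; a real decoder would first inspect the opcode and then use only the fields that belong to that format.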
2.3 Reconfigurable CPU Instruction Set Extensions
Many different applications can be handled using only general purpose
processors (GPPs). However, most applications use only a small subset of the
instructions available in the GPP. Therefore, adding a small amount of dedicated
hardware for an application can give a large improvement in execution time. A
compression algorithm, for example, may need to count the number of one-bits
in a vector. Adding a dedicated hardware instruction for this operation speeds
up the algorithm.
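To illustrate why counting one-bits is a natural candidate for a dedicated instruction, the following Python sketch shows the operation as a processor without such an instruction would have to compute it, one bit per loop iteration. The function name and the fixed 32-bit width are assumptions.

```python
def count_ones_software(value: int) -> int:
    """Count the set bits of a 32-bit word using only shift, mask and add."""
    value &= 0xFFFFFFFF
    count = 0
    for _ in range(32):      # one iteration per bit position
        count += value & 1   # add the lowest bit to the running total
        value >>= 1          # move the next bit into the lowest position
    return count


print(count_ones_software(0b1011))
```

Each iteration corresponds to several standard instructions, whereas a custom hardware instruction could return the same result in a single instruction.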
Extending the instruction set of a CPU is one way to do this, allowing for
hardware acceleration of small parts of an application. The MicroBlaze and
Nios soft-core CPUs from Xilinx and Altera are good examples of CPUs that allow
custom instructions with the benefits of a fast RISC machine. The next section
highlights the interesting points regarding custom instructions.
2.3.1 Custom Instructions in Hardware
Custom instructions enable a designer to replace a complex sequence of
standard instructions with a single, simpler instruction built in hardware.
Figure 4 gives a simple description of implementing such a custom instruction in
a MIPS CPU, where it can access the register file in the same way as the ALU.
Figure 4 a) a typical CPU b) extensions CPU with Reconfigurable Instructions (Koch, 2013).
Figure 4 shows that the CPU can be extended with exchangeable instructions by
decoding instructions that are unused in the original CPU ISA. A
multiplexer is then used to select between the normal ALU operation and one or
more user-defined instructions. In this way, the configurable instruction can be
integrated into the CPU (Koch, 2013).
The custom instruction logic block has two input ports and one output result, as
shown in Figure 4. Often, custom instructions operate in a single clock cycle.
However, a multi-cycle operation can be considered for longer combinational paths.
Through the use of custom instructions, it becomes possible to tailor the processor
core to a certain application.
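The selection between the ALU and a custom instruction block can be modelled with a minimal behavioural sketch in Python. The opcode value 0x3F is a hypothetical choice for an unused opcode, and the custom unit shown here (a one-bit counter that ignores its second operand) is only a stand-in for the reconfigurable logic block.

```python
MASK32 = 0xFFFFFFFF
CUSTOM_OPCODE = 0x3F  # hypothetical unused opcode reserved for the custom unit


def alu(funct: int, a: int, b: int) -> int:
    """Tiny subset of the normal ALU, selected by the R-type funct field."""
    if funct == 0x20:          # add
        return (a + b) & MASK32
    if funct == 0x24:          # and
        return a & b
    raise ValueError(f"funct {funct:#x} is not modelled")


def custom_unit(a: int, b: int) -> int:
    """Stand-in for the custom logic block: two inputs, one result."""
    return bin(a & MASK32).count("1")


def execute(opcode: int, funct: int, a: int, b: int) -> int:
    """Models the multiplexer that picks the ALU or the custom unit result."""
    if opcode == CUSTOM_OPCODE:
        return custom_unit(a, b)
    return alu(funct, a, b)


print(execute(0x00, 0x20, 2, 3), execute(CUSTOM_OPCODE, 0, 0b1011, 0))
```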
One way to emulate configurable instructions is to add large
reconfigurable accelerator modules, selected by a multiplexer, that are placed
outside the CPU on the system bus. However, this approach involves additional
cost. Another way to configure such a custom instruction in hardware is to use
run-time partial reconfiguration. The custom instruction can be placed in small
slots/islands close to the MIPS CPU, which could cause routing congestion
because a high number of signals need to enter the small area.
Devices from Altera or Xilinx are supported by design flow tools such as
PlanAhead, OpenPR and GoAhead. Such a design flow implements the interface
between the static system, which includes the MIPS CPU, and the partial system
containing the custom instructions, using bus macros, proxy logic or the direct
wiring technique, as provided by PlanAhead, OpenPR and GoAhead respectively.
Fritzell (2013), who proposed a fast dynamic partial reconfiguration system using
GoAhead, argued that with a high number of signals and small islands/slots,
design flows using bus macros or proxy logic do not give good results because
of the communication overhead. He showed that by using GoAhead with
the direct wire approach, custom instructions can be implemented
very efficiently in small islands/slots, and that the modules can be relocated.
The benefits of allowing a custom instruction to be relocated to more than one
slot are flexibility of slot utilisation, reduced external fragmentation
and the removal of unnecessary reconfiguration calls, as mentioned by Koch et
al. (2010). As a result, the processor needs a look-up table that stores the
location of the slot holding each custom instruction, so that the decoder knows
from which custom instruction slot the result should be routed (Fritzell, 2013).
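The look-up table mentioned above can be sketched as a small Python class. This is only a behavioural illustration under the assumption that custom instructions are identified by integer IDs; the class and method names are hypothetical and do not describe Fritzell's implementation.

```python
from typing import List, Optional


class SlotTable:
    """Maps each reconfigurable slot to the custom instruction it holds."""

    def __init__(self, n_slots: int) -> None:
        # None marks an empty slot.
        self.slots: List[Optional[int]] = [None] * n_slots

    def configure(self, slot: int, instr_id: int) -> None:
        """Record that instruction `instr_id` has been loaded into `slot`."""
        self.slots[slot] = instr_id

    def lookup(self, instr_id: int) -> Optional[int]:
        """Return the slot whose result should be routed, or None if not loaded."""
        for slot, loaded in enumerate(self.slots):
            if loaded == instr_id:
                return slot
        return None


table = SlotTable(n_slots=4)
table.configure(slot=2, instr_id=7)
print(table.lookup(7), table.lookup(5))
```

A lookup that returns None corresponds to the case in which the instruction is not present in hardware and a reconfiguration has to be triggered.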
2.3.2 Custom Instructions in Software
Reconfiguration of custom modules can be done either by run-time partial
reconfiguration or by a multiplexer that emulates the configuration process, as
already mentioned above, and the reconfiguration time can be the biggest
overhead. There are two fundamental options for triggering the configuration
process:
Explicit approach: the custom instruction is loaded at
execution time by the user or by the program, before the processor needs it.
Hauck (1998) proposed this method in the form of configuration pre-fetch
instructions issued before the instruction is called. It can be fast. However,
the speed of the configuration controller and the size of the bitstream affect
the time that the reconfiguration of the custom instruction takes. Consequently,
the processor must be stalled if the configuration of the custom instruction has
not finished before the processor calls it.
Implicit approach: an exception trap is triggered when the processor detects
that the custom instruction is not in hardware. The trap handler handles the
configuration of the custom instruction that the processor needs. The trap
handler could instead run a software function that performs the operation while
the custom hardware is not configured (Lynch, Forin and Pittman, 2006). This
approach can remove a lot of overhead by not stalling the CPU while the
configuration is in progress. However, handling the trap itself takes time.
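The two triggering options can be contrasted with a toy Python model. It assumes that a trap both runs a software fallback and marks the instruction as configured for later calls; this combination, like all names in the sketch, is an assumption made for illustration rather than the behaviour of a specific system.

```python
from typing import Callable, Set


class CustomInstructionManager:
    """Toy model of explicit (pre-fetch) and implicit (trap) configuration."""

    def __init__(self) -> None:
        self.configured: Set[str] = set()  # instructions currently in hardware
        self.traps = 0                     # number of traps taken so far

    def prefetch(self, name: str) -> None:
        """Explicit approach: load the instruction before it is needed."""
        self.configured.add(name)

    def call(self, name: str, operand: int, emulate: Callable[[int], int]) -> int:
        """Execute a custom instruction, trapping if it is not configured."""
        if name not in self.configured:
            # Implicit approach: the trap handler runs the software function
            # and requests configuration so that later calls hit hardware.
            self.traps += 1
            self.configured.add(name)
        # The hardware result is modelled by the same Python function.
        return emulate(operand)


def count_ones(value: int) -> int:
    """Software version of the one-bit counting instruction."""
    return bin(value & 0xFFFFFFFF).count("1")


implicit = CustomInstructionManager()
implicit.call("count_ones", 0b111, count_ones)   # first call takes a trap
implicit.call("count_ones", 0b101, count_ones)   # second call does not

explicit = CustomInstructionManager()
explicit.prefetch("count_ones")                  # loaded ahead of time
explicit.call("count_ones", 0b111, count_ones)   # no trap
print(implicit.traps, explicit.traps)
```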
2.4 Design Considerations
The development of a customisable CPU on FPGAs requires the consideration of
critical system factors in order to attain the desired performance. Some of the
critical objectives that are normally taken into consideration include the speed of
the CPU, the memory, the power required and the speed with which the CPU can
access other components of the system. There is usually a trade-off between the
performance and the power required to attain such performance (Kulkarni, 2006).
The additional design considerations of a customisable configuration include the
architecture of the processor and its suitability for the targeted application. This
implies that the designer will have to take into consideration the size and type of
memory and peripheral bus. In addition, the designer will have to decide on the
model and size of the address space available to the CPU, and on the size and
type of the instruction and data caches. It is also important to give
consideration to the type of controllers that are being used in the architecture.
Optional accelerators might be used to speed up the CPU (Deschamps, Sutter
and Canto, 2012).
It should also be mentioned that the operating system and the design and
development tools are part of the considerations that will have to be evaluated by
the designer. The biggest advantage of implementing the soft-core CPU using
FPGA lies in the fact that in the case of any mistake being committed during the
development phase, there is the possibility of repeating the process to reconfigure
the parameters afresh. There are no limits to the number of times the processor
can be reconfigured. This provides designers with a degree of design flexibility
(Kozyrakis and Patterson, 2004).
The designer will have to take into consideration the development and design
tools that will be used to develop the soft-core. The following figure provides an
illustration of the design and development tools. The design and development
tools are considered to be responsible for the parameterisation of the soft-core
and also the associated implementation of the peripherals (Kilts, 2007).
Figure 5 Design and development tools (Minev and Kukenska, 2007).
FPGAs allow extensive customisation alternatives that are not found in other
platforms such as ASICs. Additionally, FPGAs are considered to offer
optimisation techniques that help a designer to reach the required
performance metrics faster (Gebotys, 2002). The benefits of using an FPGA
platform in customising soft-core CPUs have also been reviewed. The development
of a customisable CPU on an FPGA requires the consideration of critical system
factors in order to attain the desired performance (Gebotys, 2012).
Evaluation of the design and development tools will help the designer to easily
and quickly attain the design requirements. It should also be noted that the wrong
choice of design and development tools can lead to system inefficiencies. The
design and development tools are considered to be responsible for the
parameterisation of the soft-core and also the associated implementation of the
peripherals (Synopsys, 2010).
2.5 Previous Work
Related work that is relevant to this project can be categorized into two parts:
instruction set extension and partial reconfiguration.
2.5.1 Instruction Set Extension
An example study regarding instruction set extension is that of Altera (2011). This
study demonstrates the ability to extend the NIOS-II CPU with custom instructions
using the SOPC builder wizard of the Quartus design tool. Integrating custom
instructions with a soft-core instruction set is a feasible way of speeding up
application execution in specific domains such as cryptography (Majzoub and
Diab, 2007). Some of the issues involved in the customisation of an instruction set
were analysed in detail by Galuzzi and Bertels (2011), who provided a
comprehensive overview of instruction-set extensions.
2.5.2 Partial Reconfiguration
A fair amount of literature has been published on partial run-time
reconfiguration of soft-core CPUs on FPGAs. These studies have shown that PR
reduces the size, weight, power and cost of an FPGA system. The use of design
techniques to increase performance and resource utilisation of reconfigurable
soft CPUs was studied by Wold et al. (2012). They investigated the appropriate
instruction implementation technique for a soft CPU that can achieve a
performance improvement while at the same time reducing the resource
requirement. This is a different task from, but fairly closely related to, the
aim of this project. Their goal was to improve soft CPUs for FPGAs using partial
reconfiguration. For example, they presented a classification method that
determined the parameters for selecting the most suitable instruction based on
profiling. Instruction Set Extensions, Software Emulation, Reconfigurable
Instructions and ISA Subsetting are the optimisation
techniques used in their methodology.
Reconfigurable instructions could have a critical side effect in terms of
configuration time. For example, stalling programme execution while waiting for
the reconfiguration process to complete causes an overhead
(Wold et al., 2012).
Another study by Koch, Beckhoff and Torresen (2010) presented an approach to
reduce this overhead. They examined the problem that arises when
communication needs extra logic, or when the placement of reconfigurable modules
is restricted by the static system, either of which causes additional logic
overhead. They presented a novel tool called ReCoBus-Builder. In a case study,
modules of different sizes and latencies were integrated with soft CPUs using
partial run-time reconfiguration without causing any logic overhead. For this
project, the newer tool GoAhead, which is a complete re-implementation of
ReCoBus-Builder, will be used. This study, however, will provide a library of
dynamic instruction set extensions.
Chapter 3
3 System Design and Methodology
This chapter presents the methodology that has been adopted in this project, the
implementation tools and the system design.
3.1 System Development Methodology
Designing and developing an effective customisable soft-core processor is a
challenging task, especially with little experience in processor and system design.
Therefore, a system development lifecycle method and a step-by-step design
approach are appropriate. This progressively develops the researcher’s
experience both in this important computer engineering field and in developing
an effective system using partial reconfiguration.
Figure 6: The general approach of the system development stages (Soft, 2013).
Figure 6 shows the general lifecycle stages that were used in this project in order
to develop a processor. The requirement analysis stage has already been
introduced in the objectives section of the Introduction chapter on page 13. The
design and implementation stages used a step-by-step design and implementation
method (Elkateeb, 2011), as shown in Figure 7, and this will be discussed below in
this section. The testing and evolution stages will be introduced in Chapter 4 and
will use an appropriate approach for FPGA Embedded Processors design and
evaluation (Fletcher, 2005), such as comparing the system against a software
implementation and comparing it with the benchmark system and other real-world
systems. Finally, some techniques for optimizing the performance and cost of an
FPGA MIPS processor system will be discussed.
Figure 7: A step-by-step design and implementation method.
When using such a step-by-step design and implementation method, the
customisable soft-core processor is built by gradually integrating the
processor module with other system modules, and by developing further modules,
to reach the final customisable soft-core MIPS processor design with the help of
partial reconfiguration. Each of the steps is briefly described below.
First step: MIPS CPU: The soft-core is the brain of the system. A MIPS
CPU has been implemented in one module, with an XOR gate at the top level so
that it can be synthesised, as shown in Figure 8. The reason for the XOR gate is
that the MIPS CPU uses more interface wires than there are I/O pins available on
the FPGA board. By XORing some of the CPU outputs, the CPU could be synthesised
for test purposes (e.g. to determine the clock frequency and resource utilisation).
The MIPS instruction encoding and implementation were tested with a test bench
in the Xilinx ISE package, as illustrated in the testing section in chapter 5.
Figure 8: First step, system overview.
Second step: Custom instruction in software: A GCC cross compiler is used
to compile the C code for MIPS. This compiler is modified to include the custom
instructions by assigning them to unused opcodes. The instruction decoder then
uses these opcodes to identify the custom instructions in the binary code. The
compiler was installed in a virtual machine running on the Windows operating
system.
Third step: One custom instruction in hardware: A custom module that will be
connected to the MIPS is chosen. A “Counting One” custom module is then added
as a component in the MIPS CPU. The MIPS will detect the custom
instruction and return the result from the custom module. Moreover, the MIPS
CPU module is connected to other modules such as ROM, RAM and GPIO via the
system bus.
Figure 9 Third Step, system overview.
Fourth step: Custom Instructions library in hardware: Four custom modules
are implemented. In addition, a Trap handler that is based on a multiplexer (MUX)
is developed (Appendix B) in order to choose one custom instruction, the one that
is called by the MIPS CPU. This approach incurs a logic overhead, as shown in
the results in chapter 5.
Figure 10: Fourth step, system overview.
Fifth step: Reconfiguration Custom instruction: There are different methods
for implementing reconfigurable custom modules in hardware as already
mentioned in the background chapter. In this project the following approaches
have been implemented.
First approach step: Improving the Trap handler: The Trap handler based on a
MUX is improved to handle the configuration process. In this approach, the trap
handler will be based on ICAP (Appendix C). It is done by implementing the trap
handler as a state machine that includes a table holding the addresses of the
configuration bitstreams for the different custom instructions, as will be introduced
later in section 4.3, and then uses the ICAP primitive to load the bitstreams
into the device. We exploit the fact that all academic boards come with serial
SPI memory that is often not used. The MultiBoot feature is applied in this project;
it allows the FPGA to load one of several configuration revisions. Spartan-6
FPGAs support MultiBoot in two configuration modes: BPI and SPI. The functionality
of this feature is described in detail in [Spartan-6 FPGA Configuration User Guide].
iMPACT is used to supply the starting address for each configuration
revision in order to generate the MultiBoot SPI file (Xilinx Inc, 2015). The SPI PROM
stores the configuration bitstreams for the different custom modules.
Consequently, when a custom instruction is needed by the MIPS CPU, the trap
handler checks whether that custom instruction is already configured; if it is not, a
different bitstream is loaded from the attached external memory (SPI PROM)
into the FPGA. As a result, the FPGA is reconfigured with a different
configuration bitstream. The testbench in the test section in chapter 5 shows the
functionality of this module.
The whole process works with full reconfiguration, that is, both the MIPS and
the extension are reconfigured. Reconfiguration only makes sense with partially
reconfigurable custom instructions, because rebooting the whole system each time
a different custom instruction is called is not practical. A second
approach therefore investigates whether MultiBoot can be used for partial
reconfiguration.
Figure 11 Fifth step: system overview of the first approach.
Second approach step: Exploiting the MultiBoot feature for partial
reconfiguration. As stated in Xilinx’s Partial Reconfiguration User Guide (2012),
PR is a technique for modifying the operation of the FPGA by loading a different
bitstream while it is performing its normal operation. With this technique the
design is translated into several bitstreams (bit files), where each one defines a
separate function and is loaded when required. Application-Specific
Integrated Circuits (ASICs) are manufactured in a fab and are designed to perform a
fixed function. FPGAs, on the other hand, offer the flexibility of being
reprogrammed, and most modern FPGAs can be programmed on site. In PR, the
operation of the FPGA is modified by programming a partial bitstream (also called
a partial bit file), which defines the operation of a subset of the programmable
blocks, so the whole FPGA fabric is not reprogrammed. In such a scenario, a full
bit file that defines the operation of the whole FPGA is programmed into the
device first. Afterwards, depending on the requirements of the operation, a
partial bit file can be downloaded to modify the reconfigurable parts of the FPGA
while the other parts continue to operate without being affected. The conceptual
diagram of the partially reconfigurable system is shown below in Figure 12.
Figure 12 Fifth step: system overview of the second approach (Xilinx, 2012).
It can be seen that there is a Reconfigurable Block A in the system, which can be
loaded with one of the possible configurations defined by several BIT files, A1.bit,
A2.bit, A3.bit, and A4.bit. The logic in the FPGA design is divided into two different
regions: reconfigurable region and static region. The dark area of the FPGA block
represents reconfigurable regions and the lighter area shows the static region.
The functionality of the reconfigurable region is defined by the partial bit files and
can be re-programmed by loading one of the partial configurations, while the static
region continues to perform its operation and is not affected by the reprogramming
of the reconfigurable region.
The method of Partial reconfiguration offers several advantages, which include:
– This approach helps to reduce the area or size of the FPGA device required to
implement a given function, which means fewer logic blocks are consumed;
hence, as a result, it also reduces the cost and power consumption of the
device.
– This approach helps to implement and test multiple algorithms or methods that
perform a specific function. In such a case, the implementations can
be loaded one after another and compared against each other.
– This technique enhances the design security as specific user dependent
keywords or codes can be included into the reconfigurable region and
reprogrammed by the end user.
– This approach enhances the fault tolerance in the FPGA design, where any
malfunctioning regions or parts can be reprogrammed by the user and can be
debugged.
– This approach enables the designer to divide the complete design into multiple
regions or blocks, and these blocks can be added to the FPGA design
incrementally; hence, it speeds up the FPGA design and verification process.
In our partially reconfigurable system, there is a partial reconfiguration controller
implemented in the static region. This partial reconfiguration controller
retrieves the partial bitstreams from a memory connected to the FPGA and then
forwards them to a configuration port. There are two options for the partial
reconfiguration controller: it is implemented either in an external device, such as a
separate processor, or in the static region of the FPGA design. When the
partial reconfiguration controller is located inside the static region of the FPGA,
the partial bit files are loaded using the ICAP interface. Like the other logic in the
static region of the FPGA, the partial reconfiguration controller logic functions
without being affected by the programming of partial bit files.
The fundamentals and concepts of partial reconfiguration for any system
design are discussed above. However, nothing in the documentation provides
information on using the ICAP primitive to send the command sequence for
loading configuration bitstreams with the MultiBoot feature for partial
reconfiguration. From this point on, partial reconfiguration is applied. The code
will later be changed to include a black box that represents the custom
instruction wrapper in order to perform bottom-up synthesis, which is an
important concept when implementing partial reconfiguration. Figure 14 in
section 3.3 illustrates this approach.
3.2 Implementation Tools
3.2.1 Hardware Description Language
The circuit for an FPGA is developed using a Hardware Description Language
(HDL). The two most popular hardware description languages used for FPGAs are
Verilog and VHDL. Hardware description languages are used to design circuits
and to capture the complexity of large circuits, and they can
significantly increase the productivity of the design process (Wold, et al., 2012). In
short, a hardware description language can be compared to an imperative
programming language. However, there are many fundamental differences
between the two kinds of language. Normal programming languages are
used to create programs that are executed by microprocessors, whereas
hardware description languages are designed to produce hardware circuits. They
are capable of describing circuit hierarchy and connectivity, providing a built-in
mechanism for simulating circuit behaviour in software, and expressing the
inherent parallelism of separate circuit components (Hauck and Wilson, 1999).
3.2.2 Xilinx ISE (Xilinx, 2013):
This is the Integrated Synthesis Environment software tool provided by the
FPGA vendor Xilinx. It is used for the synthesis and analysis of HDL designs and
enables designers to compile their HDL designs (such as VHDL and Verilog
files), to perform timing analysis, to view RTL schematics, to simulate a behavioural
model, and to generate bitstreams that configure the target FPGA device.
VHDL, like other hardware description languages, supports different levels of
abstraction. The commonly applied
abstraction levels include behavioural and structural modelling. A module
encapsulates a circuit by defining its interface, so that the circuit
communicates with the outside world through its input/output ports.
Modules are comparable to classes in object-oriented programming: a module
is normally defined once and then instantiated several times. Different instances
of a module operate simultaneously, and they can be connected,
mapped and routed using the signals that link their inputs and outputs.
ISim simulator: Hardware description languages normally come with
simulation features, which provide an insight into how the circuit will function
when fabricated. This helps to reduce the risks and costs that are associated with
real fabrication processes. Simulation is normally considered to be crucial in the
design and implementation of hardware circuits because it is both economical and
practical. Different levels of granularity are supported for the
simulation of a circuit. The initial stage of simulation seeks to determine the
behavioural correctness of the circuit. In this case, an appropriate test bench is
generated and applied to the circuit. The expected results of such a test bench are
already known before the simulation. The simulation results obtained are
compared to the expected results, and the comparison is used to assess the
correctness of the designed circuit.
3.2.3 Cross compiler:
A cross compiler is a compiler which generates code that can be run on a different
system, for example, compiling C code for MIPS architecture (Gnu.org, 2015). For
this project, a GNU cross compiler will be adapted to use reconfigurable
instructions through inline assembly calls.
3.2.4 FPGA Platform:
A Nexys3 digital circuit development platform, which is based on the Xilinx
Spartan-6 LX16 FPGA, was used and is shown in Figure 13. The Spartan-6 FPGA
is used for implementing the reconfigurable ISA extension. It provides high
performance at low resource cost and includes the following features (Digilent,
2013):
– 2,278 slices, each containing four 6-input LUTs and eight flip-flops
– 576 Kbits of fast block RAM
– two clock tiles (four DCMs and two PLLs)
– 32 DSP slices
– 500 MHz+ clock speeds
Figure 13 Xilinx Spartan-6 LX16 FPGA platform (Nexys3™ Board Reference Manual, 2013).
3.2.5 GoAhead
GoAhead is a tool for implementing partially reconfigurable systems. It
supports all of the recent Xilinx FPGAs and provides some features that the Xilinx
PR tool chain does not offer, including (Beckhoff, et al., 2012):
– Partial modules are implemented completely independently of
the static design.
– Modules can be relocated and instantiated multiple times.
– Modules can be integrated without any logic overhead ("no bus macro or proxy
logic required").
– Hierarchical reconfiguration is provided, which allows the implementation of
a PR module inside a PR module.
– Communication architectures can be generated that enable multiple PR modules
to be hosted simultaneously in the same PR region.
3.3 System Design
In this project, focus is put on embedded systems that have different requirements
in various application domains such as cryptography, network control systems and
image processing. This is due to the fact that an FPGA platform is the most
suitable device to adapt to changes in application requirements (Koch D, 2013).
There are four custom instructions that have been considered as extensions for
the MIPS processor and they are described below:
I. Count ones: Counts the number of ones in a 32-bit vector.
Counting the set bits in a vector is a common operation, known as the Hamming
weight, and it is used in the cryptography and network domains. For example, the
Hamming distance, that is, the number of bit errors between
two binary numbers, is obtained by XORing the two numbers
and then counting the ones in the result
(Schiller, 2003).
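The Hamming weight and Hamming distance described above can be sketched in C as a software reference model. This is only an illustration of the operation, not the thesis's hardware module; the function names are assumptions chosen for this example.

```c
#include <stdint.h>

/* Count the set bits (Hamming weight) of a 32-bit word, one bit at a time. */
unsigned count_ones32(uint32_t v)
{
    unsigned n = 0;
    while (v != 0) {
        n += v & 1u;   /* add the lowest bit */
        v >>= 1;       /* move on to the next bit */
    }
    return n;
}

/* Hamming distance: XOR marks the differing bits, which are then counted. */
unsigned hamming_distance32(uint32_t a, uint32_t b)
{
    return count_ones32(a ^ b);
}
```

A model of this kind can serve as the expected-value generator when the hardware module is checked in simulation.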
II. 32-bit CRC: Takes two 32-bit operands and computes a CRC.
A cyclic redundancy check (CRC) is one of the most popular error detection
methods used in networks and in storage systems. It is very useful for detecting
errors that have occurred because of noise in the transmission channel of the
network. The transmitter and the receiver use the same divisor (generator
polynomial): the CRC calculation is performed on both sides, and
the result at the receiver should be zero if there is no error. The CRC can be
computed sequentially with a shift register and XOR gates, or in parallel with
XOR gates only
(Schiller, 2003).
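The sequential (shift register and XOR) form of the calculation can be sketched in C as follows. The thesis does not state the CRC parameters, so this sketch assumes the widely used IEEE 802.3 CRC-32 convention (reflected polynomial 0xEDB88320, initial value 0xFFFFFFFF, final inversion). With this convention the receiver compares the recomputed value with the transmitted one rather than checking for zero. The function name is hypothetical, and the two-operand interface of the custom module is not modelled.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY_REFLECTED 0xEDB88320u  /* assumed: IEEE 802.3 polynomial */

/* Bit-serial CRC-32: one shift and conditional XOR per input bit. */
uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;                /* assumed initial register value */
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];                        /* feed in the next byte */
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u)                      /* a one is shifted out */
                crc = (crc >> 1) ^ CRC32_POLY_REFLECTED;
            else
                crc >>= 1;
        }
    }
    return ~crc;                               /* assumed final inversion */
}
```

The parallel form mentioned in the text unrolls these shifts into a fixed network of XOR gates, which is what a single-cycle hardware module would use.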
III. Leading zero: Counts the zero bits before the first one bit, starting
from the MSB, in a 32-bit vector.
This computes the number of consecutive zero bits at the
most significant end (MSB) of the vector. It is often used for electronic digital
display devices, such as seven-segment displays, for
ordering numbers, or for preventing fraud in financial documents (Miller,
2004).
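A minimal C reference model of a leading zero counter is sketched below. It assumes that an all-zero input yields 32, which the thesis does not specify, and the function name is hypothetical.

```c
#include <stdint.h>

/* Count the zero bits above the most significant one bit of a 32-bit word.
 * Assumption: an all-zero input returns 32. */
unsigned count_leading_zeros32(uint32_t v)
{
    unsigned n = 0;
    /* Probe from the MSB downwards until the first one bit is found. */
    for (uint32_t probe = 0x80000000u; probe != 0 && (v & probe) == 0; probe >>= 1)
        n++;
    return n;
}
```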
IV. Parity: Counts the one bits of a 32-bit vector to generate the parity bit.
This is one of the simplest and most popular error detection methods. It can be
seen as a special case of CRC, when a 1-bit CRC is considered, or it can be used
together with other methods such as the Hamming weight to calculate the Hamming
distance as mentioned above, because only a number of XOR gates are needed to
calculate it. The output is a 32-bit vector whose least significant bit (LSB) is
the parity bit, generated with XOR gates, which indicates
whether the number of one bits in the input vector is even or odd (Schiller, 2003).
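The XOR-only parity calculation can be sketched in C by folding the word onto itself, which mirrors an XOR tree in hardware. This is an illustrative model; the function name and the convention that 1 means an odd number of ones are assumptions.

```c
#include <stdint.h>

/* Parity of a 32-bit word by XOR folding.
 * Returns a word whose LSB is 1 for an odd number of one bits, 0 for even. */
uint32_t parity32(uint32_t v)
{
    v ^= v >> 16;   /* fold the upper half onto the lower half */
    v ^= v >> 8;
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1u;  /* only the LSB carries the parity */
}
```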
3.3.1 System Definition and Scope
The overall project is comprised of two parts. One is the implementation of a
custom instruction module library, where we implement custom modules for
different operations like CRC, Ones counter, parity etc. The other part is the
implementation of the PR region of the FPGA, which is used to reconfigure the
reconfigurable region according to the requirements.
3.3.2 System Architecture and Components
The overall system is divided into two main regions, the static region and the
reconfigurable region, as shown in Figure 14. The static part includes all of the
major logic and the reconfigurable region only includes the custom module. The
MIPS CPU is the main controller of the system and it fetches
instructions from the instruction ROM. The MIPS CPU decodes the instructions
and performs the desired operations. When the MIPS CPU encounters an
instruction which is not implemented in its datapath, it starts a hardware trap
handler and sends the opcode of the desired operation to it. The trap
handler looks at the opcode and checks whether the desired instruction is already
loaded into the custom module; if it is, the operation is performed. If the desired
instruction is not loaded into the custom module, then the configuration manager
inside the trap handler loads the partial bitstream using the ICAP primitive, so
that a new partial bit file is loaded into the reconfigurable region, and then
the operation is performed. The whole process is carried out in hardware to
achieve the lowest latency for reconfiguration.
[Figure 14 block labels: Static Region, Reconfigurable Region, MIPS CPU, Instruction ROM, Trap handler, State machine, ICAP Controller, Custom module (reconfigurable), External Memory (outside FPGA)]
Figure 14 The final system design.
The system operates on a 50 MHz clock, derived internally from a top-level clock
through global buffers (BUFG), which distribute the clock at high speed and
provide the least possible skew between the MIPS and the peripherals
connected to the bus, which may be physically located far apart.
MIPS Soft-Core Processor
The CPU core is based on the MIPS I instruction set and is built into the system as a
soft-core processor. It is used as a platform demonstrator for reconfigurable
instruction extensions. Moreover, it is the main module that controls all the
other modules, and it raises a trap when the custom instruction exception
occurs. Figure 15 gives an overview of the MIPS.
Figure 15 The non-pipelined MIPS, showing the most important signals and logic
(Fritzell, 2013).
Peripheral component modules:
– Memory RAM: A static memory that provides write-before-read behaviour. In
other words, the data returned during a write cycle is the same as the data
being written. The memory module is synthesised into internal block memories
of the Spartan-6 FPGA architecture (Doulos.com, 2015).
– GPIO: General-purpose input/output (GPIO) covers any connection to
an input or output pin. The user can control these pins at run-time. GPIO
pins such as LEDs and switches are OFF by default (Fritzell, 2013).
– ROM: This module contains the machine code of the instructions, using the
ROM’s address as an index into this memory. The machine code is
generated with the help of a GCC cross compiler that compiles the C code
and runs the assembler to produce the binary code that is stored in this
array.
– UART: Universal asynchronous receiver/transmitter (UART). A UART module
can be added to the system. This unit allows the user to control the operation
of the MIPS CPU, the trap handler and other modules and to
check the status of the system. Additionally, the UART module can also be
used to load the configuration required by the ICAP module.
– System bus: All modules are connected via a baseline bus protocol consisting
of a chip select (CS) input signal, a write enable (WR_en) input signal, an Address
input signal, a Writedata input signal and a Readdata output signal, with the MIPS
as the only master module (Fritzell, 2013).
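To illustrate how a slave on this bus behaves, the following C sketch models one clock edge of the RAM module using the five bus signals listed above and the write-before-read behaviour described for the memory. The memory size, the type name and the function name are hypothetical; the real module is written in VHDL and is not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

#define RAM_WORDS 256u   /* hypothetical size, not taken from the thesis */

typedef struct {
    uint32_t mem[RAM_WORDS];
    uint32_t readdata;    /* Readdata output, updated on a clock edge */
} ram_slave_t;

/* One clock edge of the RAM slave as seen from the bus master (the MIPS). */
void ram_clock(ram_slave_t *ram, bool cs, bool wr_en,
               uint32_t address, uint32_t writedata)
{
    if (!cs)
        return;                            /* not selected: output is held */
    uint32_t idx = address % RAM_WORDS;    /* word index, wrapped to the array */
    if (wr_en)
        ram->mem[idx] = writedata;         /* the write is applied first ... */
    ram->readdata = ram->mem[idx];         /* ... so a write cycle returns the new data */
}
```

Because the MIPS is the only master, no arbitration is needed in such a model; the chip select alone decides which slave responds.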
Configuration controller module:
The Trap handler
The Trap Handler is a core module and is located in the static region of the
FPGA design. The trap handler is directly connected to the MIPS CPU by a
bus. This module can easily be modified so that multiple CPUs can use it to
load configurations at the desired places and run the operations. Whenever
the MIPS CPU encounters an instruction which is not implemented in its
datapath, there are two options: either to stall or to trigger the trap
handler. The trap handler is implemented so as to avoid a malfunction of the
CPU due to the non-implemented instruction.
The ICAP primitive
As we are using a Spartan-6 FPGA, the ICAP primitive (called ICAP_SPARTAN6)
is used to initiate the configuration process. It is implemented in the
FPGA's fixed logic. This primitive can be used to program the FPGA logic under
user control. Figure 16 shows the interface diagram of the ICAP Spartan-6
primitive and Table 3 gives a detailed description of the input and output
ports of the primitive.
[Figure 16 diagram: ICAP_SPARTAN6 block with ports Clk, CE, WRITE, I[15:0], O[15:0] and Busy]
Figure 16: ICAP Primitive (Xilinx Inc, 2015).
Table 3 Descriptions of using ICAP_SPARTAN6 Port (Xilinx Inc, 2015).
Custom modules
There are four custom instructions implemented in the design: CRC-32, Ones
Counter, Parity flag and Leading zero counter. The concept of
each custom instruction is taken from a different source; for example, a
CRC generator was used to generate the CRC-32 custom instruction module
(Outputlogic.com, 2015).
Each implemented module is assigned a CUSTOM ID, which distinguishes it
from the others. More custom instructions can be implemented and
added to the system by assigning a unique CUSTOM ID to each custom
instruction, as in Figure 17.
The CUSTOM ID is evaluated by the instruction decoder of the MIPS CPU in order
to run the corresponding module or to trigger the configuration process through
the hardware trap handler.
Figure 17 Custom Module.
Chapter 4
4 Implementation
This chapter discusses the implementation aspect and the technical issues and
the challenges faced.
4.1 Baseline MIPS Soft-Core
Because soft-core implementations are often sophisticated and come with
many design files, the CPU core in this system follows the MIPS implementation
proposed by Fritzell (2013), which is contained in one HDL file. It is modified
to support dynamically reconfigurable modules.
Pros and Cons
There are several advantages of the simple implementation style for the MIPS
CPU in the system. The MIPS CPU is small and runs at 50 MHz,
delivering 50 M instructions per second, which could be more than many
micro-controllers deliver. Moreover, the CPU traps a custom instruction if it is not
available and returns the correct result automatically to the register file; that is, it is
combined with a trap handler that handles the configuration process in a smart
way.
One disadvantage is that the CPU will become an application-specific processor if
the customisation extensions are considered, but the MIPS CPU itself will be just
used as an advanced state machine for the configuration controller. In addition, if
the MIPS is still too large for the application or the application needs to increase
the execution speed then reducing the memory size and removing unused
instructions could be a solution (Yiannacouras, et al., 2006).
One instruction per cycle
Most RISC CPUs are five-stage pipelined designs. However, completing
one instruction per cycle in a pipeline requires hazards to be handled in
each stage by adding the corresponding control logic.
The non-pipelined MIPS in this project executes one instruction per cycle
without the need for any hazard detection and handling. The VHDL code of the
non-pipelined MIPS, in Appendix A, highlights the most important blocks, which
are the instruction decoder, register file, ALU and program counter. Together
these handle the execution of one instruction in one clock cycle. As a
consequence, the MIPS design has long propagation delay paths between
flip-flops, which need to be minimised in order to achieve a high clock frequency.
In order to allow the execution of one instruction per cycle we use a ‘trick’ to avoid
waiting one clock cycle for the instruction memory output. The instruction
memory receives the address of the next instruction just before the next
clock edge. As a result, the instruction word is available at the
beginning of the clock cycle in which it is executed. As shown in Figure 18, the
next PC address is passed to the instruction memory instead of the PC, because a
BRAM is read synchronously and using the PC would add one clock cycle of
read delay. Timing can be affected in this case, since the address of the next
instruction is produced at the end of a long combinatorial path and has to meet
the setup requirements at the input of the instruction memory.
Figure 18 The Program Counter process overview that consists of extra logic and flip-flops
to handle branch and jump instructions. (Fritzell, 2013).
Delayed Branch
Delayed branch is a technique that is applied in order to avoid the effect of control
dependency “hazards” in a pipelined MIPS, and it is used in the non-pipelined MIPS to
handle branch and jump instructions, as already shown in Figure 18. If the branch
is taken, the instruction that immediately follows the branch instruction (the
delay slot) is executed before branching or jumping to the new address. Extra logic
and flip-flops hold the branch address while the delay slot instruction is executed.
Because of this, the instruction placed after a branch or jump in the MIPS code
is often a NOP instruction.
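The delayed branch behaviour can be illustrated with a small C model of the program counter update. This is a sketch under stated assumptions, not the thesis's VHDL: the structure and function names are hypothetical, instructions are assumed to be 4 bytes wide, and a branch placed inside a delay slot is not handled.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t pc;              /* address of the instruction being executed */
    bool     branch_pending;  /* a taken branch is waiting for its delay slot */
    uint32_t branch_target;   /* held branch address (the extra flip-flops) */
} pc_state_t;

/* Advance the PC by one instruction.
 * taken and target describe the instruction currently at s->pc. */
void pc_step(pc_state_t *s, bool taken, uint32_t target)
{
    /* A branch taken by the previous instruction takes effect only now,
     * after its delay slot (the current instruction) has executed. */
    uint32_t next = s->branch_pending ? s->branch_target : s->pc + 4;

    s->branch_pending = taken;   /* remember this branch for the next step */
    s->branch_target  = target;
    s->pc = next;
}
```

Starting at address 0 with a taken branch to 0x100, the model visits 0, then 4 (the delay slot), then 0x100.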
Instruction encoding
The MIPS VHDL code starts by decoding the instruction word, following
the instruction set encoding that is provided in the MIPS32 instruction set
reference manual (MIPS Technologies, 2003), in order to provide the data that
the ALU operates on. The result is stored in the register file.
Multi-cycle instructions
Most instructions are implemented in a straightforward manner; that is, they are
executed in one clock cycle. However, some instructions lie on a
critical path in the design and could affect the timing and performance.
Therefore, they should be implemented as multi-cycle instructions. Examples are
signed and unsigned multiplication and division instructions. Because division
instructions are resource-expensive and seldom used, the div instruction is
treated as an undefined instruction. This is, however, not
a problem, as it can be added as a custom instruction, a software function or a
multi-cycle instruction if it is needed.
The multiplication instruction is implemented by enabling the DSP blocks in the
synthesis tool. This takes full advantage of the device resources and
increases the performance by implementing the multiplication in
DSP blocks. Figure 19 illustrates the multiplication, which operates on extra
registers called HI and LO (a div instruction would use the HI and LO registers
too). This operation was achieved by using the constraints editor in ISE to
constrain the combinatorial paths between the instruction memory output and the HI
and LO register inputs as multi-cycle paths in hardware. Also, a
stall signal is used during multi-cycle execution in order to stall the
MIPS CPU. As a result, two clock cycles (or
more if needed) are taken when a multiplication instruction is
executed. To allow this, the PC and register file must be prevented from being
updated for one cycle (i.e. the CPU has to be stalled).
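The role of the HI and LO registers can be shown with a short C sketch: a 32-bit by 32-bit multiplication yields a 64-bit product, whose upper half goes to HI and lower half to LO. The type and function names are hypothetical, and the two-cycle timing and stall logic are not modelled.

```c
#include <stdint.h>

typedef struct {
    uint32_t hi;   /* upper 32 bits of the product */
    uint32_t lo;   /* lower 32 bits of the product */
} hilo_t;

/* Unsigned multiply: full 64-bit product split into HI and LO. */
hilo_t multu32(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)a * (uint64_t)b;
    hilo_t r = { (uint32_t)(p >> 32), (uint32_t)p };
    return r;
}

/* Signed multiply: operands are sign-extended to 64 bits first. */
hilo_t mult32(int32_t a, int32_t b)
{
    uint64_t p = (uint64_t)((int64_t)a * (int64_t)b);
    hilo_t r = { (uint32_t)(p >> 32), (uint32_t)p };
    return r;
}
```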
Figure 19 Datapath for the multiplication, allowing two clock cycles for execution.
(Fritzell, 2013).
Trap instructions
When an instruction is available, its result is returned to the register file.
However, when an instruction is not available but is defined as a custom
instruction, the MIPS CPU traps it so that it can be processed by the
trap handler.
4.2 Custom Instruction in Software
Support for custom instructions in software has been added by changing the GCC
cross compiler for the MIPS architecture. The encoding of the MIPS I instruction
set can be found in C code inside binutils, in the opcodes folder of the
compiler.
The mips-opc.c source file defines all the assembly instructions of the MIPS I
instruction set in addition to the range of UDIs (user defined instructions). The
format of the UDIs is similar to the format of the R-TYPE instructions that were
defined in section 2.2.3. The UDIs share the same opcode
and are distinguished by the function field, covering the instruction word range
from 0x70000010 to 0x7000001f. In total, 16 individual user instructions are
unused, so designers can add up to 16 additional instructions directly to a system.
In order to implement the custom instructions in software, only instruction
encodings from the UDI range should be used. By
exploiting the similarity in format with R-TYPE instructions, one of the user
defined instructions can be changed to the same format as an R-TYPE instruction
such as XOR with the following steps (Fritzell, 2013):
Copying the XOR instruction:
{"xor", "d,v,t", 0x00000026, 0xfc0007ff, WR_d|RD_s|RD_t, 0, I1 },
Choosing any UDI instruction, such as the following:
{"udi0", "s,t,d,+1", 0x70000010, 0xfc00003f, WR_d|RD_s|RD_t, 0, I33 },
Renaming the UDI instruction to CUSTOM and changing its
format to match that of XOR:
{"custom", "d,v,t", 0x70000010, 0xfc0007ff, WR_d|RD_s|RD_t, 0, I1 },
Recompiling the GCC cross-compiler with
the new custom instruction, as shown in Figure 20.
Figure 20 Adding Custom instruction in the compiler.
Then the following inline assembly will be used inside the C code, in order
to call the software implementation of the custom instruction.
__asm__ ("nop\n\t"
"custom %0, %1, %2\n\t"
:"=r" (z)
:"r" (x), "r" (y));
Note: x, y and z are the input operands and the result respectively.
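The match and mask values in the "custom" table entry determine how the instruction word is built and recognised. The C sketch below assumes the standard MIPS R-type field layout (rs in bits 25 to 21, rt in bits 20 to 16, rd in bits 15 to 11) and uses the two constants from the entry above; the function names are hypothetical and are not part of binutils.

```c
#include <stdbool.h>
#include <stdint.h>

#define CUSTOM_MATCH 0x70000010u  /* match value from the "custom" entry */
#define CUSTOM_MASK  0xfc0007ffu  /* mask: opcode, shamt and function fields */

/* Build the instruction word for "custom rd, rs, rt" (R-type layout assumed). */
uint32_t encode_custom(unsigned rd, unsigned rs, unsigned rt)
{
    return CUSTOM_MATCH
         | ((uint32_t)(rs & 31u) << 21)
         | ((uint32_t)(rt & 31u) << 16)
         | ((uint32_t)(rd & 31u) << 11);
}

/* A word is the custom instruction when its masked bits equal the match value. */
bool is_custom(uint32_t word)
{
    return (word & CUSTOM_MASK) == CUSTOM_MATCH;
}
```

For example, with rd = 2, rs = 4 and rt = 5 the sketch produces the word 0x70851010, and the same mask-and-compare test is what an instruction decoder would apply to detect the custom opcode.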
4.3 Configuration Controller Modules
I. Trap handler
The Trap handler is the module to handle the exception encountered by the MIPS
CPU. The MIPS CPU reads the instructions from the instruction ROM and then
decodes them. After this, it executes them. In the case that the instruction
received is not implemented in the MIPS CPU, an exception is generated. Then
the MIPS CPU requests the trap handler to handle the exception. The operation of
the trap handler is controlled by a state machine. Figure 21 shows the state
machine diagram for the trap handler.
[Figure 21 diagram: states ST0, ST1, ST2 and ST3 with transition labels Trap_start = 1, Trap_start = 0, Trap_start = 1 & Opcode = CUSTOM_ID, Count = 13, Count < 13, Custom_done = 1 and Custom_done = 0]
Figure 21 Trap Handler State Machine.
There are four states in the state machine. ST0 is the reset state and the system
is normally in this state. Here it waits for the trap start signal, which comes from
the MIPS CPU. When an exception occurs inside the MIPS CPU, it will send the
trap start signal to the trap handler. On the reception of this signal, the state
machine moves to either ST1 or ST2. If the requested Opcode is equal to the
currently loaded CUSTOM ID, then there is no need to load the partial bit file so
the state machine moves to ST2. In the other case, the state machine moves to
ST1, where it sends the command to the ICAP primitive to load the partial bit file
inside the custom module. The configuration process is typically thousands of
cycles so we use a counter in order to monitor the configuration reading signal
from ICAP before going to ST2. At ST2, the trap handler will send a start signal to
the custom module and in ST3, it will wait for it to complete the operation.
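The behaviour described above can be summarised as a small software model of the state machine. This is a sketch only, not the VHDL used in the project: the names trap_handler and trap_step are hypothetical, the threshold of 13 is taken from the labels of Figure 21, and the return from ST3 to ST0 as well as the point at which the loaded ID is updated are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ST0, ST1, ST2, ST3 } trap_state;

typedef struct {
    trap_state state;
    uint8_t    loaded_id;  /* opcode of the module currently in the PR region */
    unsigned   count;      /* counter used while waiting in ST1 */
} trap_handler;

#define CONFIG_COUNT 13u   /* threshold taken from the labels of Figure 21 */

/* Advance the model by one clock cycle and return the new state. */
static trap_state trap_step(trap_handler *h, bool trap_start,
                            uint8_t opcode, bool custom_done)
{
    assert(h != NULL);
    switch (h->state) {
    case ST0:                            /* idle: wait for a trap */
        if (trap_start) {
            if (opcode == h->loaded_id) {
                h->state = ST2;          /* module already loaded */
            } else {
                h->loaded_id = opcode;   /* assumption: ID updated when loading starts */
                h->count = 0;
                h->state = ST1;          /* command the ICAP to load the module */
            }
        }
        break;
    case ST1:                            /* wait until the counter reaches the threshold */
        if (h->count == CONFIG_COUNT)
            h->state = ST2;
        else
            h->count++;
        break;
    case ST2:                            /* send the start signal to the custom module */
        h->state = ST3;
        break;
    case ST3:                            /* wait for the custom module to finish */
        if (custom_done)
            h->state = ST0;              /* assumption: return to the reset state */
        break;
    }
    return h->state;
}
```

Calling trap_step once per simulated clock cycle reproduces the two paths from ST0: directly to ST2 when the requested opcode is already loaded, or through ST1 when a new module has to be configured first.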
Each custom module is assigned a unique opcode and an address, which are
given in Table 4.
Table 4 Custom instructions’ address and ID
II. ICAP primitive
There are multiple ways to use the ICAP controller. If a UART connection to
a host machine is used, the ICAP controller depends on the user to initiate
a UART transaction in order to reconfigure the FPGA. A more automatic way is
to store the configurations in SPI flash and fetch them from there. In
this project, the latter option is used. An ICAP primitive is instantiated inside the
trap handler to load the configuration files so that the reconfigurable
region is reprogrammed with the desired logic. As described by Xilinx Inc
“Spartan-6 FPGAs have dedicated MultiBoot logic, which is used for
both fallback and MultiBoot (IPROG) reconfiguration. When fallback
or IPROG happens, an internally generated pulse resets the entire
configuration logic, except for the dedicated MultiBoot logic. The
IPROG (internal PROGRAM_B) command can be sent through
ICAP_SPARTAN6 or the bitstream” (2015).
Custom Module Name      Opcode    Address
CRC-32                  010000    X"100000"
Ones Counter            100001    X"200000"
Parity                  010001    X"300000"
Leading Zero Counter    100000    X"400000"
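The mapping in Table 4 can be expressed as a small lookup table. The sketch below is illustrative only: the opcode and address values are taken from Table 4, the type custom_module and the function find_module are hypothetical names, and reading the address as the start address of the module's bitstream in flash is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint8_t     opcode;   /* 6-bit value from Table 4 */
    uint32_t    address;  /* 24-bit value from Table 4 (assumed bitstream start address) */
} custom_module;

static const custom_module MODULES[] = {
    { "CRC-32",               0x10 /* 010000 */, 0x100000 },
    { "Ones Counter",         0x21 /* 100001 */, 0x200000 },
    { "Parity",               0x11 /* 010001 */, 0x300000 },
    { "Leading Zero Counter", 0x20 /* 100000 */, 0x400000 },
};

/* Return the table entry for an opcode, or NULL if no module uses it. */
static const custom_module *find_module(uint8_t opcode)
{
    assert(opcode < 64);  /* opcodes are 6 bits wide */
    for (size_t i = 0; i < sizeof MODULES / sizeof MODULES[0]; i++) {
        if (MODULES[i].opcode == opcode)
            return &MODULES[i];
    }
    return NULL;
}
```

For example, find_module(0x10) returns the CRC-32 entry with address 0x100000, while an opcode that is not in the table yields NULL.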
Table 5 An example of bitstream for the IPROG command using ICAP (Xilinx Inc, 2015).
The command sequence illustrated in Table 5 is described in detail
in the Spartan-6 FPGA Configuration User Guide (Xilinx Inc, 2015). After the IPROG
command is sent to the configuration logic, the FPGA resets everything except
the dedicated reconfiguration logic. The bitstream stored at the given start
address is then loaded. Thus, the static region is not affected by this operation.
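As an illustration of how such a command sequence could be assembled in software, the sketch below fills an array with the 16-bit words of an IPROG sequence. The word values follow the example sequence in the Spartan-6 FPGA Configuration User Guide as best recalled here and should be checked against Table 5 before use. The function name build_iprog and its parameters are hypothetical, and the SPI read opcode is left as a parameter because the value used in the project is not stated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IPROG_WORDS 14

/* Hypothetical helper: fill out[] with an IPROG sequence for ICAP_SPARTAN6.
 * Assumption: word values follow the example in the Spartan-6 configuration
 * user guide; verify them against Table 5 before relying on this sketch. */
static void build_iprog(uint16_t out[IPROG_WORDS], uint32_t boot_addr,
                        uint32_t fallback_addr, uint8_t spi_read_opcode)
{
    assert(out != NULL);
    assert(boot_addr <= 0xFFFFFFu && fallback_addr <= 0xFFFFFFu); /* 24-bit addresses */

    size_t i = 0;
    out[i++] = 0xFFFF;                                  /* dummy word */
    out[i++] = 0xAA99;                                  /* sync word */
    out[i++] = 0x5566;                                  /* sync word */
    out[i++] = 0x3261;                                  /* write 1 word to GENERAL1 */
    out[i++] = (uint16_t)(boot_addr & 0xFFFFu);         /* start address [15:0] */
    out[i++] = 0x3281;                                  /* write 1 word to GENERAL2 */
    out[i++] = (uint16_t)(((uint32_t)spi_read_opcode << 8) | (boot_addr >> 16));
    out[i++] = 0x32A1;                                  /* write 1 word to GENERAL3 */
    out[i++] = (uint16_t)(fallback_addr & 0xFFFFu);     /* fallback address [15:0] */
    out[i++] = 0x32C1;                                  /* write 1 word to GENERAL4 */
    out[i++] = (uint16_t)(((uint32_t)spi_read_opcode << 8) | (fallback_addr >> 16));
    out[i++] = 0x30A1;                                  /* write 1 word to CMD */
    out[i++] = 0x000E;                                  /* IPROG command */
    out[i++] = 0x2000;                                  /* NOP */
    assert(i == IPROG_WORDS);
}
```

The 24-bit start address is split into a low 16-bit word and a high byte that shares its word with the flash read opcode, so switching to a different custom module only changes the address-dependent words of the sequence.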
4.4 Custom Instruction in Hardware
Figure 22 Custom Instruction (CI) act as extension of the ALU
[Figure 22 block labels: Instruction, OP_A, OP_B, CI, ALU, ALU_out, RES.]
Figure 22 shows the MIPS CPU with custom instructions as extensions to the
original ALU. A custom instruction takes one or two 32-bit input operands and
computes one 32-bit result. Adding custom instructions to the system can speed up the
execution of an application, as mentioned above. Run-time reconfigurable
accelerator modules in a PR region, with a proxy logic approach for the
communication, have been implemented using the GoAhead tools.
Figure 23 illustrates the communication between the static system and the partial
reconfiguration module. Proxy logic is used as the connection primitive; it is
simply a look-up table in route-through mode. It acts as a placeholder
for the part of the system that does not exist yet; that is, it replaces the partial module when
implementing the static system and it replaces the static system when
implementing a reconfigurable custom instruction accelerator. The same wires are
used for the communication between the static system and the reconfigurable
area.
Figure 23 also shows that the different custom instruction modules use different
logic, but have exactly the same interface to the CPU (including the routing).
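The idea of different modules behind one fixed interface can be mirrored in software with a function pointer. The sketch below is an analogy only and not the hardware implementation: the type custom_fn, the variable pr_region and the three functions are hypothetical, and it is assumed that these modules use only the first operand and that the parity module returns the XOR of all input bits.

```c
#include <assert.h>
#include <stdint.h>

/* Common interface: two 32-bit operands in, one 32-bit result out. */
typedef uint32_t (*custom_fn)(uint32_t op_a, uint32_t op_b);

/* Assumption: these modules only use OP_A; OP_B is ignored. */
static uint32_t ones_counter(uint32_t op_a, uint32_t op_b)
{
    (void)op_b;
    uint32_t count = 0;
    for (; op_a != 0; op_a >>= 1)
        count += op_a & 1u;          /* add the lowest bit, then shift it out */
    return count;
}

static uint32_t parity(uint32_t op_a, uint32_t op_b)
{
    return ones_counter(op_a, op_b) & 1u;   /* 1 if the number of set bits is odd */
}

static uint32_t leading_zero_counter(uint32_t op_a, uint32_t op_b)
{
    (void)op_b;
    uint32_t zeros = 0;
    for (uint32_t mask = 0x80000000u; mask != 0 && (op_a & mask) == 0; mask >>= 1)
        zeros++;                     /* count zero bits from the top down */
    return zeros;
}

/* Stands in for the PR region: holds whichever module is currently "loaded". */
static custom_fn pr_region = parity;

/* The caller never changes: only the module behind the interface does. */
static uint32_t run_custom(uint32_t op_a, uint32_t op_b)
{
    assert(pr_region != 0);
    return pr_region(op_a, op_b);
}
```

Assigning a different function to pr_region plays the role of reconfiguration: run_custom keeps the same signature, just as the CPU keeps the same wires to the PR region.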
Figure 23 On-FPGA Communication for Custom Instructions.
[Figure 23 block labels. Static part: CPU with OP_A from CPU, OP_B from CPU and RES to CPU. Proxy logic between the static part and the partial reconfiguration region. Partial reconfiguration: four alternative Custom Instruction modules, each with OP_A to CI, OP_B to CI and RES from CI.]
Static System Implementation
A screenshot of the static system is shown in Figure 24. It shows the operand
signals (OP_A, OP_B) on the left side, and the result signal is collected on the right.
Each connection primitive carries four wires between the static part of the
system and the PR region. Consequently, each of the 32-bit interface signals
(OP_A, OP_B and RES) requires 8 connection primitives (32 / 4 = 8).
Figure 24 Static implementation
Reconfigurable Instructions
The reconfigurable modules are implemented in the absence of the static system,
as can be seen in the screenshot in Figure 25. For the partial module
implementation, the same connection primitives are used, but now the other
side, which is not connected yet, is wired up: OP_A to CI, OP_B to CI and RES
from CI. Figure 25 shows the CRC module connected by the proxy logic at the
point where the static design ends. The custom instruction wrapper has been
auto-generated by the GoAhead tool.
As the result output is not connected to the outside world (i.e. the path ends
at the connection primitives), the FPGA tools would typically remove all logic and
routing leading to the output primitive, which would eventually result in an
empty design. To overcome this, all interface signals were given a keep
attribute (which is specific to the Xilinx vendor tools).
Figure 25 Partial part: the example shows the implementation of the CRC instruction.
Using the GoAhead tool
GoAhead provides a GUI as well as a scripting interface. A screenshot of the tool
is shown in Figure 26. The GUI is typically used to create scripts. The script
then generates all the constraints that are needed in the system. The generated
constraints for this system serve two important purposes. The first is to
prevent the use of resources in the PR region: routing is blocked inside the
PR region and no logic primitives are placed there. The second is
to create placement constraints for the connection primitives.
The following steps are used for both implementations (static and partial) with
GoAhead, as illustrated in the screenshot in Figure 26:
1. Load the device description.
2. Define the region in GoAhead. Selecting the elements between 72 and
79 gives exactly 8 elements, which provide 8 routes.
3. Place connection macros inside the PR region using the macro placer in
GoAhead.
4. Create the connection primitives in this area. With 4 input wires per
connection primitive, this creates an area of 8 tiles (i.e. CLBs).
5. Block all routing inside the PR region, except the operand and
result vectors. The blocker is then exported to XDL, which is a Xilinx
specific netlist format that is not further investigated in this project.
6. Instantiate the connection macros as in Figure 27. The name of the
primitive is "OP_A connect" and it has the input "OP_A from CPU" as the
VHDL name.
7. Update the constraint file for the design (UCF file) with the placement
constraints for the PR region generated by GoAhead.
In order to generate the bitstream, the static and the partial implementations
must be merged. This can be done by copying the text descriptions of the XDL
netlists and merging them together.
Figure 26 GoAhead GUI: the graphical user interface of GoAhead.
Figure 27 GoAhead Script.
4.5 Challenges during Implementation
– The three-month duration of this project was a major challenge, as was
working through different phases and tools and spending a couple of weeks
learning each one.
– Setting up the GCC cross-compiler for MIPS on Windows is not a
straightforward task and takes time.
– The Nexys3 platform that hosts the system does not have external interfaces
such as audio and video, which limits the usability of this
device. Moreover, testing the system was difficult because of the high clock
speed: GPIO and UART are very slow by comparison.
– Because the Nexys3 SPI model is not clear and is not covered in the documentation,
testing the reconfiguration was difficult. I spent a couple of days
trying to run the code on that board, but later had to move to a new board
and change the IO configuration in order to run the code there.
– The MultiBoot feature with partial reconfiguration. This is a new approach that
had not been implemented before; I had to go through the literature and
spent a couple of weeks learning this feature.
– When implementing partial reconfiguration, each design has different names
for the primitives, and for that reason this part is not completed.
Chapter 5
5 Testing, Results and Evaluation
This chapter presents the simulation and testing of the system, the results and their evaluation.
5.1 Testing
The whole system is simulated using a test bench in the Xilinx ISE package. Figure
28 shows the functionality of the MIPS CPU: reading the instructions from ROM
and decoding them, incrementing the address in the program counter and
executing the branch delay.
Figure 28 Test bench of the MIPS CPU and ROM. Panels a), b) and c)
present one test bench with different signals: a) instruction
encoding, decoding and ALU functionality; b) program counter functionality; c)
branch delay and ROM functionality.
ModelSim Simulation of the custom instruction modules
Simulations of the custom instruction modules were created and the
functionality of each module was verified. Figure 29 shows
the simulation results of the CRC-32 module. It can be seen that when
crc_en is high, the CRC-32 is generated and output on the crc_out bus.
Figure 29 ModelSim Simulation of CRC-32 Module.
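When checking simulation output of this kind, a software reference value is useful for comparison. The following is a minimal bitwise CRC-32 sketch under the assumption that the module implements the common IEEE 802.3 CRC-32 (reflected polynomial 0xEDB88320, initial value and final XOR of all ones); the polynomial and bit ordering of the actual hardware module are not stated here, so this may need adjusting. The function name crc32_update is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 reference model.
 * Assumption: IEEE 802.3 parameters (reflected polynomial 0xEDB88320,
 * initial value 0xFFFFFFFF, final XOR 0xFFFFFFFF). Pass crc = 0 to start. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    crc = ~crc;                                  /* undo the final XOR of the previous call */
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];                          /* fold the next byte into the low bits */
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;  /* shift and apply the polynomial */
            else
                crc >>= 1;                       /* shift only */
        }
    }
    return ~crc;                                 /* final XOR */
}
```

With these parameters the nine ASCII characters "123456789" give the widely published check value 0xCBF43926, which offers a quick sanity check of the model before comparing it with crc_out.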
Figure 30 shows the simulation results of the ones counter module. It can be
seen that the data on the data_in bus changes at clock intervals
and the corresponding output is generated on the output bus.
Figure 30 ModelSim simulation of Ones Counter Module.
Figure 31 shows the simulation output of the parity generation module. It
can be seen that the data on the data_in bus changes at clock intervals
and, as a result, the output is generated on the output bus.
Figure 31 ModelSim Simulation of Parity generation module.
Figure 32 shows the simulation results of the leading zero counter module. It
can be seen that the data on the data_in bus changes at clock
intervals and, as a result, the output is generated on the output bus.
Figure 32 ModelSim simulation of Leading Zero Counter Module.
ModelSim Simulation of the Trap handler
Two simulations were performed for the trap handler. The first
simulation is for the MUX-based trap handler and the second is for the
ICAP-based trap handler. Figure 33 shows the simulation output of the MUX-based
trap handler. It shows how this module behaves when the opcode and
data are changed on the input.
Figure 33 ModelSim Simulation of Mux Based TrapHandler.
Figure 34 shows the simulation of the ICAP-based trap handler module. It can be seen
that the state machine starts moving after the trap_start signal, then sends a
command to the ICAP primitive and, when that is complete, starts the custom
module.
Figure 34 ModelSim simulation of ICAP based Trap Handler.
Software and Testing
The following C code is compiled for MIPS using the cross-compiler, and the
resulting machine code is loaded into the ROM.
/* Read the switches and write the corresponding pattern to the LEDs. */
#define LEDS_BASE_ADDERSS  0x10001000
#define SWS_BASE_ADDERSS   0x10000010
#define RESET_BASE_ADDRESS 0xBFC00000
int main()
{
    int temp = 0;
    /* Memory-mapped I/O is declared volatile so accesses are not optimised away. */
    volatile int *RED_LED  = (volatile int *)LEDS_BASE_ADDERSS;
    volatile int *SWITCHES = (volatile int *)SWS_BASE_ADDERSS;
    while (1) {
        temp = *SWITCHES;
        /* Switch values 1..8 select one LED; the pattern is inverted before writing. */
        if (temp == 8)
            *RED_LED = ~0x80;
        else if (temp == 7)
            *RED_LED = ~0x40;
        else if (temp == 6)
            *RED_LED = ~0x20;
        else if (temp == 5)
            *RED_LED = ~0x10;
        else if (temp == 4)
            *RED_LED = ~0x08;
        else if (temp == 3)
            *RED_LED = ~0x04;
        else if (temp == 2)
            *RED_LED = ~0x02;
        else if (temp == 1)
            *RED_LED = ~0x01;
        else
            *RED_LED = ~0x00;
    }
    return 0;
}
Test reconfigurable modules:
The reconfiguration process is tested by selectively uploading the
configuration bitstreams for the four custom instructions and the difference
bitstream into the SPI storage using iMPACT, as illustrated in Tapp (2010). The
Xilinx documentation (Configuring Xilinx FPGAs with SPI Serial Flash) shows
the steps in detail. Each configuration bitstream is assigned to a specific region.
In this project, the Nexys3 development board should have been used for the test, as it is
the host of our system. However, because the Nexys3 SPI model port was difficult
to read and not clear in its documentation, the Atlys development board was used for
this test.
Because the Nexys3 board does not have external interfaces such as audio
and video, the only input and output for this system is the GPIO module;
a UART module was not considered. Testing using only the
GPIO module was not convenient due to the high clock speed. One way to do the
test is to implement ICAP_SPARTAN6 with the MultiBoot feature in a
simple system, such as basic logic gates (AND, XOR, NOR, etc.) that
read the switches as inputs and display the output on the LEDs, so that
the differences can be seen when the module is changed to another logic module.
5.2 Results
The cost of the system resources for the first approach and the cost of the system
resources for the final system approach are outlined in Table 6.
Approach                   Nr. of LUTs   Nr. of Slices   Latency
MUX based trap handler     246           1798            20.011 ns
ICAP based trap handler    438           1370            18.125 ns
Table 6 Resource requirements for Configuration controller.
The results reveal that the system with the MUX-based trap handler uses
fewer look-up tables than the ICAP-based trap handler due to the simpler datapath
in the ICAP variant. However, the MUX-based trap handler system uses more
slice resources than the ICAP-based trap handler system
because it uses more logic for the custom modules. Finally, the latency
is higher for the system based on the MUX trap handler because
of the trap overhead. In the ICAP system, only one custom instruction
is configured at a time, and reconfiguration is needed only when the
configured custom instruction is not the requested one. Note that the delay in this
table is for the whole implementation without considering the reconfiguration
overhead. Only by introducing the ICAP-based trap handler were we able to run
the system at the target 50 MHz clock frequency.
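The link between the latency figures in Table 6 and the 50 MHz target can be checked with a short calculation. The sketch below assumes that the reported latency is the critical-path delay that bounds the clock period; the function names fmax_mhz and meets_clock are hypothetical.

```c
#include <assert.h>

/* Maximum clock frequency in MHz for a given critical-path delay in ns.
 * Assumption: the latency column of Table 6 is the critical-path delay. */
static double fmax_mhz(double latency_ns)
{
    assert(latency_ns > 0.0);
    return 1000.0 / latency_ns;   /* 1 / (ns) = GHz, so scale by 1000 to get MHz */
}

/* Returns 1 if a design with this delay can run at the requested clock. */
static int meets_clock(double latency_ns, double clock_mhz)
{
    return fmax_mhz(latency_ns) >= clock_mhz;
}
```

Under this assumption, 18.125 ns corresponds to roughly 55.2 MHz, which is above the 50 MHz target, while 20.011 ns corresponds to just under 50 MHz, since a 50 MHz clock has a period of exactly 20 ns.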
For the custom modules, the following table shows the cost of the resources.
Custom Module   Nr. of LUTs   Nr. of Slices   Latency (max/av) ns   Bitstream size (KB)
CRC 32          43            18              8.038/3.597           282
Counting One    39            19              15.717/14.35          263
Leading Zero    19            15              9.723/3.597           293
Parity (XOR)    7             6               3.618/3.597           282
Table 7 Resource requirements for Custom modules.
The results in Table 7 show the implementation costs of the custom instructions.
For the progress report, manual code optimization was performed in order to see
whether the tools would find the same optimizations by themselves; the results
showed that they do not. This was taken into account when implementing the custom
modules. The results in Table 7 therefore reflect the improved use of
resources, delay and bitstream size for each custom module after manually
optimising each module.
5.3 Evaluation
System performance
The whole system, including the configuration controller, can run at a system clock
of 50 MHz. The first of the two biggest limiting factors is that the MIPS CPU takes
a trap when a custom instruction exception occurs, and traps add a small
overhead which would not occur in a baseline MIPS implementation. The
second factor is that the trap handler acts as the configuration controller, which
uses external flash memory.
For partial reconfiguration, one important benchmark is the response time that has
to be considered for the reconfiguration process. Swapping instructions will
obviously take a significant amount of time for loading the corresponding partial
bitstream from an external SPI memory to the device. Moreover, the bitstream size
would affect the speed of the configuration module.
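The dependence of the loading time on the bitstream size can be made concrete with a rough lower-bound estimate. The sketch below is illustrative only: the function name load_time_ms is hypothetical, the configuration clock and data width are parameters rather than measured values from this project, 1 KB is assumed to be 1024 bytes, and any command or start-up overhead is ignored.

```c
#include <assert.h>

/* Rough lower bound on the time to stream a bitstream from serial flash.
 * Assumptions: 1 KB = 1024 bytes, one transfer of data_width_bits per clock
 * cycle, no command or start-up overhead. */
static double load_time_ms(double size_kb, double clock_mhz, unsigned data_width_bits)
{
    assert(size_kb >= 0.0 && clock_mhz > 0.0 && data_width_bits > 0);
    double bits = size_kb * 1024.0 * 8.0;                        /* total bits to transfer */
    double bits_per_ms = clock_mhz * 1000.0 * data_width_bits;   /* 1 MHz = 1000 cycles per ms */
    return bits / bits_per_ms;
}
```

For instance, with a hypothetical 50 MHz clock and a 1-bit interface, a 282 KB bitstream would need about 46 ms, and the 263 KB and 293 KB bitstreams scale proportionally. A wider interface or a faster configuration clock reduces the time accordingly.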
In Fritzell (2013), the configuration controller for module relocation was designed
to use two clocks, one running at 50 MHz for the part connected to
the bus and the other running at 100 MHz for the part that handles the
configuration process. In our system, the trap handler runs at 50 MHz, which
could slow down the configuration speed. Moreover, in Fritzell (2013) a
decompression module is used to decompress the configuration data on the
FPGA for faster reconfiguration. Our reconfiguration is therefore expected to
be slower than what is achieved in that work.
However, there are some techniques that could be applied to optimize the
performance and cost in the system on the FPGA device. In this project, we used
the FPGA MultiBoot feature that is slow, but that uses a serial configuration
memory chip that is underutilized in most FPGA prototyping systems. This also
separates the configuration bitstream storage from other memory which improves
the security of the system.
Performance Enhancing Techniques
Generally speaking, performance techniques can be divided into techniques that
are not FPGA specific, such as compiler and memory usage optimizations, and
techniques that are FPGA specific, such as increasing the operating frequency. As
a rule of thumb, since optimizing configuration speed is a typical goal, an entire
program should rarely be targeted at external memory (Fletcher, 2005). If it is, then
the use of another clock should be considered in order to handle the process
faster.
Comparing the system to a real-world system:
Processor      Processor type   Device family used   Speed achieved (MHz)
PowerPC™ 405   hard             Virtex-4             450
MicroBlaze     soft             Virtex-II Pro        150
MicroBlaze     soft             Spartan-3            85
MIPS           soft             Spartan-6            50
Table 8 Comparison between Xilinx embedded processors and our soft-core, and their
performance.
Table 8 summarizes the available embedded processors with the manufacturer's
quoted maximum frequency, together with our soft-core, including the extension,
and its maximum frequency. Despite the MIPS processor being the slowest in
that table, it might outperform the others due to the use of custom instructions.
Hardware acceleration
A soft-core on the FPGA will allow the designer to make a trade-off between
hardware and software in order to maximize efficiency and performance. If there is
a software function identified as a software bottleneck, then a custom module can
be designed for this function in the FPGA. The device will then act as a co-
processor or, as in our case, as a custom instruction extension to the soft-core
processor.
One way to evaluate custom instructions implemented in hardware is to
compare them against software implementations of the same functions running on the
standard ISA of the MIPS CPU. The software functions that are used as a
reference can be found in Andersen (2005). For the software evaluation, those four
functions, which are written in C, are compiled for MIPS using a GCC
cross-compiler. The code is then disassembled in order to count how
many instructions each function consumes. Table 9 shows how many CPU
instructions are saved by using a custom instruction.
Software function   Instructions
CRC                 262
Hamming weight      262
Leading Zero        294
Parity (XOR)        263
Table 9 Software requirements.
Chapter 6
6 Conclusions and Future Work
6.1 Conclusions
The system was improved through the lifecycle presented in the methodology.
The final system, after all improvements had been made, meets the objectives
outlined in the introduction chapter. Moreover, learning the concepts and the
fundamental features of FPGAs step by step is the biggest achievement. The
previous chapters described those concepts in detail, together with the necessary components
and tools and the implementation of a fully functional PR system. The
run-time reconfigurable custom instruction set extension of the MIPS CPU can be
replaced dynamically in the system. The most important parts of the implemented system are:
1. MIPS CPU.
2. Trap handler, including the ICAP primitive.
3. The exploitation of the MultiBoot feature for the full and partial
reconfiguration.
6.2 Future Work
There are some improvements that can be made to the final implemented system;
together these could be considered as the requirements analysis stage for the
next lifecycle.
– In this project the Nexys3 has been used as a platform. However, the lack of
external interfaces limited the usability of this device. Using
another academic board which includes audio and video could show
the input and the output of the system and would allow a complete digital
system to be built around the soft-core processor.
– The MIPS CPU that is used as the soft-core is a very simple processor: it is
non-pipelined and uses BRAM as both program memory and data memory.
This could be improved by implementing a pipelined processor and by
implementing a simple cache controller that could be connected to DDR
memory. As a result, executing larger programs and storing large
data structures such as frame buffers would become possible.
– The system uses the MultiBoot feature and the command sequence that is
sent through the ICAP primitive to support the read-back of configuration
data from ICAP. However, there are two different ways of reading and
writing the configuration data through ICAP. As stated in Fritzell (2013),
“Either clock is left toggling and clock enable is used to control throughput,
or clock enable is kept high and the clock signal is controlled to achieve
wanted throughput” when implementing the ICAP interface.
– Adding more advanced modules for communication over the COM port.
– Measuring the number of clock cycles taken by the reconfiguration by logging a
counter in the trap handler, reflecting the number of clock cycles
from the time the counter starts until it is stopped.
– The Nexys3 board has a seven-segment display; it could be
exploited for testing.
– Different benchmarks could be used to evaluate the soft-core on the FPGA.
The most standard benchmark is Dhrystone MIPS (DMIPS), and the result
from this could then be compared with the results we achieved with our
system.
Works Cited
Andersen, S. E., 2005. Bit Twiddling Hacks. [Online]
Available at:
http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
[Accessed 31 August 2015].
Beckhoff, C., Koch, D. & Torresen, J., 2012. Go ahead: A partial reconfiguration
framework. Field-Programmable Custom Computing Machines (FCCM), 2012
IEEE 20th Annual International Symposium, pp. 37-44.
Bobda, C., 2007. Introduction to Reconfigurable Computing: Architectures,
Algorithms, and Applications. s.l.:Springer.
Bobda, C., 2008. Introduction to Reconfigurable Computing. Netherlands:
Springer .
Digilent, 2013. Nexys3™ Board Reference Manual. [Online]
Available at: https://www.digilentinc.com/Data/Products/NEXYS3/Nexys3_rm.pdf
Doulos.com, 2015. Simple Ram Model. [Online]
Available at:
https://www.doulos.com/knowhow/vhdl_designers_guide/models/simple_ram_model/
[Accessed 7 August 2015].
Elkateeb, A., 2011. A Processor Design Course Project: Creating Soft-Core MIPS
Processor Using Step-by-Step Components' Integration Approach. International
Journal of Information and Education Technology, 1(5), pp. 432-440.
Fletcher, B., 2005. FPGA Embedded Processors Revealing True System
Performance. In: Embedded Training Program Embedded Systems Conference.
[Online]
Available at:
http://www.xilinx.com/products/design_resources/proc_central/resource/ETP-367paper.pdf
[Accessed 14 August 2015].
Fritzell, A., 2013. A System for Fast Dynamic Partial Reconfiguration using
GoAhead Design and Implementation. Masters Thesis: University of Oslo.
Galuzzi, C. & Bertels, K., 2011. The Instruction-Set Extension Problem: A Survey.
ACM Transactions on Reconfigurable Technology and Systems. article 18, 4(2).
Gebotys, C. H., 2012. A network flow approach to memory bandwidth utilization in
embedded DSP core processors. IEEE Transactions On Very Large Scale
Integration (Vlsi) Systems, 10(4), pp. 390-398.
Hansen, S. G., Koch, D. & Torresen, J., 2011. High speed partial runtime
reconfiguration using enhanced icap hard macro. In: Parallel and Distributed
Processing Workshops and Icap hard macro. Shanghai: IEEE, pp. 174-180.
Hauck, S., 1998. Configuration prefetch for single context reconfigurable
coprocessors. In: Proceedings of the 1998 ACM/SIGDA sixth international
symposium on Field programmable gate arrays. New York: ACM, pp. 65-74.
Hauck, S. & Wilson, W. D., 1999. Run Length Compression Techniques for FPGA
Configurations. Napa Valley, IEEE.
Jo, J., 2013. 6 Basic Phases of Software Development Life Cycle (SDLC). [Online]
Available at: http://www.techknol.net/2013/04/software-development-life-cycle.html
[Accessed 15 August 2015].
Koch, D., 2013. Partial Reconfiguration on FPGAs: Architectures, Tools and
Applications. New York: Springer.
Koch, D., Beckhoff, C. & Torresen, J., 2010. Zero logic overhead integration of
partially reconfigurable modules. Proceedings of the 23rd symposium on
Integrated circuits and system design, pp. 103-108.
Kozyrakis, C. E. & Patterson, D. A., 2004. Scalable, vector processors for
embedded systems. Micro, IEEE, 23(6), pp. 36-45.
Kuon, I. & Rose, J., 2007. Measuring the Gap Between FPGAs and ASICs.. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems,
26(2), pp. 203-215.
Lysaght, P. & Subrahmanyam, P. A., 2005. Guest Editors’ Introduction: Advances
in Configurable Computing. IEEE Design & Test of Computers, 22(2), pp. 85-89.
Miller, J., 2004. The Chicago guide to writing about numbers. Chicago: University
of Chicago Press.
Minev, P. B. & Kukenska, V. S., 2007. Implementation of Soft-core Processors in
FPGAs. Gabrovo, International Scientific Conference.
MIPS Technologies, 2003. MIPS32™ Architecture For Programmers Volume II:
The MIPS32™ Instruction Set. [Online]
Available at: http://www.cs.cornell.edu/courses/cs3410/2008fa/mips_vol2.pdf
[Accessed 3 August 2015].
OutputLogic.com, 2013. OutputLogic.com. [Online]
Available at: http://outputlogic.com/
[Accessed 30 August 2015].
Pittman, R. N., Lynch, N. L. & Forin, A., 2006. eMIPS, A Dynamically Extensible
Processor, Redmond: Microsoft Research.
Synopsys, 2010. SiliconBlue Selects Synopsys as FPGA Synthesis Partner for Its
iCE65 mobileFPGA Family. [Online]
Available at: http://news.synopsys.com/index.php?s=20295&item=123144
[Accessed 30 March 2015].
Tapp, S., 2010. Configuring Xilinx FPGAs with SPI Serial Flash. 1st ed. [ebook]
Xilinx.Inc.. [Online]
Available at:
http://www.xilinx.com/support/documentation/application_notes/xapp951.pdf
[Accessed 1 September 2015].
Wold, A., Koch, D. & Torresen, J., 2012. Design techniques for increasing
performance and resource utilization of reconfigurable soft CPUs. s.l., IEEE, pp.
50-55.
Xilinx Inc, 2011. Spartan-6 FPGA Block RAM Resources User Guide. [Online]
Available at: http://www.xilinx.com/support/documentation/user_guides/ug383.pdf
[Accessed 1 August 2015].
Xilinx Inc, 2015. Spartan-6 FPGA Configuration User Guide. [Online]
Available at: http://www.xilinx.com/support/documentation/user_guides/ug380.pdf
[Accessed 11 August 2015].
Xilinx, 2012. Partial Configuration User Guide. [Online]
Available at:
http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/ug702.pdf
[Accessed 1 August 2015].
Xilinx, 2013. ISE Design Suite. [Online]
Available at: http://www.xilinx.com/products/design-tools/ise-design-suite.html
[Accessed 1 May 2015].
Yiannacouras, P., Steffan, J. G. & Rose, J., 2006. Application-Specific
Customization of Soft Processor Microarchitecture. Proceedings of the 2006
ACM/SIGDA 14th international symposium on Field programmable gate arrays,
pp. 201-210.
Appendix A - MIPS CPU
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use std.textio.all;
entity MIPS_CPU is
port (
clk : in std_logic;
reset : in std_logic;
WaitRequest : in std_logic;
D_write_en : out std_logic;
D_read_en : out std_logic;
I_ADR : out std_logic_vector (31 downto 0);
I_DATA : in std_logic_vector (31 downto 0);
D_ADR : out std_logic_vector (31 downto 0);
D_W_DATA : out std_logic_vector (31 downto 0);
D_R_DATA : in std_logic_vector (31 downto 0);
RES_0 : in std_logic_vector(31 downto 0);
opCode : out std_logic_vector(5 downto 0);
OP_A_c : out std_logic_vector(31 downto 0);
OP_B_c : out std_logic_vector(31 downto 0);
trap_start : out std_logic;
OP_A : out std_logic_vector(31 downto 0);
OP_B : out std_logic_vector(31 downto 0));
end MIPS_CPU;
architecture a_MIPS_CPU of MIPS_CPU is
type Instruction_type_type is (Undefined,R_type, ADDI, ADDIU, SLTI,
SLTIU, ANDI, ORI, XORI, LUI, J, BNE, BEQ, load, store, JAL, BRANCHES,
BGTZ, BLEZ, I_type_special_2);
type PC_type is (Normal, branchs,Jumbs);
signal Instruction_type : Instruction_type_type;
signal PCstate : PC_type;
signal local_D_write_en : std_logic;
signal rs, rt, rd, sa : std_logic_vector(4 downto 0);
signal W_ADR : std_logic_vector(4 downto 0);
signal R_DATA_A, R_DATA_B : std_logic_vector(31 downto 0);
signal ALU_out : std_logic_vector(31 downto 0);
signal ALU_out64 : std_logic_vector(63 downto 0);
signal PC, PC4, nextPC, branchPC,jumbpc : std_logic_vector(31 downto 0);
signal RegFile_en : std_logic;
signal instr : std_logic_vector(5 downto 0);
signal funct : std_logic_vector(5 downto 0);
signal immediate, SL2immediate : std_logic_vector(31 downto 0);
signal immediateU : std_logic_vector(31 downto 0);
signal immediateJ : std_logic_vector(27 downto 0);
signal BranchTaken, branching : std_logic;
signal JumpTaken, JumpTakenJR : std_logic;
signal idata : std_logic_vector(31 downto 0);
signal HI, LO : std_logic_vector(31 downto 0);
signal WaitRequest_i : std_logic;
signal WaitRequest_comb : std_logic;
signal mul_wait : std_logic;
signal MTHI, MTLO : std_logic;
signal mul_taken : std_logic;
type memtype is array (31 downto 0) of std_logic_vector(31 downto 0);
signal RegFile : memtype := (others => (others => '0'));
begin
OP_A_c <= R_DATA_A;
OP_B_c <= R_DATA_B;
opCode <= funct;
--------------------------------------
--------REGISTER_FILE ---------------
-------------------------------------
p_write : process (clk)
begin
if clk'event and clk = '1' then
if WaitRequest_comb = '1' then
if RegFile_en = '1' and (W_ADR /= (W_ADR'range => '0')) then
if (Instruction_type = load) then
RegFile(to_integer(unsigned(W_ADR))) <= D_R_DATA;
else
RegFile(to_integer(unsigned(W_ADR))) <= ALU_out;
end if;
end if; -- RegFileEnable
end if; --Waitrequest
end if; --clk
end process;
R_DATA_A <= RegFile(to_integer(unsigned(rs)));
R_DATA_B <= RegFile(to_integer(unsigned(rt)));
OP_A <= R_DATA_A;
OP_B <= R_DATA_B;
----------------------------------------------
------------INSTRUCTION_DECODER----------------
-----------------------------------------------
--setting idata to correct signals:
idata <= I_DATA;
funct <= idata(5 downto 0);
instr <= idata(31 downto 26);
rs <= idata(25 downto 21);
rd <= idata(15 downto 11);
rt <= idata(20 downto 16);
sa <= idata(10 downto 6);
--Immediate sign extended:
immediate(31 downto 16) <= (others => idata(15));
immediate(15 downto 0) <= idata(15 downto 0);
--Immediate unsigned:
immediateU <= x"0000" & idata(15 downto 0);
--Jump offset:
immediateJ <= idata(25 downto 0) & "00";
--Immediate sign extended and leftshift 2:
SL2immediate <= immediate(29 downto 0) & "00";
--Decoding instructions:
p_INS_DECODER : process (instr)
begin
----------R_TYPE
if (std_match(instr, "000000")) then Instruction_type <= R_type;
elsif (std_match(instr, "011100")) then Instruction_type <= I_type_special_2; -- I-type instruction SPECIAL 2 custom instruction
---------I_TYPE
elsif (std_match(instr, "001001")) then Instruction_type <= ADDIU;
elsif (std_match(instr, "001000")) then Instruction_type <= ADDI;
elsif (std_match(instr, "001011")) then Instruction_type <= SLTIU;
elsif (std_match(instr, "001100")) then Instruction_type <= ANDI;
elsif (std_match(instr, "001101")) then Instruction_type <= ORI;
elsif (std_match(instr, "001110")) then Instruction_type <= XORI;
elsif (std_match(instr, "001111")) then Instruction_type <= load; -- LUI
elsif (std_match(instr, "001010")) then Instruction_type <= SLTI; -- slti
elsif (std_match(instr, "101011")) then Instruction_type <= store; -- store instruction
elsif (std_match(instr, "101000")) then Instruction_type <= store; -- store byte instruction
elsif (std_match(instr, "100011")) then Instruction_type <= load; -- load instruction
-- elsif (std_match(instr, "100000")) then Instruction_type <= load;
-- load byte instruction
---------BRANCHES
elsif (std_match(instr, "000100")) then Instruction_type <= BEQ;
elsif (std_match(instr, "000101")) then Instruction_type <= BNE;
elsif (std_match(instr, "000111")) then Instruction_type <= BGTZ;
elsif (std_match(instr, "000110")) then Instruction_type <= BLEZ;
elsif (std_match(instr, "000001")) then Instruction_type <=
BRANCHES; -- BLTZ,BGEZ,BGEZAL,BLTZAL
--------J_TYPE
elsif (std_match(instr, "000010")) then Instruction_type <= J; -- jump instruction
elsif (std_match(instr, "000011")) then Instruction_type <= JAL; -- jal (jump and link)
else Instruction_type <= Undefined;
report " +++ unimplemented instruction type !! ";
end if;
end process;
----------------------------------------------
WaitRequest_i <= '0' when (mul_taken = '1' and mul_wait = '0') else
'1';
WaitRequest_comb <= WaitRequest and WaitRequest_i;
----------------------for multiplication instructions 2 cycle
InstMulreg : process(clk)
begin
if rising_edge(clk) then
if MTHI = '1' then
HI <= R_DATA_A;
elsif mul_wait = '1' then
HI <= ALU_out64(63 downto 32);
end if;
if MTLO = '1' then
LO <= R_DATA_A;
elsif mul_wait = '1' then
LO <= ALU_out64(31 downto 0);
end if;
if mul_taken = '1' and mul_wait = '0' then
mul_wait <= '1';
else
mul_wait <= '0';
end if;
end if;
end process;
-------------------------for sending the trap in case of custom instructions
process(Instruction_type, funct)
begin
if(Instruction_type = I_type_special_2)then
if (funct = "010000" or funct = "010001" or funct =
"100000" or funct = "100001") then --I:CUST
trap_start <= '1';
else
trap_start <= '0';
end if;
else
trap_start <= '0';
end if;
end process;
-------------------------------------------
------------- ALU -------------------------
--------------------------------------------
D_write_en <= local_D_write_en;
D_ADR <= ALU_out;
D_W_DATA <= R_DATA_B;
------
p_ALU: process (PC, hi, lo, WaitRequest_comb ,RES_0,
Instruction_type, funct, instr, rt, rd, rs, sa, immediate, immediateU,
SL2immediate, R_DATA_A, R_DATA_B, ALU_out, ALU_out64, W_ADR)
begin
--initialising values:
ALU_out <= (others => '0');
ALU_out64 <= (others => '0');
JumpTaken <= '0';
BranchTaken <= '0';
W_ADR <= (others => '0');
RegFile_en <= '0';
D_read_en <= '0';
local_D_write_en <= '0';
JumpTakenJR <= '0';
MTHI <= '0';
MTLO <= '0';
mul_taken <= '0';
case Instruction_type is
when R_type => RegFile_en <= '1';
W_ADR <= rd;
case funct is
when B"00_00_00" => ALU_out <= std_logic_vector(unsigned(R_DATA_B) SLL to_integer(unsigned(sa))); --I:SLL
when B"00_00_10" => ALU_out <= std_logic_vector(unsigned(R_DATA_B) SRL to_integer(unsigned(sa))); --I:SRL
-- Variable shifts use only the low five bits of rs, as the MIPS ISA specifies.
when B"00_01_10" => ALU_out <= std_logic_vector(unsigned(R_DATA_B) SRL to_integer(unsigned(R_DATA_A(4 downto 0)))); --I:SRLV
when B"00_01_00" => ALU_out <= std_logic_vector(unsigned(R_DATA_B) SLL to_integer(unsigned(R_DATA_A(4 downto 0)))); --I:SLLV
-- shift_right on a signed operand replicates the sign bit (arithmetic shift);
-- SRL applied to a signed operand is only a logical shift in numeric_std.
when B"00_00_11" => ALU_out <= std_logic_vector(shift_right(signed(R_DATA_B), to_integer(unsigned(sa)))); --I:SRA
when B"00_01_11" => ALU_out <= std_logic_vector(shift_right(signed(R_DATA_B), to_integer(unsigned(R_DATA_A(4 downto 0))))); --I:SRAV
when B"10_10_10" => if signed(R_DATA_A) < signed(R_DATA_B)
then --I:SLT
ALU_out <= x"00000001"; --I:SLT
else --I:SLT
ALU_out <= (others => '0'); --I:SLT
end if; --I:SLT
when B"10_10_11" => if unsigned(R_DATA_A) <
unsigned(R_DATA_B) then --I:SLTU
ALU_out <= x"00000001"; --I:SLTU
else --I:SLTU
ALU_out <= (others => '0'); --I:SLTU
end if; --I:SLTU
when B"10_00_01" => ALU_out <=
std_logic_vector(unsigned(R_DATA_A) + unsigned(R_DATA_B)); --I:ADDU
when B"10_00_00" => ALU_out <=
std_logic_vector(signed(R_DATA_A) + signed(R_DATA_B)); --I:ADD
when B"10_00_10" => ALU_out <=
std_logic_vector(signed(R_DATA_A) - signed(R_DATA_B)); --I:SUB
when B"10_00_11" => ALU_out <=
std_logic_vector(unsigned(R_DATA_A) - unsigned(R_DATA_B)); --I:SUBU
when B"10_01_00" => ALU_out <= R_DATA_A and R_DATA_B; --I:AND
when B"10_01_01" => ALU_out <= R_DATA_A or R_DATA_B; --I:OR
when B"10_01_10" => ALU_out <= R_DATA_A xor R_DATA_B; --I:XOR
when B"10_01_11" => ALU_out <= R_DATA_A nor R_DATA_B; --I:NOR
when B"01_00_00" => ALU_out <= HI; --I:MFHI
when B"01_00_10" => ALU_out <= LO; --I:MFLO
when B"01_00_01" => MTHI <= '1'; --I:MTHI
when B"01_00_11" => MTLO <= '1'; --I:MTLO
when B"00_10_00" => JumpTakenJR <= '1'; --I:JR
RegFile_en <= '0'; --I:JR
when B"00_10_01" => ALU_out <= std_logic_vector(unsigned(PC)
+ 8); --I:JALR
JumpTakenJR <= '1'; --I:JALR
when B"00_10_11" => ALU_out <= R_DATA_A; --I:MOVN
if R_DATA_B = x"00000000" then --I:MOVN
RegFile_en <= '0'; --I:MOVN
end if; --I:MOVN
when B"00_10_10" => ALU_out <= R_DATA_A; --I:MOVZ
if R_DATA_B /= x"00000000" then --I:MOVZ
RegFile_en <= '0'; --I:MOVZ
end if; --I:MOVZ
when B"01_10_00" => mul_taken <= '1';
ALU_out64 <=
std_logic_vector(signed(R_DATA_A) * signed(R_DATA_B));
when B"01_10_01" => mul_taken <= '1';
ALU_out64 <=
std_logic_vector(unsigned(R_DATA_A) * unsigned(R_DATA_B));
when others => report " +++ unimplemented instruction type !! ";
end case;
----------------------------------------------I_TYPE
when ADDIU => RegFile_en <= '1'; --I:ADDIU
W_ADR <= rt; --I:ADDIU
ALU_out <= std_logic_vector(unsigned(R_DATA_A) +
unsigned(immediate)); --I:ADDIU
when ADDI => RegFile_en <= '1'; --I:ADDI
W_ADR <= rt; --I:ADDI
ALU_out <= std_logic_vector(signed(R_DATA_A) +
signed(immediate)); --I:ADDI
when SLTIU =>RegFile_en <= '1'; --I:SLTIU
W_ADR <= rt; --I:SLTIU
-- MIPS SLTIU sign-extends the immediate and then compares as unsigned.
if unsigned(R_DATA_A) < unsigned(immediate) then --I:SLTIU
ALU_out <= (0 => '1', others => '0'); --I:SLTIU
else --I:SLTIU
ALU_out <= (others => '0'); --I:SLTIU
end if; --I:SLTIU
when SLTI =>RegFile_en <= '1'; --I:SLTI
W_ADR <= rt; --I:SLTI
if signed(R_DATA_A) < signed(immediate) then --I:SLTI
ALU_out <= (0 => '1', others => '0'); --I:SLTI
else --I:SLTI
ALU_out <= (others => '0'); --I:SLTI
end if; --I:SLTI
when ANDI =>RegFile_en <= '1'; --I:ANDI
W_ADR <= rt; --I:ANDI
ALU_out <= R_DATA_A and immediateU; --I:ANDI
when ORI =>RegFile_en <= '1'; --I:ORI
W_ADR <= rt; --I:ORI
ALU_out <= R_DATA_A or immediateU; --I:ORI
when XORI =>RegFile_en <= '1'; --I:XORI
W_ADR <= rt; --I:XORI
ALU_out <= R_DATA_A xor immediateU; --I:XORI
-------------------------- load instruction
when load =>
RegFile_en <= '1';
W_ADR <= rt;
case instr is
when B"10_00_11" =>
ALU_out <= std_logic_vector(signed(immediate) + signed(R_DATA_A)); --I:LW
local_D_write_en <= '0';
D_read_en <= '1'; --I:LW
when B"10_00_00" =>
ALU_out <= std_logic_vector(signed(immediate) + signed(R_DATA_A)); --I
local_D_write_en <= '0';
D_read_en <= '1'; --I
when B"00_11_11" =>
ALU_out <= immediate(15 downto 0) & X"0000"; --I:LUI
local_D_write_en <= '0';
when others => report " +++ unimplemented load instruction !! ";
end case;
------------------------------------ store instruction
when store =>
case instr is
-- ALU_out == address
-- address = memory[base+offset], base 25-21, offset 15-0
when B"10_10_11" => ALU_out <= std_logic_vector(signed(immediate) + signed(R_DATA_A)); --I:SW
local_D_write_en <= '1'; --I:SW
when B"10_10_00" => ALU_out <= std_logic_vector(signed(immediate) + signed(R_DATA_A)); --I:SB
local_D_write_en <= '1'; --I:SB
report " +++ store byte executed as store word !! ";
when others => report " +++ unimplemented store instruction !! ";
end case;
---------------------------------------------JUMP
when J =>JumpTaken <= '1'; --I:J
when JAL =>RegFile_en <= '1'; --I:JAL
W_ADR <= "11111"; --I:JAL
ALU_out <= std_logic_vector(unsigned(PC) + 8); --I:JAL
JumpTaken <= '1'; --I:JAL
------------------------------------------CUSTOMS
when I_type_special_2 =>
RegFile_en <= '1'; --I:?
W_ADR <= rd; --I:?
if (funct = "010000" or funct = "010001" or funct = "100000" or funct = "100001") then --I:CUST
ALU_out <= RES_0; --I:CUST
else
report " +++ not custom instruction type !! ";
end if;
-------------------------------------------BRANCHES
when BNE => if R_DATA_A /= R_DATA_B then --I:BNE
BranchTaken <= '1'; --I:BNE
end if; --I:BNE
when BEQ => if R_DATA_A = R_DATA_B then --I:BEQ
BranchTaken <= '1'; --I:BEQ
end if; --I:BEQ
when BGTZ =>if signed(R_DATA_A) > x"00000000" then --I:BGTZ
BranchTaken <= '1'; --I:BGTZ
end if; --I:BGTZ
when BLEZ =>if signed(R_DATA_A) <= x"00000000" then --I:BLEZ
BranchTaken <= '1'; --I:BLEZ
end if; --I:BLEZ
when BRANCHES => if rt = "00000" then --I:BLTZ
if signed(R_DATA_A) < x"00000000" then --I:BLTZ
BranchTaken <= '1'; --I:BLTZ
end if; --I:BLTZ
elsif rt = "00001" then --I:BGEZ
if signed(R_DATA_A) >= x"00000000" then --I:BGEZ
BranchTaken <= '1'; --I:BGEZ
end if; --I:BGEZ
elsif rt = "10001" then --I:BGEZAL
RegFile_en <= '1'; --I:BGEZAL (enable the link write to register 31)
W_ADR <= "11111"; --I:BGEZAL
ALU_out <= std_logic_vector(unsigned(PC) + 8); --I:BGEZAL
if signed(R_DATA_A) >= x"00000000" then --I:BGEZAL
BranchTaken <= '1'; --I:BGEZAL
end if; --I:BGEZAL
elsif rt = "10000" then --I:BLTZAL
RegFile_en <= '1'; --I:BLTZAL (enable the link write to register 31)
W_ADR <= "11111"; --I:BLTZAL
ALU_out <= std_logic_vector(unsigned(PC) + 8); --I:BLTZAL
if signed(R_DATA_A) < x"00000000" then --I:BLTZAL (strictly less than zero)
BranchTaken <= '1'; --I:BLTZAL
end if; --I:BLTZAL
end if;
-------------------------------------------
when Undefined =>report " +++ undefined instruction !! ";
when others =>report " +++ unimplemented instruction type !! ";
end case;
end process;
-----------------------------------------------
-------------------PROGRAM-COUNTER-------------
-----------------------------------------------
nextPC <= PC4 when PCstate = Normal else
branchPC when PCstate = branchs else
JumbPC when PCstate = Jumbs else
PC4;
------------------------------------------
I_ADR <= nextPC when WaitRequest_comb = '1' else PC;
PC4 <= std_logic_vector(unsigned(PC) + 4);
process (clk)
begin
if clk'event and clk = '1' then
if WaitRequest_comb = '1' then
if reset = '1' then
PCstate <= Normal;
PC <= X"BFC00000"; ---MIPS reset address
branchPC <= X"BFC00000"; ---MIPS reset address
JumbPC <= X"BFC00000"; ---MIPS reset address
else
PC <= nextPC;
---------------------------
case PCstate is
-- "Normal" state of PC:
when Normal =>
--If a branch is taken:
if BranchTaken = '1' then
PCstate <= branchs;
branchPC <= std_logic_vector(signed(PC4) +
signed(SL2immediate));
--If a jump is taken:
elsif JumpTaken = '1' then
PCstate <= Jumbs;
JumbPC <= PC4(31 downto 28) & immediateJ;
-- If a jump from register is taken:
elsif JumpTakenJR = '1' then
PCstate <= Jumbs;
JumbPC <= R_DATA_A;
else
PCstate <= Normal;
end if;
-- branch and jump states of PC:
when branchs =>
PCstate <= Normal;
when Jumbs=>
PCstate<=Normal;
when others =>
PCstate <= Normal;
end case;-- case PCstate
--------------------------------
end if;--reset
end if;--wait
end if;--clk
end process;
-------------------------------
end;
Appendix B - Trap handler based on MUX
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
----------------------------------------------
entity trapHandler is
Port (
clk : IN std_logic;
-- clk100 : in std_logic;
reset : IN std_logic;
address : IN std_logic_vector(31 DOWNTO 0);
opcode : in std_logic_vector (5 downto 0);
writedata : IN std_logic_vector(31 DOWNTO 0);
commandIn : in STD_LOGIC_VECTOR (31 downto 0);
readdata : OUT std_logic_vector(31 DOWNTO 0);
WaitRequest : in std_logic
);
end trapHandler;
architecture Behavioral of trapHandler is
component custom1_module is
port ( data_in : in std_logic_vector (31 downto 0);
crc_en , reset, clk : in std_logic;
crc_out : out std_logic_vector (31 downto 0));
end component;
------------------------------------------
component custom2_module is
port ( data_in : in std_logic_vector (31 downto 0);
one_out : out std_logic_vector (31 downto 0));
end component;
------------------------------------------
component custom3_module is
port ( data_in : in std_logic_vector (31 downto 0);
parity_out : out std_logic_vector (31 downto 0));
end component;
----------------------------------------------
component custom4_module is
port ( data_in : in std_logic_vector (31 downto 0);
zero_out : out std_logic_vector (31 downto 0));
end component;
----------------------------------------------
signal reg :std_logic_vector(31 downto 0);
signal sel1 : std_logic;
signal sel2 : std_logic;
signal sel3 : std_logic;
signal sel4 : std_logic;
signal readdata1 : std_logic_vector(31 downto 0);
signal readdata2 : std_logic_vector(31 downto 0);
signal readdata3 : std_logic_vector(31 downto 0);
signal readdata4 : std_logic_vector(31 downto 0);
-------------------------------------------------
begin
reg<=writedata; --op_A and writedata
inst1: custom1_module PORT MAP(
data_in => reg,
crc_en => sel1,
reset => reset,
clk => clk,
crc_out => readdata1
);
---------------------------------------------
inst2: custom2_module PORT MAP(
data_in => reg,
one_out => readdata2
);
---------------------------------------------
inst3: custom3_module PORT MAP(
data_in => reg,
parity_out => readdata3
);
----------------------------------------------
inst4 : custom4_module PORT MAP(
data_in => reg,
zero_out => readdata4
);
-------------------------------------
sel1 <= '1' when (opcode = "010000") else '0';
sel2 <= '1' when (opcode = "100001") else '0';
sel3 <= '1' when (opcode = "010001") else '0';
sel4 <= '1' when (opcode = "100000") else '0';
--------------------------------------
process(readdata1, readdata2, readdata3, readdata4, sel1, sel2, sel3,
sel4)
begin
if(sel1 = '1')then
readdata <= readdata1;
elsif(sel2 = '1')then
readdata <= readdata2;
elsif(sel3 = '1')then
readdata <= readdata3;
elsif(sel4 = '1')then
readdata <= readdata4;
else
readdata <= (others => '0');
end if;
end process;
end Behavioral;
Appendix C - Trap handler based on ICAP
-------------------------------------------------------------------------
---------
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use ieee.std_logic_unsigned.all;
entity trapHandler is
Port (
opcode : in std_logic_vector (5 downto 0);
dataIn : in STD_LOGIC_VECTOR (31 downto 0);
commandIn : in STD_LOGIC_VECTOR (31 downto 0);
trap_start : in std_logic;
clk : in STD_LOGIC;
rst : in STD_LOGIC;
dataOut : out STD_LOGIC_VECTOR (31 downto 0));
end trapHandler;
architecture Behavioral of trapHandler is
component custom_module is
port ( data_in : in std_logic_vector (31 downto 0);
start , rst, clk : in std_logic;
CustomInstID : out std_logic_vector (5 downto 0);
done : out std_logic;
data_out : out std_logic_vector (31 downto 0));
end component;
component ICAP_SPARTAN6 is
port (
clk : in std_logic;
ce : in std_logic;
WRITE : in std_logic;
I : in std_logic_vector(15 downto 0);
O : out std_logic_vector(15 downto 0);
busy : out std_logic
);
end component;
signal start : std_logic;
signal custom_done : std_logic;
TYPE st IS (st0, st1, st2, st3);
SIGNAL currentState, nextState: st;
signal command_register : std_logic_vector(223 downto 0);
signal command_register_reg : std_logic_vector(223 downto 0);
signal MB_StartAddr : std_logic_vector(23 downto 0);
signal FB_StartAddr : std_logic_vector(23 downto 0);
constant MB_StartAddr1 : std_logic_vector(23 downto 0):=
X"100000";
constant MB_StartAddr2 : std_logic_vector(23 downto 0):=
X"200000";
constant MB_StartAddr3 : std_logic_vector(23 downto 0):=
X"300000";
constant MB_StartAddr4 : std_logic_vector(23 downto 0):=
X"400000";
constant FB_StartAddr1 : std_logic_vector(23 downto 0):=
X"100000";
constant FB_StartAddr2 : std_logic_vector(23 downto 0):=
X"200000";
constant FB_StartAddr3 : std_logic_vector(23 downto 0):=X"300000";
constant FB_StartAddr4 : std_logic_vector(23 downto 0):=
X"400000";
signal custom_start : std_logic;
signal icap_datain : std_logic_vector(15 downto 0);
signal icap_dataout : std_logic_vector(15 downto 0);
signal icap_busy : std_logic;
signal icap_write : std_logic;
signal count : std_logic_vector(3 downto 0);
signal opcode1 : std_logic_vector(7 downto 0):= X"00";
signal opcode2 : std_logic_vector(7 downto 0):= X"00";
signal CustomInstID : std_logic_vector(5 downto 0);
begin
--instantiate the ICAP module
ICAP_inst: ICAP_SPARTAN6
port map(
clk => clk,
ce => (not rst),
WRITE => icap_write,
I => icap_datain,
O => icap_dataout,
busy => icap_busy
);
--select ICAP write or read command
icap_write <= '1' when (currentState = st1) else '0';
--send the data to the ICAP module from the command_register_reg
icap_datain <= command_register_reg(223 downto 208);
--implement a shift register to hold the command, which needs to be sent to the ICAP module
process(clk,rst)
begin
if(rst = '0') then
command_register_reg <= (others => '0');
elsif(rising_edge(clk))then
if(currentState = st1)then
--rotate left by 16 bits, so the next 16-bit command word moves to the top
command_register_reg <= command_register_reg(207 downto 0) &
command_register_reg(223 downto 208);
else
command_register_reg <= command_register;
end if;
end if;
end process;
--command, that is to be sent to the ICAP module
command_register <= X"FFFF" & X"AA99" & X"5566" & X"3261" &
MB_StartAddr(15 downto 0) & X"3281" & opcode1 & MB_StartAddr(23 downto 16) &
X"32A1" & FB_StartAddr(15 downto 0) & X"32C1" & opcode2 &
FB_StartAddr(23 downto 16) & X"30A1" & X"000E" & X"2000";
--Master bitstream address selection on the basis of the opcode
MB_StartAddr <= MB_StartAddr1 when (opcode = "010000") else
MB_StartAddr2 when (opcode = "100001") else
MB_StartAddr3 when (opcode = "010001") else
MB_StartAddr4 when (opcode = "100000") else
(others => '0');
--Feedback bitstream address selection on the basis of the opcode
FB_StartAddr <= FB_StartAddr1 when (opcode = "010000") else
FB_StartAddr2 when (opcode = "100001") else
FB_StartAddr3 when (opcode = "010001") else
FB_StartAddr4 when (opcode = "100000") else
(others => '0');
--assign nextState to the currentState on the clock edge
process(clk,rst)
begin
if(rst = '0') then
currentState <= st0;
elsif(rising_edge(clk))then
currentState <= nextState;
end if;
end process;
--decide nextState on the basis of currentState, count, trap_start, custom_done,
--opcode and CustomInstID
process(currentState, count, trap_start, custom_done, opcode, CustomInstID)
begin
case (currentState) is
--st0 is the reset state, here it will wait for the trap_start signal
when st0 =>
if(trap_start = '1')then
--if the currently loaded custom instruction is the same as the required one, then go to st2
--else go to st1
if(opcode = CustomInstID)then
nextState <= st2;
else
nextState <= st1;
end if;
else
nextState <= ST0;
end if;
when st1 =>
--in st1, the command to the ICAP module is sent in the 14 clock cycles
--here it will check the counter, if its equal to 13, then move to ST2
if(count = "1101")then
nextState <= ST2;
else
nextState <= ST1;
end if;
when st2 =>
nextState <= st3;
when st3 =>
--now start the custom module, to run the custom command
if(custom_done = '1')then
nextState <= st0;
else
nextState <= st3;
end if;
when others =>
nextState <= st0;
end case;
end process;
--implement a counter, which is used while sending command to the ICAP module
process(clk,rst)
begin
if(rst = '0') then
count <= (others => '0');
elsif(rising_edge(clk))then
-- if currentState is st1, then count
if(currentState = st1)then
count <= count + '1';
else
count <= (others => '0');
end if;
end if;
end process;
--instantiate the custom instruction module
inst1: custom_module PORT MAP(
CustomInstID => CustomInstID,
data_in => dataIn,
start => start,
rst => rst,
clk => clk,
done => custom_done,
data_out => dataOut
);
--start the custom module, when state = st3
custom_start <= '1' when (currentState = st3) else '0';
start <= custom_start when ((opcode = "010000") or (opcode =
"100001") or (opcode = "010001") or (opcode = "100000")) else '0';
end Behavioral;