RUN-TIME CUSTOMIZATION OF

A SOFT-CORE CPU ON AN FPGA

A DISSERTATION SUBMITTED TO THE UNIVERSITY OF MANCHESTER

FOR THE DEGREE OF MASTER OF SCIENCE

IN THE FACULTY OF ENGINEERING AND PHYSICAL SCIENCES

2015

By

Rehab Abdullah Shendi

School of Computer Science

Contents

Abstract
Declaration
Copyright
Acknowledgements
Dedication
1 Introduction
1.1 Aim and Objectives
1.2 Report Outline
2 Background
2.1 Reconfigurable Computing
2.1.1 History
2.1.2 FPGA
2.1.3 Reconfiguration Hardware
2.1.4 Partial Reconfiguration
2.2 Microprocessor Architecture
2.2.1 RISC Microprocessor
2.2.2 Soft-Core Microprocessor
2.2.3 MIPS Architecture
2.3 Reconfigurable CPU Instruction Set Extensions
2.3.1 Custom Instructions in Hardware
2.3.2 Custom Instructions in Software
2.4 Design Considerations
2.5 Previous Work
2.5.1 Instruction Set Extension
2.5.2 Partial Reconfiguration
3 System Design and Methodology
3.1 System Development Methodology
3.2 Implementation Tools
3.2.1 Hardware Description Language
3.2.2 Xilinx ISE (Xilinx, 2013)
3.2.3 Cross compiler
3.2.4 FPGA Platform
3.2.5 GoAhead
3.3 System Design
3.3.1 System Definition and Scope
3.3.2 System Architecture and Components
4 Implementation
4.1 Baseline MIPS Soft-Core
4.2 Custom Instruction in Software
4.3 Configuration Controller Modules
4.4 Custom Instruction in Hardware
4.5 Challenges During Implementation
5 Testing, Results and Evaluation
5.1 Testing
5.2 Results
5.3 Evaluation
6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
Works Cited
Appendix A - MIPS CPU
Appendix B - Trap handler based on MUX
Appendix C - Trap handler based on ICAP

(Word count 16033)

List of Tables

Table 1 Configuration speeds with ICAP achievement (Hansen, Koch and Torresen, 2011).
Table 2 Type of MIPS instructions (Fritzell, 2013).
Table 3 Descriptions of using ICAP_SPARTAN6 Port (Xilinx Inc, 2015).
Table 4 Custom instructions’ address and ID.
Table 5 An example of bitstream for the IPROG command using ICAP (Xilinx Inc, 2015).
Table 6 Resource requirements for Configuration controller.
Table 7 Resource requirements for Custom modules.
Table 8 Comparison between Xilinx embedded processors and our soft-core, and their performance.
Table 9 Software requirements.

List of Figures

Figure 1 Classification of FPGAs (Koch, 2013).
Figure 2 Baseline model of partial reconfiguration (Koch, 2013).
Figure 3 Styles of reconfigurable module placement. (a) Island style. (b) Slot style. (c) Grid style (Koch, 2013).
Figure 4 (a) A typical CPU. (b) A CPU extended with reconfigurable instructions (Koch, 2013).
Figure 5 Design and development tools (Minev and Kukenska, 2007).
Figure 6 The general approach of the system development stages (Soft, 2013).
Figure 7 A step-by-step design and implementation method.
Figure 8 First step, system overview.
Figure 9 Third step, system overview.
Figure 10 Fourth step, system overview.
Figure 11 Fifth step: system overview of the first approach.
Figure 12 Fifth step: system overview of the second approach (Xilinx, 2012).
Figure 13 Xilinx Spartan-6 LX16 FPGA platform (Nexys3™ Board Reference Manual, 2013).
Figure 14 The final system design.
Figure 15 The non-pipelined MIPS, showing the most important signals and logic (Fritzell, 2013).
Figure 16 ICAP primitive (Xilinx Inc, 2015).
Figure 17 Custom module logic.
Figure 18 The program counter process overview, which consists of extra logic and flip-flops to handle branch and jump instructions (Fritzell, 2013).
Figure 19 Datapath for the multiplication, allowing two clock cycles for execution (Fritzell, 2013).
Figure 20 Adding a custom instruction in the compiler.
Figure 21 Trap handler state machine.
Figure 22 Custom instruction (CI) acting as an extension of the ALU.
Figure 23 On-FPGA communication for custom instructions.
Figure 24 Static implementation.
Figure 25 Partial part: the example shows the implementation of the CRC instruction.
Figure 26 GoAhead GUI. The graphical user interface of GoAhead.
Figure 27 GoAhead script.
Figure 28 Test bench of the MIPS CPU and ROM. Panels (a), (b) and (c) present one test bench showing different signals: (a) instruction encoding, decoding and ALU functionality; (b) program counter functionality; (c) branch delay and ROM functionality.
Figure 29 ModelSim simulation of the CRC-32 module.
Figure 30 ModelSim simulation of the One Counter module.
Figure 31 ModelSim simulation of the parity generation module.
Figure 32 ModelSim simulation of the Leading Zero Counter module.
Figure 33 ModelSim simulation of the MUX-based trap handler.
Figure 34 ModelSim simulation of the ICAP-based trap handler.

Abstract

RUN-TIME CUSTOMIZATION OF A SOFT-CORE CPU ON AN FPGA

Rehab Abdullah Shendi

A dissertation submitted to the University of Manchester for the degree of Master of Science, 2015

Customised soft-core processors, whose instruction sets can be extended with application-specific hardware, are increasingly used on Field Programmable Gate Arrays (FPGAs). In particular, partial run-time reconfiguration of the FPGA can be very beneficial for processors specialised for a particular domain. This report addresses the design and implementation of a customisable soft-core MIPS processor that uses partial reconfiguration (PR) of the FPGA to achieve efficient resource use. A PR design flow helps the design fit into a smaller device, and run-time reconfiguration could also reduce the impact of static power consumption. Customisation is achieved through configurable custom instructions implemented in hardware as an extension of the MIPS CPU. The aim of this project is to investigate the PR of FPGAs for run-time adaptation of the instruction set of a soft-core CPU, including the integration of custom instructions and an exploration of the potential of the MultiBoot feature available in Xilinx FPGAs to carry out the PR process. The system is evaluated and tested on a Nexys 3 development board featuring a Xilinx Spartan-6 FPGA. With the help of a trap handler, the system loads reconfigurable custom instructions dynamically when a user program running on the MIPS CPU calls them. The results demonstrate that custom instructions in hardware can speed up certain functions and save many instructions compared to a software implementation of the same function. Implementing custom instructions in hardware is therefore feasible and worth exploring.

Declaration

No portion of the work referred to in this dissertation has been submitted in

support of an application for another degree or qualification of this or any other

university or other institute of learning.

Copyright

i. The author of this thesis (including any appendices and/or schedules to this

thesis) owns certain copyright or related rights in it (the “Copyright”) and s/he has

given The University of Manchester certain rights to use such Copyright, including

for administrative purposes.

ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic

copy, may be made only in accordance with the Copyright, Designs and Patents

Act 1988 (as amended) and regulations issued under it or, where appropriate, in

accordance with licensing agreements which the University has from time to time.

This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trademarks and other

intellectual property (the “Intellectual Property”) and any reproductions of copyright

works in the thesis, for example graphs and tables (“Reproductions”), which may

be described in this thesis, may not be owned by the author and may be owned by

third parties. Such Intellectual Property and Reproductions cannot and must not

be made available for use without the prior written permission of the owner(s) of

the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and

commercialisation of this thesis, the Copyright and any Intellectual Property and/or

Reproductions described in it may take place is available in the University IP

Policy (see http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=487), in

any relevant Thesis restriction declarations deposited in the University Library,

The University Library’s regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University’s policy on presentation of Theses.

Acknowledgements

I would like to thank my supervisor, Dirk Koch, for giving me the opportunity to

work in my favourite, and dream field in Computer Sciences: Computer System

Engineering. His remarkable teaching and coaching strategies enabled me to give

my best from the first day; without him this dream would not have been realised.

Special thanks also go to my parents, sisters and my small family - Fahad, Qusai

and Retal - for their help and encouragement during my studies. Thanks also to

my friends for their support.

Dedication

From my heart to my brother—you are still here in my heart and mind. I miss you

always, my best friend.

Chapter 1

1 Introduction

Field Programmable Gate Arrays (FPGAs) have become popular over the last decade as they allow designers to create complex digital designs at a low implementation cost. Application-Specific Integrated Circuits (ASICs), in contrast, incur a high initial cost and require a large amount of resources to create complex designs.

Modern FPGAs now occupy a central position in industry because they offer over 1000 multipliers, megabytes of on-chip memory, hundreds of thousands of logic cells and clock speeds of up to half a gigahertz. Moreover, the cost per function of FPGAs decreases significantly over time (Koch, 2013).

Partial Reconfiguration (PR) is one of the most important features of modern FPGAs provided by the FPGA vendor Xilinx. It allows modules running on an FPGA to be dynamically reconfigured and swapped during execution while the remaining modules continue operating. PR is an active research topic in the Reconfigurable Computing and Adaptive Hardware field. FPGAs are less efficient in area, power and speed than ASICs; however, an FPGA can be made more efficient than a static system when all or parts of its hardware are reconfigured at run-time, during execution.

Extending a soft-core instruction set with user-defined instructions that speed up the execution of applications in a specific domain can benefit greatly from PR. Reconfigurable modules of different sizes can be integrated into the system and placed on the FPGA at run-time, and they can communicate efficiently with the rest of the system without additional delay. In this project, a MIPS soft-core is extended with a user-defined instruction set with the help of PR. The aim of the project is to explore the efficient use of partial run-time reconfiguration with a library of CPU instruction set extensions.

This chapter presents basic information about the project. Section 1.1 describes

the aims and objectives of the project, and section 1.2 presents the report outline.

1.1 Aim and Objectives

The aim of this project is to investigate Partial Reconfiguration (PR) of FPGAs for run-time adaptation of the instruction set of a soft-core CPU, including the integration of custom instructions. The report presents a practical introduction to the design of a soft-core processor with extensions, in which the system is integrated step by step for partial reconfiguration using the GoAhead tool flow. GoAhead supports all recent Xilinx FPGAs and includes some features that are not available in the other PR tools provided by the FPGA vendor Xilinx (Beckhoff, et al., 2012), as will be introduced in chapter 3.

The objective of this project is to investigate a custom instruction module library that offers low latency and low implementation cost in terms of logic resources, and that achieves high CPU clock cycle savings compared to software-only implementations.

• Learning Objectives

– Investigate and understand the concept of reconfigurable hardware.

– Review how custom instructions can be applied as an extension of the soft-

core.

– Investigate and understand the concept of PR.

– Investigate and understand the MultiBoot reconfiguration feature and its potential for use with PR.

• Deliverable Objectives

– Develop and implement custom instructions as an extension of a given soft-

core on an FPGA.

– Understand and implement reconfigurable custom instructions for a soft-core

on an FPGA.

– Analyse previous results and establish a performance concept.

1.2 Report Outline

Chapter 2: Background

This chapter will provide an overview of the relevant literature and related works

as an introduction to reconfigurable hardware and FPGA architecture. PR

concepts and details regarding the reconfiguration of FPGA devices will be

included. Finally, microprocessor architecture, with a focus on MIPS and

reconfigurable instruction set extensions, will be introduced.

Chapter 3: System design and methodology

This chapter introduces the system methodology considered for this project. The

whole system used in the project, including the MIPS CPU and the peripheral

components (memory, GPIO, ROM, and trap handler) connected by the system

bus will be presented.

Chapter 4: System implementation

This chapter discusses the implementation of the final system’s components and

all related technical issues.

Chapter 5: Testing, results and evaluation

This chapter presents the tests conducted in this study, the results of these tests

and an overall evaluation of the system.

Chapter 6: Conclusion and further work

This chapter summarises the report and presents recommendations for further

improvement of the implemented system.

Appendix

Three appendices have been included:

Appendix A contains the VHDL code for the MIPS CPU.

Appendix B contains the VHDL code for the MUX-based trap handler.

Appendix C contains the VHDL code for the ICAP-based trap handler.

Chapter 2

2 Background

The background research covers three areas. First, the general area of reconfigurable computing, including FPGA architecture, is discussed. The second part discusses microprocessor architectures. Finally, the third part looks at the specific area of this project: reconfigurable CPU instruction set extensions.

2.1 Reconfigurable Computing

Reconfigurable computing is a computing paradigm that combines the flexibility of software with the high processing performance of hardware through the use of flexible high-speed fabrics such as FPGAs. It provides the ability to make substantial changes to the data path itself, in addition to the control flow. Additionally, the underlying hardware can be adapted during run-time by loading a new circuit onto the reconfigurable fabric (Koch, 2013).

2.1.1 History

According to Bobda (2008), the history of reconfigurable computing can be traced back to the 1960s, when Gerald Estrin proposed a computer architecture made up of a standard processor combined with an array of reconfigurable hardware. The core processor controlled the behaviour of the reconfigurable hardware, which could be adjusted whenever the need arose to perform other tasks, such as image processing (Lysaght & Subrahmanyam, 2005). This led to the development of a hybrid computer structure that combined the flexibility of software with the speed of hardware.

Since then, reconfigurable computing has improved as industry has developed many architectures. Designs that have been introduced to the market include Copacobana, Elixent, Silicon Hive and PiCoGA. The first commercial computer based on a reconfigurable architecture was released in 1991 by Algotronix. Xilinx later acquired Algotronix and adopted this architecture in order to improve it for commercial purposes (Algotronix.com, 2015).

2.1.2 FPGA

Field Programmable Gate Array (FPGA) technology has recently gained a lot of popularity for producing and prototyping products in small and moderate quantities. FPGAs are a special kind of Programmable Logic Device (PLD) that allows the implementation of general digital circuits, limited only by circuit size. The circuit to be implemented is defined by programming the device. The capabilities of FPGAs have grown over the years, and today a whole multiprocessor system can fit on a single device. The complex circuit designs needed for such systems are normally specified using Hardware Description Languages (HDLs), which are preferred because they support circuit description with high-level language constructs.

An FPGA is a chip full of digital logic with programmable connections between its components. FPGA design tools generate configuration files that contain the initial values and the required connections, which can then be downloaded to the FPGA. The key feature of FPGAs is that their design is completely soft and can be reprogrammed. However, this also means that they lose their configuration when power is removed and must be reprogrammed before they provide a working design again (Balwaik, et al., 2013).

The history of FPGAs dates from the late 1980s, when interest grew in extending the functionality of the large Programmable Logic Arrays (PLAs) then under development (Bobda, 2008). The early 1990s saw increased use of FPGAs in the networking and telecommunication industries due to their flexibility. At that time, they were preferred because the hardware design and development stage could be separated from the logic design stage. As such, they were seen as helping vendors to engineer solutions without spending as much time on logic design as Application Specific Integrated Circuits (ASICs) required (Parvez and Mehrez, 2011).

Soft-core processors form part of the background to this study. A soft-core processor is a microprocessor that is completely described in a Hardware Description Language and synthesized for FPGAs. A soft-core processor designed for an FPGA is flexible because it can be modified by reprogramming the device, which is not possible with many other kinds of programmable hardware. Traditionally, such systems were developed using ASIC technology, but ASICs are not designed to allow reconfiguration. FPGAs have been demonstrated to create very powerful and high-performance systems because of their reprogrammability (Musoll, 2010).

One limitation of FPGAs is that very few details of the low-level implementation (e.g. the encoding of the configuration data) are available to end users. Sufficient information about the choices made during the development of FPGA technology is often not provided.

FPGA Architecture

FPGAs can implement arbitrary user logic. Three main resources are available in an FPGA: 1) logic blocks, 2) I/O blocks and 3) programmable interconnect.

Logic blocks

FPGA logic blocks consist of look-up tables (LUTs) and flip-flops (FFs). Each logic block can implement small functions of several variables. The FPGA implements Boolean logic with the help of LUTs, the basic elements of the FPGA architecture, which can be programmed with any logic function that fits into them. A Boolean function is normally represented by a truth table stored in static random access memory (SRAM) cells. A LUT has a fixed number of inputs; one with n inputs is referred to as an n-LUT (Munden, 2005). As such, an n-LUT is essentially a multiplexer whose n inputs select one of the 2^n values held in the configuration memory and forward it to the output signal line.
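The view of an n-LUT as a multiplexer over stored truth-table bits can be illustrated with a short software model. This is only a sketch under stated assumptions: the function names `make_lut` and `lut_read` are hypothetical, and the ordering of the select bits (first input as most significant bit) is an arbitrary choice, not the bit ordering of any particular FPGA family.

```python
from itertools import product

def make_lut(func, n):
    """Return the 2**n configuration bits (the truth table) of an n-input Boolean function."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=n)]

def lut_read(config_bits, inputs):
    """Use the inputs as multiplexer select lines to pick one stored configuration bit."""
    index = 0
    for bit in inputs:  # assumption: first input is the most significant select bit
        index = (index << 1) | (bit & 1)
    return config_bits[index]

# "Configuring" the same 3-LUT structure with two different functions
majority = make_lut(lambda a, b, c: (a + b + c) >= 2, 3)
xor3 = make_lut(lambda a, b, c: a ^ b ^ c, 3)

print(majority)                      # the 8 stored truth-table bits
print(lut_read(majority, (1, 0, 1)))
print(lut_read(xor3, (1, 1, 1)))
```

Only the stored bits differ between the two functions; the lookup logic is unchanged, which is why rewriting the configuration memory is enough to change the implemented function.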

LUT outputs are normally connected to a state flip-flop, which stores the current state of the synchronous circuit. Practical look-up tables provide additional features that vary among FPGA families. These include distributed memory modes, the ability to combine adjacent LUTs into larger LUTs with more inputs, and fast ripple-carry chain logic (Pedroni, 2010). In other words, LUTs are combined with configurable registers and multiplexers to produce a logic cell. The logic cell is the basic building block of the FPGA fabric, in that all logic that is not mapped to special blocks such as DSPs, CPUs or BRAMs is implemented in logic cells. In recent Xilinx FPGAs, for example, four logic cells are combined into a slice, and two slices form a configurable logic block (CLB). Slices can contain logic beyond the basic logic cells in order to implement fast carry chains, shift registers and distributed RAM; dedicated signals and logic between slices in the same column propagate signals through many slices, which removes the need for routing through the interconnect.

I/O blocks

I/O blocks are used to connect the internal logic to the outside pins. I/O blocks are

bidirectional, meaning they can either be used as inputs or outputs depending on

the actual configuration. Different pins may be configured to different standards if

the underlying device can support more than one I/O standard (Munden, 2005).

Programmable interconnection

Programmable interconnections are used to connect different logic blocks. The interconnections between FPGA logic blocks may be programmed in three ways: via SRAM cells, FLASH/electrically erasable programmable read-only memory (EEPROM) cells or antifuses. These hold the configuration that defines the Boolean functions and controls the configured routing.

The majority of FPGAs use SRAM-based programmable interconnects. The SRAM cells drive pass transistors, tri-state buffers and multiplexers. SRAM is a volatile memory technology and needs to be programmed from an external memory each time power is applied to the device. During reconfiguration, these SRAM cells are overwritten with new functions. FLASH, which is based on EEPROM technology, is non-volatile and retains configuration data when power is removed from the device. Antifuse-based programmable interconnects create permanent connections in the configuration cells. Like FLASH, they are non-volatile; unlike FLASH, however, they can be programmed only once, after which the configuration cannot be changed.

The architecture described above gives FPGAs their flexible programming capabilities, but it also accounts for some of the drawbacks of FPGA use. The programmable routing and functional blocks require a large amount of chip area, which increases their latency. FPGAs also consume a significant amount of power, both during programming and during operation, and require configuration memory. Compared to ASICs, FPGAs also exhibit longer circuit delays (Lin et al., 2008).

Configuration details

FPGA configuration occurs when a bitstream is written to a device's configuration port. The bitstream contains the data for the SRAM cells that hold the device's configuration. There are two types of configuration port, external and internal, which have different interfaces to accommodate specific protocols and connections. Xilinx FPGA devices support reconfiguration of individual regions of the device during run-time. The smallest reconfigurable unit is a configuration frame, whose size varies depending on the device.
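Because configuration amounts to writing a bitstream through a port of fixed width, a lower bound on the configuration time follows from simple arithmetic: the number of port words divided by the port clock frequency. The sketch below assumes one port word is accepted per clock cycle with no protocol overhead; the function name and the example figures (a 100 KiB partial bitstream, a 16-bit port clocked at 20 MHz) are illustrative assumptions and not measurements from this project.

```python
def configuration_time_s(bitstream_bytes, port_width_bits, clock_hz):
    """Lower-bound time to write a bitstream, assuming one port word per clock cycle."""
    if port_width_bits <= 0 or clock_hz <= 0:
        raise ValueError("port width and clock frequency must be positive")
    bits = bitstream_bytes * 8
    cycles = -(-bits // port_width_bits)  # ceiling division: a partial word still costs a cycle
    return cycles / clock_hz

# Illustrative values only (assumptions, not results from the dissertation)
t = configuration_time_s(100 * 1024, port_width_bits=16, clock_hz=20_000_000)
print(f"{t * 1e3:.2f} ms")
```

The estimate scales linearly with bitstream size, which is one reason why reconfiguring only a small region is attractive compared with rewriting the whole device.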

2.1.3 Reconfiguration Hardware

The processors used in computing may be classified into three types (Bobda, 2007). The first, the general purpose processor (GPP), executes a sequence of instructions using a control path and a data path, so that new computations require no change to the existing hardware. The second, the domain-specific processor (DSP), is employed in only one particular area of computation. Its data path and operations are tailored to a set of algorithms, which reduces flexibility but boosts performance in the underlying domain. The third type, the application-specific processor (ASIP), achieves the best performance by implementing the algorithm directly in hardware. Moreover, it does not employ instructions, which implies that, unlike the other processors, it is not restricted to sequential execution.

The ideal processor would combine the flexibility of a GPP with the
performance of an ASIP. Modern FPGA technology makes this possible, as FPGAs
can adapt to different problems in the form of reconfigurable hardware, in
which all or part of the hardware structure can be changed during execution.
Despite the high static power consumption of modern FPGA devices, run-time
reconfiguration can create flexible hardware and increase device utilisation.

The architecture of FPGAs can be viewed from the perspective of their
configuration capabilities: at the highest level, FPGAs can be separated into
one-time configurable devices and reconfigurable devices. Figure 1 illustrates
the major classifications of FPGAs with regard to their configuration
capabilities.

Figure 1 Classification of FPGAs (Koch, 2013).

A globally reconfigurable device allows complete device configuration exchange,

while partially configurable devices permit the exchange of only a fraction of the

FPGA resources. PR can be accomplished either with active or passive operations

(i.e. if the FPGA continues or stops operation during configuration).

2.1.4 Partial Reconfiguration

PR is associated with the ability of a reconfigurable device to change a portion of

the reconfigurable hardware circuitry while the other portion is still running. Such

reconfigurable designs require modular circuits created by different

subcomponents. It is possible to swap out some sections of these subcomponents

even when the FPGA is still running (Koch, 2013).

A full reconfiguration is normally carried out while the FPGA is held in reset,
during which an external controller reloads the design into the chip. PR, by
contrast, allows critical parts of the design to keep operating while another
part is reloaded. In addition, PR can save space at run-time, because only the
partially reconfigurable modules that are expected to change need to be stored.
Figure 2 illustrates the baseline model of PR.


Figure 2 Baseline model of partial reconfiguration (Koch, 2013).

Figure 2 shows how active modules are exclusively placed within the

reconfigurable region and how the swapping between the modules is

accomplished through writing a partial configuration bitstream to the configuration

port, as seen by the configuration data stream in the right hand side of Figure 2.

PR is available in most modern FPGAs and allows a subset of the logic fabric to
be dynamically reconfigured while the remaining logic continues to operate
undisturbed. The FPGA vendors Xilinx and Altera both offer this capability on
their high-end devices. PR is not only necessary for general purpose
reconfigurable systems but is also preferred for its extensibility and
flexibility (Koch, 2013).

Partial run-time reconfiguration requires hardware support from the device,
such as that provided by the devices mentioned above. Reconfiguration in one
section of the device must not stop operation in other sections. PR may be
classified according to how long reconfiguration takes relative to the system
clock cycle: sub-cycle reconfiguration, single-cycle reconfiguration (which can
be applied frequently) and multi-cycle reconfiguration (which is applied
seldom). In multi-cycle reconfiguration, reconfiguration requires more than a
single system clock cycle because the reconfiguration data is transferred from
memory to the configuration cells serially. Single-cycle reconfiguration occurs
when the logic on the device is changed within a single cycle of the system
clock. Context switching cannot be undertaken in run-to-completion modules, as
the module’s internal state would not be stored.

The reconfigurable system is divided into two important parts. The part of the

system that is always present is called the static region, and can include a


memory controller, a soft CPU or configuration port interface logic. The second

part, which contains run-time reconfigurable modules, is typically provided as one

or more partial regions. Different methods of conducting PR exist, including small

changes in net lists, routing and LUT functions, or even large module replacement

(Koch, 2013).

Style of module placement

PR can be carried out in several ways, and the manner in which the area
reserved for PR is used defines different styles of configuration. One method
is to substitute larger portions of logic, known as modules, at every
reconfiguration; this is termed module-based reconfiguration. The area in which
PR modules are placed can hold a) only one module per reconfigurable region,
b) modules arranged in a one-dimensional fashion or c) modules arranged in a
two-dimensional fashion. Figure 3 shows the partial region and the different
placement styles that can be used in it.

Figure 3 Styles of reconfigurable modules placement. (a) Island style. (b) Slot style. (c)

Grid style (Koch, 2013).

The island style is supported by the Xilinx PR flow. In the "island style",
only one module at a time is present in a PR region, while switching between
modules is handled by the static part of the system. A PR region has to be
able to accommodate every module that the system will need. The design can use
a single island or multiple islands; in the latter case the developer should
bear in mind that the same resources are shared by all of the islands. In the
"slot style", on the other hand, a PR region is divided into slots of equal
size, so the region is not limited to one module as in the "island style".
Because different modules need different numbers of slots, fragmentation can
arise inside the PR region. As a result, replacing modules in the "slot style"
is not as straightforward as in the "island style", where it is only a matter
of choosing between the islands (Koch, 2013).

Module footprint

Interchanging modules between the islands or slots on the device requires the
designer to consider both the resources that the module needs and the way in
which resources are laid out on the FPGA fabric. A PR module has a resource
footprint that must match the resource footprint of the target area of the
FPGA. Therefore, when a module is moved to a new group of slots, those slots
have to fit the module footprint exactly. Permitting module relocation also
raises timing challenges: signal timing can change with the position to which
the module is relocated, so a timing footprint has to be taken into account as
well. Some sections of the FPGA may have longer routing delays because of
hidden features such as the configuration logic.
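
The footprint check described above can be sketched in a few lines of Python. This is only an illustrative model, not part of the design flow used in this project: the column layout `FABRIC`, the resource labels and the function name `valid_positions` are all hypothetical, and the timing footprint is ignored.

```python
# Hypothetical fabric: one label per resource column, left to right.
FABRIC = ["CLB", "CLB", "BRAM", "CLB", "CLB", "DSP",
          "CLB", "CLB", "BRAM", "CLB", "CLB"]


def valid_positions(fabric, footprint):
    """Return every start column at which the module footprint
    matches the fabric column types exactly."""
    width = len(footprint)
    return [start for start in range(len(fabric) - width + 1)
            if fabric[start:start + width] == footprint]


# A module that needs one BRAM column between two CLB columns.
module_footprint = ["CLB", "BRAM", "CLB"]
positions = valid_positions(FABRIC, module_footprint)
print(positions)
```

In this model the module can only be relocated to start columns where the sequence of column types is identical, which is why an irregular fabric limits the number of usable slots.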

Spartan-6 configuration

Configuration frames are the basic units of configuration on the Spartan-6.
The configuration frames of Spartan-6 devices are classified into three types,
each holding the data for a specific part of the device (Xilinx Inc, 2013):
Type 0; Type 1, for the Block RAM; and Type 2, for the IOBs. Configuration is
carried out using three operations offered by the configuration logic, selected
by a 2-bit opcode: "00" NOP, "01" READ and "10" WRITE. A configuration command
is executed when a configuration register is written with data (Xilinx Inc,
2011). Every configuration register is described in the Spartan-6 configuration
user guide (Xilinx Inc, 2015). Configuration data is organised into two kinds
of packets: Type 1 packets, which carry short blocks of 16-bit data words, and
Type 2 packets, which can carry long blocks of 16-bit data words.

Spartan-6 bitstream

To configure a Xilinx device, a bitstream has to be applied to one of its
configuration interfaces. The bitstream, as mentioned before, is an
encapsulation of the configuration data packets. The format of the bitstream in
Spartan-6 devices is as follows (Xilinx Inc, 2015):

• Dummy words: prepare the pipeline of the configuration interface for data.

• Synchronisation words: two 16-bit words used for synchronisation
(0xAA99 and 0x5566).

• Header.

• Configuration body.

• Header2.

• De-synchronisation word: one 16-bit word signalling the end of the
bitstream (0x000D).

During reconfiguration, the header is used to set up the configuration
registers, while the configuration body carries the data that is written to the
configuration frames of the device. Header2 can also be used to set further
configuration registers.
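
As an illustration of the bitstream layout, the following Python sketch locates the synchronisation words in a byte sequence. It is a minimal sketch under stated assumptions: the 16-bit words are assumed to be stored most significant byte first, the dummy words are assumed to be 0xFFFF, and the four zero bytes after the synchronisation words are arbitrary placeholders rather than a real header. The function name `find_sync` is hypothetical.

```python
# Synchronisation words 0xAA99 and 0x5566, assumed big-endian in the file.
SYNC = bytes([0xAA, 0x99, 0x55, 0x66])


def find_sync(bitstream: bytes) -> int:
    """Return the byte offset of the synchronisation words,
    or -1 if the sequence is not present."""
    return bitstream.find(SYNC)


# Four assumed dummy words, the sync words, then placeholder bytes.
example = bytes([0xFF] * 8) + SYNC + bytes(4)
offset = find_sync(example)
print(offset)
```

Everything before the returned offset would be dummy data, and the header packets would start immediately after the four synchronisation bytes.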


Internal Configuration Access Port (ICAP)

During run-time reconfiguration, the system has to write the configuration
data into the configuration cells. On Xilinx devices this means writing data to
the Internal Configuration Access Port (ICAP). ICAP can be considered the
internal version of the SelectMAP port, one of the external configuration ports
on Spartan-6. The following table shows the configuration speeds that can be
achieved.

Bit width    Frequency (MHz)    Configuration speed (Mb/s / MB/s)

8 bit        100                800 / 100

16 bit       100                1600 / 200

Table 1 Configuration speeds achieved with ICAP (Hansen, Koch and Torresen, 2011).
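
The figures in Table 1 follow directly from the port width and the clock frequency, as the short Python sketch below shows. The function names and the 20 000-byte partial bitstream are hypothetical and serve only as an example; one word is assumed to be written on every clock cycle.

```python
def icap_throughput(width_bits, freq_mhz):
    """Return (Mb/s, MB/s) for a port that accepts one word per cycle."""
    mbit_per_s = width_bits * freq_mhz
    return mbit_per_s, mbit_per_s / 8


def config_time_us(bitstream_bytes, width_bits, freq_mhz):
    """Time in microseconds to write a bitstream of the given size.
    bytes / (MB/s) gives microseconds when 1 MB = 10**6 bytes."""
    _, mbyte_per_s = icap_throughput(width_bits, freq_mhz)
    return bitstream_bytes / mbyte_per_s


print(icap_throughput(8, 100))
print(icap_throughput(16, 100))
# Hypothetical 20 000-byte partial bitstream over the 16-bit port.
print(config_time_us(20_000, 16, 100))
```

Under these assumptions the reconfiguration time scales linearly with the size of the partial bitstream, which is why small modules are attractive for custom instructions.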


On Spartan-6 devices (Xilinx Inc, 2015), the ICAP_SPARTAN6 primitive has an
input (I) data port that accepts 8- or 16-bit words of configuration data and an
output (O) port that is used for read-back of configuration data already present
on the device. The primitive is controlled through the write enable (WRITE) and
clock enable (CE) signals, and data is read or written on the rising edge of
the clock (CLK).

Relocation of partial module bitstreams

Module relocation means that the system is able to move modules between
different slots, as opposed to fitting each module to one particular slot in
the PR area. The benefit of module relocation is greater dynamism in module
placement. Challenges such as external fragmentation can be handled with ease
because modules can be moved between slots. In addition, this flexibility makes
placement and module scheduling much easier, because every module fits more
than one slot. There are various methods of implementing module relocation. One
is to create a separate bitstream for every slot in which a module may need to
be placed. An important way of reducing storage in a system that supports
module relocation is to keep position-independent bitstream data separate from
position-dependent data, so that only the position-dependent data must be
stored for every position.
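
The storage saving from separating the two kinds of bitstream data can be estimated with simple arithmetic. The Python sketch below compares the two methods; the byte counts are invented for illustration and the function names are hypothetical.

```python
def storage_full_copies(independent_bytes, dependent_bytes, slots):
    """One complete bitstream is stored for every slot."""
    return (independent_bytes + dependent_bytes) * slots


def storage_split(independent_bytes, dependent_bytes, slots):
    """Position-independent data is stored once; only the
    position-dependent part is stored per slot."""
    return independent_bytes + dependent_bytes * slots


# Invented example: 10 000 bytes of module data, 200 bytes of
# position-dependent data, four candidate slots.
full = storage_full_copies(10_000, 200, 4)
split = storage_split(10_000, 200, 4)
print(full, split)
```

The saving grows with the number of slots, since the large position-independent part is never duplicated.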

2.2 Microprocessor Architecture

2.2.1 RISC Microprocessor

RISC, or Reduced Instruction Set Computer, is a type of microprocessor
architecture whose instruction set consists of small, simple instructions of
equal size, with the aim of making the whole architecture faster by executing
each instruction within one cycle. Moreover, RISC CPUs make less use of memory,
because they are designed with a larger number of registers and only two kinds
of dedicated instruction, load and store, are allowed to access memory. In
contrast, CISC, or Complex Instruction Set Computer, can access memory from
many different instructions. Examples of well-known RISC processors that are
used widely in hardware devices around the world are DEC Alpha, AMD Am29000,
ARC, ARM, Atmel AVR, Blackfin, Intel i860 and i960, MIPS, Motorola 88000,
PA-RISC, Power (including PowerPC), RISC-V, SuperH and SPARC.

2.2.2 Soft-Core Microprocessor

Soft-core processors are implemented entirely through logic synthesis and can
be mapped onto different semiconductor devices that contain programmable logic.
Many soft-core processors have been targeted at FPGA implementation. A typical
soft-core CPU includes an instruction set, a register file, an arithmetic logic
unit and other optional features. The performance of a soft-core CPU
implemented on an FPGA is generally lower than that of the same design
implemented as an ASIC, because the FPGA carries the overhead of its
reprogrammable fabric, which is not found in the ASIC architecture. However, a
soft-core CPU can be improved if a problem with the design is found, which is
one of the advantages of FPGA technology over ASIC technology. For example, a
new performance requirement of the CPU can be met by adjusting the parameters
of the system on the FPGA.

As mentioned above, there are many types of soft-core CPUs and corresponding

development tools. Some popular soft-core CPUs are the Xilinx MicroBlaze, the
Altera Nios/Nios II and the LatticeMico32. These CPUs are supplied with logic
and memory elements and a range of intellectual property peripherals, which are
required for the rapid development of a System-on-Programmable-Chip.

A number of the soft-core processors that have been developed using FPGA

technology are discussed below, and their functional details and performance

provided (Levy and Conte, 2009).

MicroBlaze soft-core processor

One of the most popular soft-core processors is the MicroBlaze from the FPGA
vendor Xilinx. It has a 32-bit Reduced Instruction Set Computer (RISC)
architecture and can be customised with a number of memory and peripheral
configurations. It has three pipeline stages with variable-length instruction
latencies. The Xilinx Platform Studio software, which provides a user-friendly
environment for generating a MicroBlaze system, can be used in the design
process. The processor adopts the Harvard


memory architecture which consists of two local memory busses: one that is used

to connect the data memories; the other that is used for the instructions. The

number and size of memory peripherals can be selected by the user. The

processor is capable of operating at up to 200 MHz in Virtex-4 devices (Le Gal

and Jego, 2013).

NIOS II Soft-core processor

This soft-core CPU has a load-store RISC architecture. The processor has many
architectural parameters that can be configured easily at design time. For
example, the user can choose between a 32-bit and a 16-bit datapath width and
can select the cache size and register file size. Custom instructions allow the
user to customise the hardware, which can be used to accelerate the CPU.
Off-the-shelf intellectual property is readily integrated, which reduces the
time required to design and set up a SoC (Microelectronics International,
2012).

LatticeMico32 soft-core processor

This is another example of a soft-core processor, but one that is in many ways
unlike the two discussed above. Although it employs a RISC architecture like
them, it is completely open. Additionally, it uses fewer LUTs on the FPGA,
which makes it cheaper than the others, and it is easy to configure with the
options required by an application (Chu, 2008).

2.2.3 MIPS Architecture

MIPS Overview

In this project, the MIPS architecture, shown in Figure 15 below, is used as a
demonstrator for implementing custom instructions in hardware. It is used to
implement a 32-bit embedded system. Moreover, it is an example of a RISC
architecture, is one of the most widely supported processors and has been used
in research on efficient processor organisations that deliver both high
performance and high power efficiency.

The original MIPS architecture consists of the following functional blocks:


• Instruction decoder: decodes the MIPS instructions, which is simple because
all instructions are the same size and use only three different formats.

• Programme Counter (PC): holds the address of the instruction currently being
executed and is incremented by 4 to give the address of the next instruction.
For a branch or jump instruction a delayed branch occurs, which means that one
more instruction is executed before the PC is updated with the target given by
the branch or jump instruction.

• Arithmetic Logic Unit (ALU): a fundamental block of the CPU that performs
arithmetic and logical operations on its operands, which are the data inputs on
which the ALU operates, from register to register, memory to register or vice
versa.

• Registers: the MIPS processor has 32 general purpose registers, including
register 0, which holds a constant zero. The other registers are used by the
compiler as outlined in the "MIPS32® Instruction Set Quick Reference".

• Memory: accessed only via load and store instructions.

Pipeline registers are placed between the functional blocks so that the
processor can run at high clock speeds with minimal delay per stage. The MIPS
processor was designed to use pipelining to improve throughput and performance.
It has a 5-stage pipeline: Instruction Fetch, Instruction Decode, Execute,
Memory access and Register write back.
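
The throughput benefit of the 5-stage pipeline can be illustrated with an idealised cycle count. The Python sketch below assumes that every stage takes one cycle and that there are no stalls or hazards, which is a simplification; the function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipelined_cycles(instructions):
    """Ideal pipeline: the first instruction needs one cycle per stage,
    every further instruction completes one cycle later."""
    if instructions == 0:
        return 0
    return len(STAGES) + instructions - 1


def unpipelined_cycles(instructions):
    """Without pipelining, each instruction passes through all stages
    before the next one starts."""
    return len(STAGES) * instructions


print(pipelined_cycles(100), unpipelined_cycles(100))
```

For long instruction sequences the ideal speed-up approaches the number of stages, although real programs fall short of this because of hazards and branch delays.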

MIPS Instruction Set

The MIPS instruction set is divided into three core groups of instructions. Each

one of them has its own encoding, as illustrated in the following table.

Instruction type   BITS 31-26   25-21   20-16   15-11   10-6    5-0

R-type             opcode       rs      rt      rd      shamt   funct

I-type             opcode       rs      rt      immediate (bits 15-0)

J-type             opcode       address (bits 25-0)

Table 2 Type of MIPS instructions (Fritzell, 2013).

Table 2 shows that each type has a 6-bit main opcode that the decoder uses to
determine the instruction, while the fields rs, rt and rd are register
addresses in the register file. The three types are used as follows:

• R-type instructions are arithmetic instructions that take two operands from
the register file, rs and rt, and return the result of the operation to
the register rd. R-type instructions can share one opcode, in which case
the funct code determines the operation.

• I-type instructions include the load/store instructions. They combine a
register, rs, with a constant value coded as the immediate, and the result
is returned to the register rt. I-type instructions are also used for
branches, where the immediate is added to the current PC to perform the
branch.

• J-type instructions are jump instructions that provide a new address for the
programme counter, which moves execution to a new code block.
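
The field layout in Table 2 can be made concrete with a small decoder written in Python. This is a simplified sketch and not the decoder implemented in this project: it treats opcode 0 as R-type, opcodes 2 and 3 (j and jal) as J-type and every other opcode as I-type, which ignores coprocessor and other special encodings. The function name `decode` is hypothetical.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into the fields of Table 2."""
    opcode = (word >> 26) & 0x3F
    if opcode == 0x00:
        # R-type: the operation is selected by the funct field.
        return {"type": "R", "opcode": opcode,
                "rs": (word >> 21) & 0x1F, "rt": (word >> 16) & 0x1F,
                "rd": (word >> 11) & 0x1F, "shamt": (word >> 6) & 0x1F,
                "funct": word & 0x3F}
    if opcode in (0x02, 0x03):
        # J-type: j and jal carry a 26-bit address field.
        return {"type": "J", "opcode": opcode, "address": word & 0x03FFFFFF}
    # Simplification: all remaining opcodes are treated as I-type.
    return {"type": "I", "opcode": opcode,
            "rs": (word >> 21) & 0x1F, "rt": (word >> 16) & 0x1F,
            "immediate": word & 0xFFFF}


add_fields = decode(0x012A4020)   # add $t0, $t1, $t2
lw_fields = decode(0x8D280004)    # lw  $t0, 4($t1)
j_fields = decode(0x08000010)     # j   with address field 0x10
print(add_fields, lw_fields, j_fields, sep="\n")
```

A custom instruction would be recognised at the same point, by matching an opcode or funct value that the original ISA leaves unused.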

2.3 Reconfigurable CPU Instruction Set Extensions

Many different applications can be handled using only general purpose
processors (GPPs). However, most applications use only a small subset of the
instructions available in the GPP. Adding a small amount of dedicated hardware
for a given application can therefore give a large improvement in execution
time. A compression algorithm, for example, may need to count the number of
one-bits in a vector. Adding a dedicated hardware instruction for this
operation speeds up the algorithm.
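
To make the one-bit counting example concrete, the Python sketch below shows the kind of loop that software has to execute when no dedicated instruction exists. It is only an illustration; the function name `popcount_software` is hypothetical and the 32-bit width is an assumption that matches the MIPS word size.

```python
def popcount_software(value, width=32):
    """Count the one-bits of a word by testing each bit in turn,
    as a sequence of standard shift, mask and add instructions would."""
    count = 0
    for bit in range(width):
        count += (value >> bit) & 1
    return count


print(popcount_software(0b1011))
print(popcount_software(0xFFFFFFFF))
```

The loop performs one shift, one mask and one addition per bit, whereas a custom hardware instruction could deliver the same result as a single operation.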

Extending the instruction set of a CPU is one way to do this, as it allows
hardware acceleration of small parts of an application. The MicroBlaze and Nios
soft-core CPUs from Xilinx and Altera are good examples of CPUs that allow
custom instructions while keeping the benefits of a fast RISC machine. The next
section highlights the main points regarding custom instructions.


2.3.1 Custom Instructions in Hardware

Custom instructions enable a designer to replace a complex sequence of
standard instructions with a single, simpler instruction built in hardware.
Figure 4 gives a simple description of implementing such a custom instruction
in a MIPS CPU so that it can access the register file in the same way as the
ALU.

Figure 4 a) a typical CPU b) extensions CPU with Reconfigurable Instructions (Koch, 2013).

Figure 4 shows that the CPU can be extended with exchangeable instructions by
decoding opcodes that are unused in the original CPU ISA. A multiplexer then
selects between the normal ALU result and the result of one or more
user-defined instructions. In this way the configurable instruction is
integrated into the CPU (Koch, 2013).

The custom instruction logic block has two input ports and one result output,
as shown in Figure 4. Custom instructions often operate in a single clock
cycle. However, a multi-cycle operation can be considered for longer
combinatorial paths. Through the use of custom instructions, it becomes
possible to tailor the processor core to a certain application.
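
The arrangement of the ALU, the custom instruction block and the result multiplexer can be modelled behaviourally in Python. This is a sketch under stated assumptions rather than the hardware description used in this project: the ALU supports only three illustrative operations, the custom block is assumed to be a one-bit counter that ignores its second operand, and all names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def alu(op, a, b):
    """A tiny stand-in for the normal ALU (three operations only)."""
    if op == "add":
        return (a + b) & MASK32
    if op == "and":
        return a & b
    if op == "or":
        return a | b
    raise ValueError(f"unsupported ALU operation: {op}")


def custom_popcount(a, b):
    """Example custom block: two inputs, one result (b is unused)."""
    return bin(a & MASK32).count("1")


def execute(use_custom, alu_op, a, b, custom=custom_popcount):
    """Both units see the same operands; the multiplexer picks the result."""
    alu_result = alu(alu_op, a, b)
    custom_result = custom(a, b)
    return custom_result if use_custom else alu_result


print(execute(False, "add", 3, 4))
print(execute(True, "add", 0b1011, 0))
```

The `use_custom` flag plays the role of the multiplexer select signal that the decoder would raise for an otherwise unused opcode.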

One way to emulate the configuration of instructions is to add a multiplexer
together with large reconfigurable accelerator modules, which can be placed
outside the CPU on the system bus. However, this approach involves an
additional cost. Another way to configure such a custom instruction in hardware
is to use run-time partial reconfiguration. The custom instruction can be
placed in small slots/islands close to the MIPS CPU, which may cause routing
congestion because a large number of signals have to enter a small area. FPGA
design flow tools such as PlanAhead, OpenPR and GoAhead support this. Such a
design flow implements the interface between the static system, which includes
the MIPS CPU, and the partial system that holds the custom instructions, so
that the two can communicate. The interface is built using bus macros, proxy
logic or directly mapped wires, the techniques provided by PlanAhead, OpenPR
and GoAhead respectively.

Fritzell (2013), who proposed a fast dynamic partial reconfiguration system
using GoAhead, argued that with a high number of signals and small
islands/slots, design flows based on bus macros or proxy logic do not give good
results because of the communication overhead. He showed that with GoAhead and
the direct wire approach, custom instructions can be implemented very
efficiently in small islands/slots and that the modules can also be relocated.
The benefits of allowing a custom instruction to be relocated to more than one
slot are flexible slot utilisation, reduced external fragmentation and the
removal of unnecessary reconfiguration calls, as mentioned by Koch et al.
(2010). As a result, the processor needs a look-up table that stores the
location of the slot holding each custom instruction, so that the decoder knows
from which custom instruction slot the result should be routed (Fritzell, 2013).
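
A look-up table of this kind can be sketched in Python as follows. The class `SlotTable`, its methods and the instruction names are hypothetical, and the replacement policy (overwrite slot 0 when no slot is free) is an arbitrary assumption made only to keep the example short.

```python
class SlotTable:
    """Maps each configured custom instruction to the slot that holds it."""

    def __init__(self, slot_count):
        self.slots = [None] * slot_count   # None marks an empty slot

    def lookup(self, instruction):
        """Return the slot index holding the instruction, or None."""
        if instruction in self.slots:
            return self.slots.index(instruction)
        return None

    def ensure(self, instruction):
        """Return (slot, reconfigured). A reconfiguration is only
        requested when the instruction is not already in some slot."""
        slot = self.lookup(instruction)
        if slot is not None:
            return slot, False
        free = self.lookup(None)
        if free is None:
            free = 0                        # assumed policy: reuse slot 0
        self.slots[free] = instruction
        return free, True


table = SlotTable(2)
first = table.ensure("popcnt")
second = table.ensure("popcnt")
third = table.ensure("crc")
print(first, second, third)
```

The second call returns without requesting a reconfiguration, which corresponds to the removal of unnecessary reconfiguration calls mentioned above.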

2.3.2 Custom Instructions in Software

Custom modules can be reconfigured either by run-time partial reconfiguration
or by a multiplexer that emulates the configuration process, as mentioned
above, and the reconfiguration time can be the biggest overhead. There are two
fundamental options for triggering the configuration process:

Explicit approach: the custom instruction is loaded at execution time by the
user or by the program, before the processor needs it. Hauck (1998) proposed
this method in the form of configuration pre-fetch instructions that are issued
before the custom instruction is called. It can be fast. However, the time that
the reconfiguration of the custom instruction takes depends on the speed of the
configuration controller and the size of the bitstream. Consequently, the
processor must be stalled if the configuration of the custom instruction has
not finished before the processor calls it.


Implicit approach: an exception trap is triggered when the processor detects
that the custom instruction is not in hardware. The trap handler then handles
the configuration of the custom instruction that the processor needs.
Alternatively, the trap handler can run a program in which a software function
is executed in place of the custom hardware while it is not configured (Lynch,
Forin and Pittman, 2006). This approach can remove a lot of overhead by not
stalling the CPU while the configuration is in progress. However, handling the
trap itself takes time.
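
The difference between the two approaches can be illustrated with a small behavioural model in Python. It is a sketch under stated assumptions: time is counted in abstract cycles, the cycle values are invented, and the function names are hypothetical.

```python
def explicit_stall(prefetch_cycle, config_cycles, call_cycle):
    """Explicit approach: cycles the processor must stall if the
    pre-fetched configuration is not finished when the call arrives."""
    ready_cycle = prefetch_cycle + config_cycles
    return max(0, ready_cycle - call_cycle)


def implicit_execute(configured, hardware_fn, software_fn, *operands):
    """Implicit approach: use the hardware if it is configured,
    otherwise trap to an equivalent software function."""
    if configured:
        return hardware_fn(*operands), "hardware"
    return software_fn(*operands), "software"


def count_ones(value):
    return bin(value).count("1")


# Pre-fetch issued at cycle 0, configuration takes 1000 cycles.
print(explicit_stall(0, 1000, 1500))   # called late enough
print(explicit_stall(0, 1000, 400))    # called too early
print(implicit_execute(False, count_ones, count_ones, 0b1011))
```

In the explicit model the stall disappears when the pre-fetch is issued early enough, while the implicit model always returns a result but pays for the trap and the slower software path.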

2.4 Design Considerations

The development of a customisable CPU on FPGAs requires the consideration of

critical system factors in order to attain the desired performance. Some of the

critical objectives that are normally taken into consideration include the speed of

the CPU, the memory, the power required and the speed with which the CPU can

access other components of the system. There is usually a trade-off between the

performance and the power required to attain such performance (Kulkarni, 2006).

The additional design considerations for a customisable configuration include
the architecture of the processor and its suitability for the targeted
application. This implies that the designer has to take into consideration the
size and type of the memory and the peripheral bus. In addition, the designer
has to decide on the model and size of the address space available to the CPU
and on the size and type of the instruction and data caches. It is also
important to consider the type of controllers used in the architecture.
Optional accelerators may be used to speed up the CPU (Deschamps, Sutter
and Canto, 2012).

It should also be mentioned that the operating system and the design and

development tools are part of the considerations that will have to be evaluated by

the designer. The biggest advantage of implementing the soft-core CPU using

FPGA lies in the fact that in the case of any mistake being committed during the

development phase, there is the possibility of repeating the process to reconfigure

the parameters afresh. There are no limits to the number of times the processor

can be reconfigured. This provides designers with a degree of design flexibility

(Kozyrakis and Patterson, 2004).


The designer will have to take into consideration the development and design

tools that will be used to develop the soft-core. The following figure provides an

illustration of the design and development tools. The design and development

tools are considered to be responsible for the parameterisation of the soft-core

and also the associated implementation of the peripherals (Kilts, 2007).

Figure 5 Design and development tools (Minev and Kukenska, 2007).

FPGAs allow extensive customisation alternatives that are not found in other
platforms such as ASICs. Additionally, FPGA platforms offer optimisation
techniques that help a designer to reach the required performance metrics
faster (Gebotys, 2002). The benefits of using an FPGA platform for customising
soft-core CPUs have also been reviewed. The development of a customisable CPU
on an FPGA requires the consideration of critical system factors in order to
attain the desired performance (Gebotys, 2012).

Evaluation of the design and development tools will help the designer to easily

and quickly attain the design requirements. It should also be noted that the wrong

choice of design and development tools can lead to system inefficiencies. The

design and development tools are considered to be responsible for the


parameterisation of the soft-core and also the associated implementation of the

peripherals (Synopsys, 2010).

2.5 Previous Work

Related work that is relevant to this project can be categorized into two parts:

instruction set extension and partial reconfiguration.

2.5.1 Instruction Set Extension

An example study regarding instruction set extension is that of Altera (2011). This

study demonstrates the ability to extend the NIOS-II CPU with custom instructions

using the SOPC builder wizard of the Quartus design tool. Integrating custom

instructions with a soft-core instruction set is a feasible way of speeding up

application execution in specific domains such as cryptography (MAJZOUB and

DIAB, 2007). Some of the issues involved in the customisation of an instruction set

were analysed in detail by Galuzzi and Bertels (2011), who provided a

comprehensive overview of instruction-set extensions.

2.5.2 Partial Reconfiguration

A fair amount of literature has been published on partial run-time
reconfiguration of soft-core CPUs on FPGAs. These studies have shown that PR
reduces the size, weight, power and cost of an FPGA system. The use of design
techniques to increase the performance and resource utilisation of
reconfigurable soft CPUs was studied by Wold et al. (2012). They investigated
which instruction implementation technique allows a soft CPU to achieve a
performance improvement while at the same time reducing the resource
requirement. This is a different task from that of this project, but a closely
related one. Their goal was to improve soft CPUs for FPGAs using partial
reconfiguration. For example, they presented a classification method that
determines the parameters for selecting the most suitable instruction based on
profiling. The optimisation techniques used in their methodology are
Instruction Set Extensions, Software Emulation, Reconfigurable Instructions and
ISA Subsetting.

Reconfigurable instructions can have a critical side effect in terms of
configuration time. For example, stalling programme execution while waiting
for the reconfiguration process to complete causes an overhead (Wold et al.,
2012).

Another study, by Koch, Beckhoff and Torresen (2010), presented an approach to
reduce this overhead. They examined the problem that arises when communication
needs extra logic, or when the placement of reconfigurable modules has to be
restricted with respect to the static system, both of which cause additional
logic overhead. They introduced a novel tool called ReCoBus-Builder. In a case
study, modules of different sizes and latencies were integrated with soft CPUs
by partial run-time reconfiguration without causing any logic overhead. In this
project, the newer tool GoAhead, which is a full re-implementation of
ReCoBus-Builder, will be used. This study, however, will provide a library of
dynamic instruction set extensions.

http://www12.informatik.uni-erlangen.de/research/recobus/


Chapter 3

3 System Design and Methodology

This chapter presents the methodology that has been adopted in this project, the

implementation tools and the system design.

3.1 System Development Methodology

Designing and developing an effective customizable soft-core processor is a
challenging task, especially with little experience in processor and system
design. Therefore, a system development lifecycle method and a step-by-step
design approach are appropriate. Such an approach progressively builds the
researcher’s experience in this important field of computer engineering and in
developing an effective system using partial reconfiguration.

Figure 6: The general approach of the system development stages (Soft, 2013).

Figure 6 shows the general lifecycle stages that were used in this project in
order to develop a processor. The requirement analysis stage has already been
introduced in the objectives section of the Introduction chapter on page 13.
The design and implementation stages used a step-by-step design and
implementation method (Elkateeb, 2011), as shown in Figure 7, which is
discussed below in this section. The testing and evolution stages are
introduced in Chapter 4 and use an approach appropriate for the design and
evaluation of FPGA embedded processors (Fletcher, 2005), such as comparing the
system against a software implementation, a benchmark system and other
real-world systems. Finally, some techniques for optimizing the performance and
cost of an FPGA MIPS processor system will be discussed.

Figure 7: A step-by-step design and implementation method.

With such a step-by-step design and implementation method, the customizable soft-core processor is built by gradually integrating the processor module with other system modules, and by developing further modules, to arrive at the final customizable soft-core MIPS processor design with the help of partial reconfiguration. Each of the steps is briefly described below.

First step: MIPS CPU: The soft-core is the brain of the system. A MIPS CPU has been implemented in one module, with an XOR gate in the top level so that it can be synthesised, as shown in Figure 8. The XOR gate is needed because the MIPS CPU uses more interface wires than there are I/O pins available on the FPGA board. By XORing some of the CPU outputs, the CPU could be synthesised for test purposes (e.g. for determining clock frequency and resource utilisation). The encoding and implementation of the MIPS instructions were tested using a test bench in the Xilinx ISE package, as illustrated in the testing section in chapter 5.

Figure 8: First step, system overview.

Second step: Custom instruction in software: A GCC cross compiler is used to compile the C code for the MIPS. This compiler is modified to include the custom instructions by assigning them to unused opcodes. The instruction decoder then uses these opcodes to select the custom instructions from the binary code. The compiler was installed in a virtual machine running on the Windows operating system.

Third step: One custom instruction in hardware: A custom module to be connected with the MIPS is chosen, and a “Counting One” custom module is added as a component in the MIPS CPU. The MIPS detects the custom instruction and returns the result from the custom module. Moreover, the MIPS CPU module is connected with other modules such as ROM, RAM and GPIO via the system bus.

Figure 9 Third Step, system overview.

Fourth step: Custom Instructions library in hardware: Four custom modules

are implemented. In addition, a Trap handler that is based on a multiplexer (MUX)

is developed (Appendix B) in order to choose one custom instruction, the one that

is called by the MIPS CPU. This approach has overhead logic costs as shown in

the result in chapter 5.

Figure 10: Fourth step, system overview.

Fifth step: Reconfiguration Custom instruction: There are different methods

for implementing reconfigurable custom modules in hardware as already

mentioned in the background chapter. In this project the following approaches

have been implemented.

First approach step: Improving the Trap handler: The Trap handler based on a MUX is improved to handle the configuration process. In this approach, the trap handler is based on ICAP (Appendix C). It is implemented as a state machine that includes a table holding the addresses of the configuration bitstreams for the different custom instructions, as will be introduced later in section 4.3, and it then uses the ICAP primitive to load the bitstreams into the device. We exploit the fact that many academic boards come with serial SPI memory that is often not used. The MultiBoot feature is applied in this project; this allows the FPGA to load one of several configuration revisions. Spartan-6 FPGAs support MultiBoot in two configuration modes: BPI and SPI. The functionality of this feature is described in detail in [Spartan-6 FPGA Configuration User Guide]. The iMPACT tool is used to supply the starting address for each configuration revision in order to generate the MultiBoot SPI file (Xilinx Inc, 2015). The SPI PROM stores the configuration bitstreams for the different custom modules. Consequently, when a custom instruction is needed by the MIPS CPU, the trap handler checks whether that custom instruction is already configured; if it is not, a different bitstream is loaded from the attached external memory (SPI PROM) into the FPGA. As a result, the FPGA is reconfigured with a different configuration bitstream. The testbench in the test section in chapter 5 shows the functionality of this module.
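The address table kept by the trap handler can be pictured with a small C model. This is only a sketch under stated assumptions: the start addresses, the table size and the names `bitstream_table`, `lookup_bitstream` and `NO_BITSTREAM` are hypothetical and are not taken from the project's VHDL in Appendix C.

```c
#include <stdint.h>

/* Hypothetical values: four custom instructions, each with an assumed
 * start address of its configuration bitstream in the SPI PROM. */
#define NUM_CUSTOM   4u
#define NO_BITSTREAM 0xFFFFFFFFu

static const uint32_t bitstream_table[NUM_CUSTOM] = {
    0x010000u, 0x070000u, 0x0D0000u, 0x130000u
};

/* Return the assumed SPI start address for a custom instruction index,
 * or NO_BITSTREAM when the index is outside the table. */
uint32_t lookup_bitstream(unsigned custom_index)
{
    if (custom_index >= NUM_CUSTOM) {
        return NO_BITSTREAM;
    }
    return bitstream_table[custom_index];
}
```

In hardware the same idea would be a small ROM indexed by the custom instruction, whose output is passed on as the address from which the bitstream is loaded.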

The whole process works with full reconfiguration of both the MIPS and the extension. However, reconfiguration only makes sense with partially reconfigurable custom instructions, because rebooting the whole system each time a different custom instruction is called is not a good idea. Therefore, a different approach investigates whether the MultiBoot feature can be used for partial reconfiguration.

Figure 11 Fifth step: system overview of the first approach.

Second approach step: Exploiting the MultiBoot feature for partial reconfiguration. As stated in Xilinx’s Partial Reconfiguration User Guide (2012), PR is a technique for modifying the operation of the FPGA by loading a different bitstream while it is performing its normal operation. In this technique the whole design is translated into different bitstreams or files, each of which defines a separate function and is loaded when required. Application-Specific Integrated Circuits (ASICs) are fabricated in the fab and are designed to perform a fixed function. FPGAs, on the other hand, offer the flexibility of being reprogrammed, and most modern FPGAs can be programmed on site. In PR, the operation of the FPGA is modified by programming a partial bitstream (also called a partial bit file), which defines the operation of a subset of the programmable blocks, so the whole FPGA fabric is not reprogrammed. In such a scenario, a full bit file is first programmed into the FPGA, which defines the operation of the whole FPGA. Afterwards, depending on the requirements of the operation, a partial bit file can be downloaded to modify the reconfigurable parts of the FPGA while the other parts continue to operate without being affected. The conceptual diagram of the partially reconfigurable system is shown below in Figure 12.

Figure 12 Fifth step: system overview of the second approach (Xilinx, 2012).

It can be seen that there is a Reconfigurable Block A in the system, which can be

loaded with one of the possible configurations defined by several BIT files, A1.bit,

A2.bit, A3.bit, and A4.bit. The logic in the FPGA design is divided into two different

regions: reconfigurable region and static region. The dark area of the FPGA block

represents reconfigurable regions and the lighter area shows the static region.

The functionality of the reconfigurable region is defined by the partial bit files and

can be re-programmed by loading one of the partial configurations, while the static

region continues to perform its operation and is not affected by the reprogramming

of the reconfigurable region.

The method of Partial reconfiguration offers several advantages, which include:

– This approach helps to reduce the area or size of the FPGA device required to

implement a given function, which means fewer logic blocks are consumed;

hence, as a result, it also reduces the cost and power consumption of the

device.

– This approach helps to implement and test multiple algorithms or methods for a specific function. In such a case, the implementations can be loaded in turn and compared against each other.

– This technique enhances the design security as specific user dependent

keywords or codes can be included into the reconfigurable region and

reprogrammed by the end user.

– This approach enhances the fault tolerance in the FPGA design, where any

malfunctioning regions or parts can be reprogrammed by the user and can be

debugged.

– This approach enables the designer to divide the complete design into multiple

regions or blocks, and these blocks can be added to the FPGA design

incrementally; hence, it speeds up the FPGA design and verification process.

In our partially reconfigurable system, there is a partial reconfiguration controller implemented in the static region. This controller retrieves the partial bitstreams from any memory connected to the FPGA and forwards them to a configuration port. There are two possibilities for the partial reconfiguration controller: it is implemented either in an external device, such as a separate processor, or in the static region of the FPGA design. When the controller is located inside the static region of the FPGA, the partial bit files are loaded using the ICAP interface. Like the other logic in the static region, the controller logic functions without being affected by the programming of partial bit files.

The fundamentals and concepts of partial reconfiguration for any system design are discussed above. However, nothing in the documentation provides information on using the ICAP primitive to send the command sequence for loading configuration bitstreams with the MultiBoot feature for partial reconfiguration. From this point, partial reconfiguration is applied. The code is changed to include a black box that represents the custom instruction wrapper, so that bottom-up synthesis can be performed later, which is an important concept when implementing partial reconfiguration. Figure 14 in section 3.3 illustrates this approach.

3.2 Implementation Tools

3.2.1 Hardware Description Language

The circuit for an FPGA is developed using a Hardware Description Language (HDL). The two most popular hardware description languages used for FPGAs are Verilog and VHDL. Hardware description languages are used to design circuits and to capture the complexity of large circuits, and they can significantly increase the productivity of the design process (Wold, et al., 2012). In short, a hardware description language can be compared to an imperative programming language. However, there are many fundamental differences between the two kinds of language. Normal programming languages are used to create programmes that are executed by microprocessors, whereas hardware description languages are designed to produce hardware circuits. They are capable of describing circuit hierarchy and connectivity, providing a built-in mechanism for simulating circuit behaviour in software, and expressing the inherent parallelism of separate circuit components (Hauck and Wilson, 1999).

3.2.2 Xilinx ISE (Xilinx, 2013):

This is an Integrated Synthesis Environment software tool provided by the FPGA vendor Xilinx. It is used for the synthesis and analysis of HDL designs and enables a designer to compile HDL designs (such as VHDL and Verilog files), to perform timing analysis, to view the RTL schematic, to simulate a behavioural model, and to generate bitstreams that configure the target FPGA device.

VHDL, like other hardware description languages, supports different levels of abstraction. The commonly applied abstraction levels include behavioural and structural modelling. A module encapsulates a circuit by defining its interface, so that the circuit can communicate with the outside world through its input/output ports. Modules are comparable to classes in object-oriented programming: they are normally defined once and then instantiated several times. Different instantiations of the modules operate simultaneously, and they can be connected, mapped and routed using the signals that link their inputs and outputs.

ISim simulator: Hardware description languages normally come with simulation features, which provide an insight into how the circuit will function when fabricated. This helps to reduce the risks and costs associated with real fabrication processes. Simulation is normally considered crucial in the design and implementation of hardware circuits, because it is both economical and practical. Different levels of granularity are supported for the simulation of a circuit. The initial stage of simulation seeks to determine the behavioural correctness of the circuit. In this case, an appropriate testbench is generated and applied to the circuit. The expected results of such a testbench are already known before the simulation. The simulation results obtained are compared with the expected results, and the comparison is used to assess the correctness of the designed circuit.

http://en.wikipedia.org/wiki/Bitstream

3.2.3 Cross compiler:

A cross compiler is a compiler which generates code that can be run on a different

system, for example, compiling C code for MIPS architecture (Gnu.org, 2015). For

this project, a GNU cross compiler will be adapted to use reconfigurable

instructions through inline assembly calls.

3.2.4 FPGA Platform:

A Nexys3 digital circuit development platform, which is based on the Xilinx Spartan-6 LX16 FPGA, was used and is shown in Figure 13. The Spartan-6 FPGA is used for implementing the reconfigurable ISA extension. It provides high performance at low resource cost and includes the following features (Digilent, 2013):

– 2,278 slices, each containing four 6-input LUTs and eight flip-flops

– 576 Kbits of fast block RAM

– two clock tiles (four DCMs and two PLLs)

– 32 DSP slices

– 500 MHz+ clock speeds

Figure 13 Xilinx Spartan-6 LX16 FPGA platform (Nexys3™ Board Reference Manual, 2013).

http://en.wikipedia.org/wiki/Compiler

3.2.5 GoAhead

GoAhead is a tool for implementing partially reconfigurable systems. It supports all of the recent Xilinx FPGAs and provides some features that the Xilinx PR tool chain cannot offer, including (Beckhoff, et al., 2012):

– Partial modules can be implemented completely independently of the static design.

– Modules can be relocated, and multiple instances of a module can be instantiated.

– Modules can be integrated without any logic overhead (no bus macro or proxy logic required).

– Hierarchical reconfiguration is supported, which allows a PR module to be implemented inside another PR module.

– Communication architecture generation enables multiple PR modules to be hosted simultaneously in the same PR region.

3.3 System Design

In this project, focus is put on embedded systems that have different requirements

in various application domains such as cryptography, network control systems and

image processing. This is due to the fact that an FPGA platform is the most

suitable device to adapt to changes in application requirements (Koch D, 2013).

There are four custom instructions that have been considered as extensions for

the MIPS processor and they are described below:

I. Count ones: Counts the number of ones in a 32-bit vector.

Counting the set bits in a vector is a common operation, called the Hamming weight, which is used in the cryptography and network domains. For example, in a Hamming distance algorithm, the number of bit errors between two binary numbers is obtained by XORing them and then counting the ones in the result (Schiller, 2003).
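As a software reference for what the count-ones instruction computes, the following C sketch counts the set bits of a 32-bit word and uses that count to obtain a Hamming distance. The function names `count_ones` and `hamming_distance` are illustrative only; the hardware module in this project is written in VHDL and may be structured differently.

```c
#include <stdint.h>

/* Hamming weight: number of set bits in a 32-bit word. */
unsigned count_ones(uint32_t v)
{
    unsigned n = 0;
    while (v != 0) {
        v &= v - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Hamming distance: XOR marks the differing bit positions,
 * and the count of ones in that result is the number of bit errors. */
unsigned hamming_distance(uint32_t a, uint32_t b)
{
    return count_ones(a ^ b);
}
```

For instance, the words 1011 and 1001 differ in a single bit position, so their Hamming distance is 1.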

II. 32-bit CRC: Takes two 32-bit operands and computes a CRC.

A cyclic redundancy check (CRC) is one of the most popular error detection methods used in networks and in storage systems. It is very useful for detecting errors caused by noise in the transmission channel of a network. For example, the transmitter and the receiver use the same number (the generator polynomial) to detect an error. The CRC calculation is done on both sides, and the result at the receiver should be zero if there is no error. The CRC can be computed sequentially with a shift register and XOR gates, or in parallel with XOR gates only (Schiller, 2003).
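The sequential shift-register-and-XOR formulation can be sketched in C as follows. This is an illustration under assumptions: it uses the widely deployed reflected CRC-32 polynomial 0xEDB88320 with an all-ones initial value and a final inversion, whereas the text does not state which polynomial the custom module uses or how its two 32-bit operands are combined. The name `crc32_bitwise` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-serial CRC-32 (assumed parameters: reflected polynomial
 * 0xEDB88320, initial value 0xFFFFFFFF, final inversion). */
uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the bit shifted out is 1 */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}
```

Each inner-loop iteration corresponds to one clock of a serial shift register; a parallel hardware implementation unrolls these iterations into a single network of XOR gates.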

III. Leading zero: Counts the zero bits before the first one bit, starting from the MSB of a 32-bit vector.

This computes the number of zero bits in the most significant bits (MSB) of the vector that precede the first one bit. Leading zeros are often used in electronic digital display devices, such as seven-segment displays, for the ascending ordering of numbers, or for preventing fraud in financial documents (Miller, 2004).
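A minimal C sketch of a leading-zero count is given below, assuming the instruction returns the number of zero bits above the most significant one bit and 32 for an all-zero input. The name `count_leading_zeros` and the all-zero convention are assumptions, not details confirmed by the text.

```c
#include <stdint.h>

/* Count the zero bits above the most significant set bit.
 * Assumed convention: an all-zero word yields 32. */
unsigned count_leading_zeros(uint32_t v)
{
    unsigned n = 0;
    if (v == 0) {
        return 32;
    }
    while ((v & 0x80000000u) == 0) {
        v <<= 1;   /* move the next bit into the MSB position */
        n++;
    }
    return n;
}
```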

IV. Parity: Counts the ones in a 32-bit vector to generate the parity bit.

This is one of the simplest and most popular error detection methods. It can be regarded as a special case of CRC, when a 1-bit CRC is considered, or it can be used together with other methods, such as the Hamming weight for calculating the Hamming distance mentioned above, because it needs only a number of XOR gates. As a result, the output vector includes a parity bit, generated with XOR gates, at the least significant bit (LSB) of the 32-bit vector, which indicates whether the number of one bits in the vector is even or odd (Schiller, 2003).
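The XOR-only nature of the parity computation can be seen in the following C sketch, which folds the word onto itself until a single bit remains. The name `parity32` is illustrative; the result is 1 for an odd number of one bits and 0 for an even number, placed in the LSB as described above.

```c
#include <stdint.h>

/* XOR-fold a 32-bit word down to one bit.
 * Returns 1 if the number of set bits is odd, 0 if it is even. */
uint32_t parity32(uint32_t v)
{
    v ^= v >> 16;
    v ^= v >> 8;
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1u;   /* parity bit in the least significant bit */
}
```

Each folding step corresponds to one level of a tree of XOR gates.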

3.3.1 System Definition and Scope

The overall project is comprised of two parts. One is the implementation of a

custom instruction module library, where we implement custom modules for

different operations like CRC, Ones counter, parity etc. The other part is the

implementation of the PR region of the FPGA, which is used to reconfigure the

reconfigurable region according to the requirements.

3.3.2 System Architecture and Components

The overall system is divided into two main regions, the static region and the reconfigurable region, as shown in Figure 14. The static region includes all of the major logic, and the reconfigurable region includes only the custom module. The MIPS CPU is the main controller of the system and fetches instructions from the instruction ROM. The MIPS CPU decodes the instructions and performs the desired operations. When the MIPS CPU encounters an instruction that is not implemented in its datapath, it starts a hardware trap handler and sends it the opcode of the desired operation. The trap handler looks at the opcode and checks whether the desired instruction is already loaded into the custom module; if so, the operation is performed directly. If the desired instruction is not loaded, the configuration manager inside the trap handler loads the partial bitstream using the ICAP primitive, so that a new partial bit file is loaded into the reconfigurable region, and then the operation is performed. The whole process is carried out in hardware to achieve the lowest latency for reconfiguration.
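The decision taken by the trap handler can be modelled in a few lines of C. This is a behavioural sketch only: the type `trap_handler`, the constant `NO_MODULE` and the stand-in `load_partial_bitstream` are hypothetical, and the real design performs these steps in hardware through the ICAP primitive rather than in software.

```c
#include <stdint.h>

#define NO_MODULE 0xFFu   /* assumed marker: nothing loaded yet */

typedef struct {
    uint8_t  loaded_id;         /* CUSTOM ID in the reconfigurable region */
    unsigned reconfigurations;  /* number of partial bitstreams loaded */
} trap_handler;

void trap_init(trap_handler *t)
{
    t->loaded_id = NO_MODULE;
    t->reconfigurations = 0;
}

/* Stand-in for driving the ICAP with a partial bitstream. */
static void load_partial_bitstream(trap_handler *t, uint8_t id)
{
    t->loaded_id = id;
    t->reconfigurations++;
}

/* Handle one trapped custom instruction.
 * Returns 1 if a reconfiguration was needed, 0 if the module was
 * already loaded and the operation can be performed directly. */
int trap_handle(trap_handler *t, uint8_t requested_id)
{
    if (t->loaded_id == requested_id) {
        return 0;
    }
    load_partial_bitstream(t, requested_id);
    return 1;
}
```

Repeated calls with the same CUSTOM ID pay the configuration cost only once, which is why the latency of reconfiguration matters mainly when a programme alternates between different custom instructions.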

[Figure 14 block diagram. Static Region: MIPS CPU, Instruction ROM, Trap handler (State machine, ICAP Controller). Reconfigurable Region: Custom module (reconfigurable). External Memory (outside FPGA).]

Figure 14 The final system design.

The system operates on a 50 MHz clock, which is derived internally from a top-level clock using global buffers (BUFG). These distribute the clock at high speed and provide the least possible skew between the MIPS and the peripherals connected to the bus, which may be physically located far apart.

MIPS Soft-Core Processor

The CPU core is based on the MIPS I instruction set and is built into the system as a soft-core processor. It is used as a platform demonstrator for reconfigurable instruction extensions. Moreover, it is the main module that controls all the other modules, and it raises a trap when a custom instruction exception occurs. Figure 15 gives an overview of the MIPS.

Figure 15 The non-pipelined MIPS, showing the most important signals and logic (Fritzell, 2013).

Peripheral component modules:

– Memory RAM: A static memory that provides write-before-read behaviour. In other words, the data returned during a write cycle is the same as the data being written. The memory module is synthesised into internal block memories in the Spartan-6 FPGA architecture (Doulos.com, 2015).

– GPIO: General-purpose input/output (GPIO), which covers any connection with an input or output pin. The user can control these pins at run-time. GPIO pins such as LEDs and switches are OFF by default (Fritzell, 2013).

– ROM: This module contains the machine code of the instructions, with the ROM’s address used as an index into this memory. The machine code is generated with the help of a GCC cross compiler, which compiles the C code and runs the assembler to produce the binary code that is placed in this array.

– UART: Universal asynchronous receiver/transmitter (UART). A UART module can be added to the system. This unit allows the user to control the operation of the MIPS CPU, the trap handler and other modules, and to check the status of the system. Additionally, the UART module can be used to load the configuration required by the ICAP module.

– System bus: All modules are connected via a baseline bus protocol consisting of a Chip select (CS) input signal, a Write enable (WR_en) input signal, an Address input signal, a Writedata input signal and a Readdata output signal, with the MIPS as the only master module (Fritzell, 2013).
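To make the bus signals and the write-before-read behaviour of the RAM concrete, the following C model treats one function call as one bus cycle. It is a sketch under assumptions: the memory depth, the word addressing with wrap-around, the value returned when the module is not selected, and the names `ram_module` and `ram_bus_cycle` are all hypothetical.

```c
#include <stdint.h>

#define RAM_WORDS 256u   /* assumed depth, not taken from the design */

typedef struct {
    uint32_t mem[RAM_WORDS];
} ram_module;

/* One bus cycle seen by the RAM slave. The arguments mirror the bus
 * signals CS, WR_en, Address and Writedata; the return value plays
 * the role of Readdata. */
uint32_t ram_bus_cycle(ram_module *ram, int cs, int wr_en,
                       uint32_t address, uint32_t writedata)
{
    uint32_t index = address % RAM_WORDS;  /* assumption: word addressed */
    if (!cs) {
        return 0;                          /* assumption: idle value */
    }
    if (wr_en) {
        ram->mem[index] = writedata;       /* write first ... */
    }
    return ram->mem[index];                /* ... so a write cycle returns
                                              the data just written */
}
```

With a single master, no arbitration is needed: the MIPS drives CS for exactly one module per access, chosen from the address.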

Configuration controller module:

The Trap handler

The Trap Handler is a core module located in the static region of the FPGA design. It is directly connected to the MIPS CPU by a bus, and it can easily be modified so that multiple CPUs can use it to load configurations at the desired places and run the operations. Whenever the MIPS CPU encounters an instruction that is not implemented in its datapath, there are two options: either stall or trigger the trap handler. The trap handler is implemented to prevent the CPU from malfunctioning because of a non-implemented instruction.

The ICAP primitive

As we are using a Spartan-6 FPGA, the ICAP primitive (called ICAP_SPARTAN6) is used to initiate the configuration process. It is implemented in the FPGA's fixed logic and can be used to program the FPGA logic under user control. Figure 16 shows the interface diagram of the ICAP Spartan-6 primitive, and Table 3 gives a detailed description of its input and output ports.

[Figure 16 block diagram: the ICAP_SPARTAN6 primitive with ports Clk, CE, WRITE, I[15:0], O[15:0] and Busy.]

Figure 16: ICAP Primitive (Xilinx Inc, 2015).

Table 3 Descriptions of using ICAP_SPARTAN6 Port (Xilinx Inc, 2015).

Custom modules

There are four custom instructions implemented in the design: CRC-32, Ones Counter, Parity flag and Leading zero counter. The concept of each custom instruction is taken from a different source; for example, the CRC generator (Outputlogic.com, 2015) was used to generate the CRC-32 custom instruction module.

Each implemented module is assigned a CUSTOM ID, which distinguishes it from the others. More custom instructions can be implemented and added to the system by assigning a unique CUSTOM ID to each of them, as in Figure 17.

The CUSTOM ID is evaluated by the instruction decoder of the MIPS CPU in order

to run the corresponding module or to trigger the configuration process through

the hardware trap handler.

Figure 17 Custom Module.

Chapter 4

4 Implementation

This chapter discusses the implementation aspect and the technical issues and

the challenges faced.

4.1 Baseline MIPS Soft-Core

Because the implementation of a soft-core is often sophisticated and comes with many design files, the CPU core in this system follows the MIPS implementation proposed by Fritzell (2013), which is contained in a single HDL file. It is modified to support dynamically reconfigurable modules.

Pros and Cons

There are several advantages to the simple implementation style of the MIPS CPU in the system. The MIPS CPU is small, runs at 50 MHz and delivers 50 M instructions per second, which can be more than many micro-controllers achieve. Moreover, the CPU traps a custom instruction if it is not available and returns the correct result automatically to the register file; that is, it is combined with a trap handler that handles the configuration process in a smart way.

One disadvantage is that the CPU becomes an application-specific processor if the customisation extensions are considered, but the MIPS CPU itself is just used as an advanced state machine for the configuration controller. In addition, if the MIPS is still too large for the application, or the application needs a higher execution speed, then reducing the memory size and removing unused instructions could be a solution (Yiannacouras, et al., 2006).

One instruction per cycle

Although most RISC CPUs are five-stage pipelined designs, executing one instruction per cycle in a pipeline requires handling hazards in each stage by adding the corresponding control logic.

The non-pipelined MIPS in this project executes one instruction per cycle without the need for any hazard detection and handling. The VHDL code of the non-pipelined MIPS, in Appendix A, highlights the most important blocks, which are the instruction decoder, register file, ALU and program counter. Together these handle the execution of one instruction in one clock cycle. As a consequence, the MIPS has long propagation delay paths between flip-flops, which need to be minimised in order to achieve a high clock frequency.

To allow the execution of one instruction per cycle, we use a ‘trick’ to avoid waiting one clock cycle for the instruction memory output. The instruction memory receives the address of the next instruction just before the next clock edge. As a result, the instruction word is available at the beginning of the clock cycle in which it is executed. As shown in Figure 18, the next PC address is passed to the instruction memory instead of the PC, because a BRAM is read synchronously and would otherwise add one clock cycle of read delay. In this case timing could be affected, since the address of the next instruction is produced at the end of a long combinational path and still has to meet the setup requirements at the input of the instruction memory.

Figure 18 The Program Counter process overview that consists of extra logic and flip-flops

to handle branch and jump instructions. (Fritzell, 2013).

Delayed Branch

Delayed branch is a technique that is applied to avoid the effect of control dependency “hazards” in a pipelined MIPS, and it is used in the non-pipelined MIPS to handle branch and jump instructions, as already shown in Figure 18. If the branch is taken, the instruction that immediately follows the branch instruction (the delay slot) is executed before branching or jumping to the new address. Extra logic and flip-flops hold the branch address while the delay slot is being executed. Because of this, the instruction placed after a branch or jump in the MIPS code is often a NOP.
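The effect of the delay slot on the order of executed addresses can be illustrated with a two-register program counter model in C. This is a common way to describe MIPS delayed branches and is offered here only as a sketch; the names `pc_state`, `pc_init` and `pc_step` are hypothetical and do not correspond to signals in Figure 18.

```c
#include <stdint.h>

/* Two-register model: pc is the instruction being executed,
 * next_pc is the one that follows it. */
typedef struct {
    uint32_t pc;
    uint32_t next_pc;
} pc_state;

void pc_init(pc_state *s, uint32_t reset_address)
{
    s->pc = reset_address;
    s->next_pc = reset_address + 4;
}

/* Execute one instruction and return its address. If the instruction
 * is a taken branch, the target only becomes next_pc, so the
 * instruction in the delay slot still executes first. */
uint32_t pc_step(pc_state *s, int branch_taken, uint32_t target)
{
    uint32_t executed = s->pc;
    s->pc = s->next_pc;
    s->next_pc = branch_taken ? target : s->pc + 4;
    return executed;
}
```

Starting at address 0 with a taken branch to 0x100, the executed addresses are 0 (the branch), 4 (the delay slot), then 0x100 and 0x104.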

Instruction encoding

The MIPS VHDL code starts by decoding the instruction word, following the instruction set encoding provided in the MIPS32 instruction set reference manual (MIPS Technologies, 2003), in order to provide the data that the ALU operates on. The output result is stored in the register file.

Multi-cycle instructions

Most instructions are implemented in a straightforward manner; that is, they are executed in one clock cycle. However, some instructions create a critical path in the design that could affect the timing and performance. Therefore, they should be implemented as multi-cycle instructions. Examples are the signed and unsigned multiplication and division instructions. Because division can be resource-expensive and is seldom used, the div instruction is treated as an undefined instruction. This is, however, not a problem, as it can be added as a custom instruction, a software function or a multi-cycle instruction if needed.

The multiplication instruction is implemented by enabling the DSP blocks in the synthesis tool. This takes full advantage of the device resources and increases performance by implementing the multiplication in DSP blocks. Figure 19 illustrates the multiplication, which operates on extra registers called HI and LO (a div instruction would use the HI and LO registers too). The constraints editor in ISE was used to constrain the combinatorial paths between the instruction memory output and the HI and LO register inputs as multi-cycle paths in hardware. In addition, a stall signal is used to stall the MIPS CPU during multi-cycle execution. As a result, two clock cycles (or more if needed) are taken when a multiplication instruction is executed. To allow this, the PC and the register file have to be prevented from being updated for one cycle (i.e. the CPU has to be stalled).

Figure 19 Datapath for the multiplication, allowing two clock cycles for execution.

(Fritzell, 2013).

Trap instructions

When the instruction is available, the result will be returned to the register file.

However, when the instruction is not available but is defined as a custom

instruction, then the MIPS CPU will trap this instruction to be processed by the

trap handler.

4.2 Custom Instruction in Software

Custom instructions are supported in software by changing the GCC cross compiler for the MIPS architecture. The encoding of the MIPS I instruction set can be found in C code in the opcodes folder of binutils, which is part of the compiler toolchain.

The mips-opc.c source file defines all the assembly instructions in the MIPS I instruction set, in addition to the range of UDIs (user defined instructions). The format of the UDIs is similar to the format of the R-TYPE instructions that were defined in section 2.2.3. The UDIs therefore share the same opcode and are distinguished by the function field, covering the instruction word range from 0x70000010 to 0x7000001f. In total, 16 individual user instructions are unused, so designers could add up to 16 additional instructions directly to a system.
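The UDI range quoted above can be checked by extracting the opcode and function fields from an instruction word. The sketch below assumes the standard R-TYPE field layout (6-bit opcode, three 5-bit register fields, a 5-bit shift amount and a 6-bit function field); the helper names `opcode_field`, `function_field`, `is_udi` and `encode_udi` are illustrative and are not part of binutils.

```c
#include <stdint.h>

#define OPCODE_UDI 0x1Cu   /* top six bits of 0x70000010 */

unsigned opcode_field(uint32_t word)   { return (word >> 26) & 0x3Fu; }
unsigned function_field(uint32_t word) { return word & 0x3Fu; }

/* A word is in the UDI range when it carries the shared opcode and
 * a function field between 0x10 and 0x1f. */
int is_udi(uint32_t word)
{
    unsigned funct = function_field(word);
    return opcode_field(word) == OPCODE_UDI &&
           funct >= 0x10u && funct <= 0x1Fu;
}

/* Assemble an R-TYPE style word: opcode | rs | rt | rd | shamt=0 | funct. */
uint32_t encode_udi(unsigned funct, unsigned rs, unsigned rt, unsigned rd)
{
    return ((uint32_t)OPCODE_UDI << 26) |
           ((uint32_t)(rs & 31u) << 21) |
           ((uint32_t)(rt & 31u) << 16) |
           ((uint32_t)(rd & 31u) << 11) |
           (uint32_t)(funct & 0x3Fu);
}
```

Masking an encoded word with 0xfc0007ff removes the three register fields and leaves the opcode, shift amount and function field, which is how a match value such as 0x70000010 identifies the instruction regardless of its operands.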

To implement the custom instructions in software, only instruction encodings from the UDI range should be used. By exploiting the similarity with the R-TYPE format, one of the user defined instructions can be modified to use the same format as an R-TYPE instruction such as XOR, in the following steps (Fritzell, 2013):

Coping the XOR instruction:

{"xor", "d,v,t",0x00000026, 0xfc0007ff,WR_d|RD_s|RD_t,0,I1 },

Choosing any UDI instruction such as the following:

{"udi0", "s,t,d,+1", 0x70000010, 0xfc00003f, WR_d|RD_s|RD_t, 0, I33 },

The UDI entry is renamed to CUSTOM and its format is changed to match that
of XOR:

{"custom", "d,v,t", 0x70000010, 0xfc0007ff, WR_d|RD_s|RD_t, 0, I1 },

The modified entry is shown in Figure 20. The GCC cross-compiler is then
recompiled with the new custom instruction.

Figure 20 Adding Custom instruction in the compiler.
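To make the effect of the modified table entry concrete, the following C sketch assembles a `custom rd, rs, rt` instruction word and checks it against the entry. It assumes the usual reading of the opcode table, in which the third field is a match value and the fourth a mask, and an instruction belongs to an entry when the masked word equals the match value. The function names `encode_custom` and `matches` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CUSTOM_MATCH 0x70000010u  /* match value of the "custom" entry */
#define CUSTOM_MASK  0xfc0007ffu  /* mask value of the "custom" entry  */

/* Assemble "custom rd, rs, rt" using the R-TYPE field layout:
 * rs in bits 25..21, rt in bits 20..16, rd in bits 15..11. */
uint32_t encode_custom(unsigned rd, unsigned rs, unsigned rt)
{
    assert(rd < 32 && rs < 32 && rt < 32);
    return CUSTOM_MATCH | ((uint32_t)rs << 21) | ((uint32_t)rt << 16)
                        | ((uint32_t)rd << 11);
}

/* A word belongs to a table entry when the bits selected by the
 * mask are equal to the match value. */
bool matches(uint32_t word, uint32_t match, uint32_t mask)
{
    return (word & mask) == match;
}
```

With the mask 0xfc0007ff, only the opcode, the shift-amount field and the function field are compared, so the three register fields remain free for the operands, exactly as for XOR.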

The following inline assembly is then used inside the C code in order to
invoke the custom instruction from software.

__asm__ ("nop\n\t"

"custom %0, %1, %2\n\t"

:"=r" (z)

:"r" (x), "r" (y));

Note: x and y are the input operands and z is the result.

4.3 Configuration Controller Modules

I. Trap handler

The trap handler is the module that handles the exceptions raised by the MIPS
CPU. The MIPS CPU reads instructions from the instruction ROM, decodes them and
then executes them. If a received instruction is not implemented in the MIPS
CPU, an exception is generated and the MIPS CPU requests the trap handler to
handle it. The operation of the trap handler is controlled by a state machine.
Figure 21 shows the state machine diagram for the trap handler.

[Figure 21: state diagram with states ST0, ST1, ST2 and ST3. Transition labels:
Trap_start = 1; Trap_start = 0; Count = 13; Count < 13; Custom_done = 1;
Custom_done = 0; Trap_start = 1 & Opcode = CUSTOM_ID.]

Figure 21 Trap Handler State Machine.

There are four states in the state machine. ST0 is the reset state, and the
system is normally in this state, waiting for the trap start signal from the MIPS
CPU. When an exception occurs inside the MIPS CPU, it sends the trap start signal
to the trap handler. On reception of this signal, the state machine moves to
either ST1 or ST2. If the requested opcode is equal to the currently loaded
CUSTOM ID, there is no need to load the partial bit file, so the state machine
moves directly to ST2. Otherwise, the state machine moves to ST1, where it sends
the command to the ICAP primitive to load the partial bit file into the custom
module. The configuration process typically takes thousands of cycles, so a
counter is used to monitor the configuration reading signal from ICAP before
going to ST2. In ST2, the trap handler sends a start signal to the custom module,
and in ST3 it waits for the module to complete the operation.
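The state machine described above can be modelled in software as a single step function. The sketch below is a behavioural illustration in C, not the VHDL used in the project. The struct layout, the `requested_id` field and the assumption that ST2 lasts exactly one cycle are hypothetical, and the threshold of 13 is taken from the "Count = 13" label in Figure 21.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { ST0, ST1, ST2, ST3 } trap_state_t;

typedef struct {
    trap_state_t state;
    uint8_t loaded_id;     /* opcode of the module currently configured */
    uint8_t requested_id;  /* opcode that caused the trap (assumed field) */
    unsigned count;        /* counter used while waiting in ST1 */
} trap_handler_t;

#define COUNT_DONE 13u     /* from the "Count = 13" label in Figure 21 */

/* Advance the model by one clock cycle. */
void trap_step(trap_handler_t *h, bool trap_start, uint8_t opcode,
               bool custom_done)
{
    assert(h != 0);
    switch (h->state) {
    case ST0:                       /* idle: wait for a trap */
        if (trap_start) {
            h->requested_id = opcode;
            h->count = 0;
            /* skip reconfiguration if the module is already loaded */
            h->state = (opcode == h->loaded_id) ? ST2 : ST1;
        }
        break;
    case ST1:                       /* reconfigure through ICAP */
        if (h->count == COUNT_DONE) {
            h->loaded_id = h->requested_id;
            h->state = ST2;
        } else {
            h->count++;
        }
        break;
    case ST2:                       /* start the custom module */
        h->state = ST3;
        break;
    case ST3:                       /* wait for the custom module */
        if (custom_done)
            h->state = ST0;
        break;
    }
}
```

A trap for the already loaded opcode goes ST0, ST2, ST3 and back to ST0, while a trap for a different opcode first spends the counting period in ST1.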


Each custom module is assigned a unique opcode and address, which are given in
Table 4.

Custom Module Name      Opcode   Address
CRC-32                  010000   X"100000"
Ones Counter            100001   X"200000"
Parity                  010001   X"300000"
Leading Zero Counter    100000   X"400000"

Table 4 Custom instructions’ address and ID

II. ICAP primitive

There are multiple ways to use the ICAP controller. If a UART connection to a
host machine is used, the ICAP controller depends on the user to initiate a UART
transaction in order to reconfigure the FPGA. A more automatic way is to load the
configurations into SPI flash and fetch them from there. In this project, the
latter option is used. An ICAP primitive is instantiated inside the trap handler
to load the configuration files so that the reconfigurable region is
reprogrammed with the desired logic. As described by Xilinx Inc:

“Spartan-6 FPGAs have dedicated MultiBoot logic, which is used for
both fallback and MultiBoot (IPROG) reconfiguration. When fallback
or IPROG happens, an internally generated pulse resets the entire
configuration logic, except for the dedicated MultiBoot logic. The
IPROG (internal PROGRAM_B) command can be sent through
ICAP_SPARTAN6 or the bitstream” (2015).
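The assignment in Table 4 amounts to a small lookup from opcode to address. A minimal C sketch is given below, assuming that the address column is the start address of the corresponding bitstream in the SPI flash. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint8_t     opcode;      /* 6-bit ID from Table 4 */
    uint32_t    flash_addr;  /* address from Table 4 (assumed flash offset) */
} custom_module_t;

static const custom_module_t MODULES[] = {
    { "CRC-32",               0x10, 0x100000 },  /* 010000 */
    { "Ones Counter",         0x21, 0x200000 },  /* 100001 */
    { "Parity",               0x11, 0x300000 },  /* 010001 */
    { "Leading Zero Counter", 0x20, 0x400000 },  /* 100000 */
};

/* Return the table entry for an opcode, or NULL if it is not a
 * known custom instruction. */
const custom_module_t *find_module(uint8_t opcode)
{
    assert(opcode < 64);  /* the ID is a 6-bit field */
    for (size_t i = 0; i < sizeof MODULES / sizeof MODULES[0]; i++) {
        if (MODULES[i].opcode == opcode)
            return &MODULES[i];
    }
    return NULL;
}
```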


Table 5 An example of bitstream for the IPROG command using ICAP (Xilinx Inc, 2015).

The sequence of commands illustrated in the table above is described in detail
in the Spartan-6 FPGA Configuration User Guide (Xilinx Inc, 2015). After the
IPROG command is sent to the configuration logic, the FPGA resets everything
except the dedicated reconfiguration logic. The bitstream stored at the start
address is then loaded. Thus, the static region is not affected by this
operation.
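Since the contents of Table 5 did not survive in this text, the following C sketch shows how such a command sequence could be assembled for a given start address. The 16-bit word values are an assumption written from memory of the IPROG example in the Spartan-6 configuration user guide and must be checked against the guide before use. The function name `build_iprog` and the `spi_read_opcode` parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define IPROG_WORDS 14

/* Fill out[] with a 16-bit IPROG command sequence (assumed layout,
 * to be verified against the Spartan-6 configuration user guide).
 * boot_addr and fallback_addr are 24-bit flash addresses. */
void build_iprog(uint16_t out[IPROG_WORDS], uint32_t boot_addr,
                 uint32_t fallback_addr, uint8_t spi_read_opcode)
{
    assert(boot_addr <= 0xFFFFFFu && fallback_addr <= 0xFFFFFFu);
    uint16_t op = (uint16_t)((uint16_t)spi_read_opcode << 8);

    out[0]  = 0xFFFF;                               /* dummy word */
    out[1]  = 0xAA99;                               /* sync word */
    out[2]  = 0x5566;                               /* sync word */
    out[3]  = 0x3261;                               /* write GENERAL1 */
    out[4]  = (uint16_t)(boot_addr & 0xFFFFu);      /* start addr [15:0] */
    out[5]  = 0x3281;                               /* write GENERAL2 */
    out[6]  = (uint16_t)(op | (boot_addr >> 16));   /* opcode + addr [23:16] */
    out[7]  = 0x32A1;                               /* write GENERAL3 */
    out[8]  = (uint16_t)(fallback_addr & 0xFFFFu);  /* fallback addr [15:0] */
    out[9]  = 0x32C1;                               /* write GENERAL4 */
    out[10] = (uint16_t)(op | (fallback_addr >> 16));
    out[11] = 0x30A1;                               /* write CMD */
    out[12] = 0x000E;                               /* IPROG command */
    out[13] = 0x2000;                               /* no-op */
}
```

For example, `build_iprog(words, 0x200000, 0x000000, 0x0B)` would request a boot from the address that Table 4 assigns to the ones counter, with 0x0B used here purely as an illustrative SPI read command.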

4.4 Custom Instruction in Hardware

Figure 22 Custom Instruction (CI) acting as an extension of the ALU.

[Figure 22: block diagram. Labels: OP_A, OP_B, CI, ALU, ALU_out, Instruction,
RES.]

Figure 22 shows the MIPS CPU with custom instructions as extensions to the
original ALU. A custom instruction takes one or two 32-bit input operands and
computes one 32-bit output. Adding custom instructions to the system can reduce
the execution time of an application, as mentioned above. Run-time
reconfigurable accelerator modules in a PR region, using a proxy logic approach
for the communication, have been implemented with the GoAhead tools.

Figure 23 illustrates the communication between the static system and the
partially reconfigurable module. Proxy logic is used as the connection primitive,
which is nothing more than a look-up table in route-through mode. It acts as a
placeholder for the non-existing part of the system; that is, it replaces the
partial module when implementing the static system, and it replaces the static
system when implementing a reconfigurable custom instruction accelerator. The
same wires are used for the communication between the static system and the
reconfigurable area.

Figure 23 also shows that the different custom instruction modules use different
logic, but have exactly the same interface to the CPU (including the routing).

Figure 23 On-FPGA Communication for Custom Instructions.

[Figure 23: the static part contains the CPU with the signals OP_A From CPU,
OP_B From CPU and RES To CPU. Proxy logic connects these to the partial
reconfiguration region, where each of four Custom Instruction blocks has the
signals OP_A To CI, OP_B To CI and RES From CI.]

Static System Implementation

A screenshot of the static system is shown in Figure 24. It shows the operand
signals (OP_A, OP_B) on the left side, and the result signal is collected on the
right. Each connection primitive connects four wires from the static part of the
system to the PR region. Consequently, 8 connection primitives are needed for
each of the 32-bit interface signals (OP_A, OP_B and RES).
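The count of connection primitives follows from a ceiling division of the signal width by the number of wires per primitive. A short C sketch of this arithmetic, with the hypothetical helper name `primitives_for`:

```c
#include <assert.h>

/* Number of connection primitives needed for one interface signal,
 * rounding up when the width is not a multiple of the wire count. */
unsigned primitives_for(unsigned signal_width, unsigned wires_per_primitive)
{
    assert(wires_per_primitive > 0);
    return (signal_width + wires_per_primitive - 1) / wires_per_primitive;
}
```

With four wires per primitive, `primitives_for(32, 4)` gives 8 for each signal, so the three signals OP_A, OP_B and RES together need 24 connection primitives.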


Figure 24 Static implementation


Reconfigurable Instructions

The reconfigurable modules are implemented in the absence of the static system,
as shown in the screenshot in Figure 25. For the partial module implementation,
the same connection primitives are used, but on their other side, which is not
yet connected (OP_A to CI, OP_B to CI and RES from CI). Figure 25 shows the CRC
module connecting, through the proxy logic, at the point where the static design
ends. The custom instruction wrapper has been auto-generated by the GoAhead tool.

As the result output is not connected to the outside world (i.e. the path ends
at the connection primitives), the FPGA tools would typically remove all logic
and routing to the output primitive, eventually resulting in an empty design. To
overcome this, all interface signals were given a keep attribute (which is
specific to the Xilinx vendor tools).

Figure 25 Partial Part: the example shows the implementation of the CRC
instruction.

Using the GoAhead tool

GoAhead provides a GUI as well as a scripting interface. A screenshot of the tool
is shown in Figure 26. The GUI is typically used to create scripts, which then
generate all the constraints needed in the system. The generated constraints
serve two important purposes in this system. The first is to prevent the use of
resources in the PR region; in other words, routing is blocked inside the PR
region and no logic primitives are used there. The second is to create the
placement constraints for the connection primitives.

The following steps are used for both implementations (static and partial) with
GoAhead, as illustrated in the screenshot in Figure 26:

1. Load the device description.

2. Define the region in GoAhead. Selecting the elements between 72 and 79 gives
exactly 8 elements, which corresponds to 8 routes.

3. Place connection macros inside the PR region by using the macro placer in
GoAhead.

4. Create the connection primitives in this area. With 4 input wires per
connection primitive, this creates an area of 8 tiles (i.e. CLBs).

5. Block all routing inside the PR region, except for the operand and result
vectors. The blocker is then exported to XDL, which is a Xilinx-specific netlist
format that is not further investigated in this project.

6. Instantiate the connection macros as in Figure 27. The name of the primitive
is "OP_A connect" and it has the input "OP_A from CPU" as the VHDL name.

7. Update the constraint file for the design (UCF file) with the placement
constraints for the PR region that are generated by GoAhead.

In order to generate the bitstream, the static and the partial implementations
have to be merged. This can be done by copying the text descriptions of the XDL
netlists and merging them together.

Figure 26 GoAhead GUI.


Figure 27 GoAhead Script.


4.5 Challenges during Implementation

– The three-month duration of the project was a major challenge, particularly
as the work spanned several phases and tools, each of which took a couple of
weeks to learn.

– Setting up the GCC cross-compiler for MIPS on Windows is not a straightforward
task and takes time.

– The Nexys3 platform that hosts the system does not have external interfaces
such as audio and video, which limits the usability of this device. Moreover,
testing the system was difficult because of the high clock speed; GPIO and UART
are very slow by comparison.

– Because the Nexys3 SPI model is not clear and is not covered in the
documentation, testing the reconfiguration was difficult. I spent a couple of
days trying to run the code on that board, but eventually had to switch to a
different board and change the IO configuration in order to run the code.

– The MultiBoot feature with partial reconfiguration. This is a new approach
that had not been implemented before; I had to go through the literature and
spent a couple of weeks learning this feature.

– With partial reconfiguration, each design has different names for the
primitives, and for that reason it was not completed.

Chapter 5

5 Testing, Results and Evaluation

This chapter presents the simulation and testing of the system, followed by the
results and their evaluation.

5.1 Testing

The whole system is simulated using a test bench in the Xilinx ISE package.
Figure 28 shows the functionality of the MIPS CPU: reading the instructions from
ROM and decoding them, incrementing the address in the program counter and
executing the branch delay.

Figure 28 Test bench of the MIPS CPU and ROM. Panels (a), (b) and (c) present one
test bench and show different signals: (a) instruction encoding, decoding and
ALU functionality; (b) program counter functionality; (c) branch delay and ROM
functionality.

ModelSim Simulation of the custom instruction modules

Simulations of the custom instruction modules were created and the functionality
of the different custom instruction modules was verified. Figure 29 shows the
simulation results of the CRC-32 module. It can be seen that when crc_en is
high, the CRC-32 is generated and output on the crc_out bus.

Figure 29 ModelSim Simulation of CRC-32 Module.

Figure 30 shows the simulation results of the ones counter module. It can be
seen that data is applied to the data_in bus and changes at clock intervals, and
that the corresponding output is generated on the output bus.

Figure 30 ModelSim Simulation of Ones Counter Module.

Figure 31 shows the simulation output of the parity generation module. It can be
seen that data is applied to the data_in bus and changes at clock intervals, and
that as a result the output is generated on the output bus.

Figure 31 ModelSim Simulation of Parity Generation Module.

Figure 32 shows the simulation results of the leading zero counter module. It
can be seen that data is applied to the data_in bus and changes at clock
intervals, and that as a result the output is generated on the output bus.

Figure 32 ModelSim Simulation of Leading Zero Counter Module.

ModelSim Simulation of the Trap handler

Two simulations were performed for the trap handler. The first is for the MUX
based trap handler and the second is for the ICAP based trap handler. Figure 33
shows the simulation output of the MUX based trap handler. It shows how this
module behaves when the opcode and data on its inputs change.

Figure 33 ModelSim Simulation of MUX Based Trap Handler.

Figure 34 shows the simulation of the ICAP based trap handler module. It can be
seen that the state machine starts moving after the trap_start signal, then
sends a command to the ICAP primitive and, when that is complete, starts the
custom module.

Figure 34 ModelSim Simulation of ICAP Based Trap Handler.

Software and Testing

The following C code is compiled for MIPS using the cross-compiler, and the
resulting machine code is stored in the ROM.

/* Read the switches and write the matching pattern to the LEDs. */
#define LEDS_BASE_ADDRESS  0x10001000
#define SWS_BASE_ADDRESS   0x10000010
#define RESET_BASE_ADDRESS 0xBFC00000

int main()
{
    int temp = 0;
    /* Memory-mapped I/O is declared volatile so that the compiler
       does not optimise the accesses away. */
    volatile int *RED_LED  = (volatile int *)LEDS_BASE_ADDRESS;
    volatile int *SWITCHES = (volatile int *)SWS_BASE_ADDRESS;

    while (1) {
        temp = *SWITCHES;
        if (temp == 8)
            *RED_LED = ~0x80;
        else if (temp == 7)
            *RED_LED = ~0x40;
        else if (temp == 6)
            *RED_LED = ~0x20;
        else if (temp == 5)
            *RED_LED = ~0x10;
        else if (temp == 4)
            *RED_LED = ~0x08;
        else if (temp == 3)
            *RED_LED = ~0x04;
        else if (temp == 2)
            *RED_LED = ~0x02;
        else if (temp == 1)
            *RED_LED = ~0x01;
        else
            *RED_LED = ~0x00;
    }
    return 0;
}
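The mapping from switch value to LED pattern in the program above can also be written as a pure function, which makes it testable on a host machine without the memory-mapped registers. This is a sketch only; the name `led_pattern` is hypothetical and the function is not part of the program stored in the ROM.

```c
#include <assert.h>
#include <stdint.h>

/* LED word for a given switch value: for 1..8 exactly one of the low
 * eight bits is cleared, otherwise all bits stay set. */
uint32_t led_pattern(uint32_t sw)
{
    uint32_t pattern = ~UINT32_C(0);
    if (sw >= 1 && sw <= 8)
        pattern = ~(UINT32_C(1) << (sw - 1));
    /* only the low eight LED bits are ever cleared */
    assert((pattern | 0xFFu) == 0xFFFFFFFFu);
    return pattern;
}
```

For a switch value of 8 the function returns the same bit pattern as `~0x80` in the if/else chain, and any value outside 1 to 8 falls through to the all-ones default.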


Test reconfigurable modules:

Testing the reconfiguration process is done by selectively uploading the
configuration bitstreams for the four custom instructions and the difference
bitstream into the SPI storage using iMPACT, as illustrated in Tapp (2010). The
documentation from Xilinx (Configuring Xilinx FPGAs with SPI Serial Flash) shows
the steps in detail. Each configuration bitstream is assigned to a specific
region.

In this project, the Nexys3 development board should have been used for the
test, as it is the host of our system. However, because the Nexys3 SPI model
port was difficult to read and not clearly documented, the Atlys development
board was used for this test.

Because the Nexys3 board does not have external interfaces such as audio and
video, the only input and output for this system is the GPIO module; a UART
module was not considered. Testing using only the GPIO module was therefore not
convenient due to the high clock speed. One way to do the test is to implement
ICAP_SPARTAN6 with the MultiBoot feature in a simple system, such as basic logic
gates (AND, XOR, NOR, etc.) that read the switches as inputs and display the
output on the LEDs, so that the difference is visible when the module is changed
to another logic module.

5.2 Results

The cost of the system resources for the first approach and for the final system
approach is outlined in Table 6.

Approach                   Nr. of LUTs   Nr. of Slices   Latency
MUX based trap handler     246           1798            20.011 ns
ICAP based trap handler    438           1370            18.125 ns

Table 6 Resource requirements for Configuration controller.

The results reveal that the system with the MUX-based trap handler uses fewer
look-up tables than the system with the ICAP-based trap handler, due to the
simpler datapath in the ICAP variant. However, the MUX-based trap handler system
uses more slice resources than the ICAP-based one because it uses more logic for
the custom modules. Finally, the latency is higher for the system based on the
MUX trap handler because of the trap overhead. In the ICAP system,
reconfiguration is only needed when the custom instruction currently configured
in the system is not the desired one. Note that the delay in this table is for
the whole implementation, without considering the reconfiguration overhead. Only
by introducing the ICAP-based trap handler were we able to run the system at the
target 50 MHz clock frequency.
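The link between the latency column of Table 6 and the 50 MHz target can be checked with a short calculation. The C sketch below assumes that the latency is the critical path delay of the whole implementation, so that the maximum clock frequency is its reciprocal. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Maximum clock frequency in MHz for a critical path given in ns. */
double fmax_mhz(double critical_path_ns)
{
    assert(critical_path_ns > 0.0);
    return 1000.0 / critical_path_ns;
}

/* True if a design with this critical path can run at target_mhz. */
bool meets_clock(double critical_path_ns, double target_mhz)
{
    return fmax_mhz(critical_path_ns) >= target_mhz;
}
```

Under this assumption, 20.011 ns corresponds to slightly less than 50 MHz, while 18.125 ns corresponds to roughly 55 MHz, which is consistent with only the ICAP-based system reaching the target frequency.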

For the custom modules, the following table shows the cost of the resources.

Custom Module   Nr. of LUTs   Nr. of Slices   Latency (max/avg) ns   Bitstream size (KB)
CRC 32          43            18              8.038/3.597            282
Counting One    39            19              15.717/14.35           263
Leading Zero    19            15              9.723/3.597            293
Parity (XOR)    7             6               3.618/3.597            282

Table 7 Resource requirements for Custom modules.

The results in Table 7 show the implementation costs for the custom
instructions. In the progress report, manual code optimization was performed in
order to see whether the tools find the optimizations by themselves; the results
showed that they do not. This was taken into account when implementing the
custom modules. Table 7 therefore shows the improved resource use, delay and
bitstream size of each custom module after manual optimisation.

5.3 Evaluation

System performance

The whole system, including the configuration controller, can run at a system
clock of 50 MHz. The first of the two biggest limiting factors is that the MIPS
CPU runs a trap when a custom instruction exception occurs, and traps have a
small additional overhead which would not occur in a baseline MIPS
implementation. The second factor is that the trap handler represents the
configuration controller, which uses external flash memory.

For partial reconfiguration, one important benchmark is the response time of the
reconfiguration process. Swapping instructions takes a significant amount of
time, because the corresponding partial bitstream has to be loaded from an
external SPI memory into the device. Moreover, the bitstream size affects the
speed of the configuration module.
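A rough lower bound on the loading time can be estimated from the bitstream size and the transfer rate of the memory interface. The following C sketch is an illustration only: it assumes that 1 KB is 1024 bytes, ignores command overhead and device start-up time, and treats the clock rate and bus width as free parameters, since the configuration clock of the real system is not stated here. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Estimated time in milliseconds to shift a bitstream of size_kb
 * kilobytes over a bus of bus_bits data lines clocked at clock_mhz. */
double load_time_ms(uint32_t size_kb, double clock_mhz, unsigned bus_bits)
{
    assert(clock_mhz > 0.0 && bus_bits > 0);
    double bits = (double)size_kb * 1024.0 * 8.0;
    double bits_per_ms = clock_mhz * 1000.0 * (double)bus_bits;
    return bits / bits_per_ms;
}
```

As an example under these assumptions, a 282 KB bitstream read over a single data line at a hypothetical 50 MHz would take about 46 ms, which corresponds to more than two million cycles of a 50 MHz system clock.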

In Fritzell (2013), the configuration controller for module relocation was
designed to use two clocks: one running at 50 MHz for the part connected to the
bus, and the other running at 100 MHz for the part that handled the
configuration process. In our system, the trap handler runs at 50 MHz, which
could slow down the configuration speed. Moreover, Fritzell (2013) uses a
decompression module to decompress the configuration data on the FPGA for faster
reconfiguration. Our predicted reconfiguration performance could therefore be
lower than what is achieved in that work.

However, there are some techniques that could be applied to optimize the
performance and cost of the system on the FPGA device. In this project, we used
the FPGA MultiBoot feature, which is slow but uses a serial configuration memory
chip that is underutilized in most FPGA prototyping systems. This also separates
the configuration bitstream storage from other memory, which improves the
security of the system.

Performance Enhancing Techniques

Generally speaking, performance techniques can be divided into techniques that
are not FPGA specific, such as compiler and memory usage optimizations, to name
a few, and techniques that are FPGA specific, such as increasing the operating
frequency. As a rule of thumb, since optimizing configuration speed is a typical
goal, an entire program should rarely be targeted at external memory (Fletcher,
2005). If it is, the use of another clock should be considered in order to speed
up the process.

Comparing the system to a real-world system:

Processor       Processor Type   Device Family Used   Speed Achieved (MHz)
PowerPC™ 405    hard             Virtex-4             450
MicroBlaze      soft             Virtex-II Pro        150
MicroBlaze      soft             Spartan-3            85
MIPS            soft             Spartan-6            50

Table 8 Comparison between Xilinx embedded processors and our soft-core, with
their performance.

The available embedded processors, with the manufacturer's quoted maximum
frequency, and our soft-core including the extension, with its maximum
frequency, are summarized in Table 8. Despite the MIPS processor being the
slowest in that table, it might outperform the others due to the use of custom
instructions.

Hardware acceleration

A soft-core on the FPGA allows the designer to make a trade-off between hardware
and software in order to maximize efficiency and performance. If a software
function is identified as a bottleneck, a custom module can be designed for this
function in the FPGA. The device will then act as a co-processor or, as in our
case, as a custom instruction extension to the soft-core processor.

One way to evaluate the hardware implementation of the custom instructions is to
compare them against software implementations of the same functions running on
the standard ISA of the MIPS CPU. The software functions used as a reference can
be found in (Andersen, 2005). For the software evaluation, those four functions,
written in C, were compiled for MIPS using the GCC cross-compiler. The code was
then disassembled in order to count how many instructions each function
consumes. Table 9 shows how many CPU instructions are saved by using a custom
instruction.

Software function   Instructions
CRC                 262
Hamming weight      262
Leading Zero        294
Parity (XOR)        263

Table 9 Software requirements.
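To give an idea of what such software reference functions look like, the sketch below shows straightforward C versions of the four operations. They are written for illustration in the naive loop style and are not the exact code that was measured for Table 9. The CRC variant assumes the common reflected CRC-32 polynomial 0xEDB88320 with an all-ones initial value and final inversion, which may differ from the polynomial and conventions of the hardware module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hamming weight: number of set bits in a 32-bit word. */
unsigned count_ones(uint32_t v)
{
    unsigned c = 0;
    for (; v != 0; v >>= 1)
        c += (unsigned)(v & 1u);
    return c;
}

/* Parity: 1 if the number of set bits is odd, otherwise 0. */
unsigned parity(uint32_t v)
{
    return count_ones(v) & 1u;
}

/* Number of zero bits above the most significant set bit (32 for 0). */
unsigned leading_zeros(uint32_t v)
{
    unsigned n = 0;
    for (uint32_t m = 0x80000000u; m != 0 && (v & m) == 0; m >>= 1)
        n++;
    return n;
}

/* Bitwise CRC-32 over a byte buffer (assumed polynomial, see above). */
uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

Each of these functions executes a loop with several instructions per bit, whereas the corresponding custom instruction produces its result with a single instruction once the module is configured.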


Chapter 6

6 Conclusions and Future Work

6.1 Conclusions

The system was improved through the lifecycle presented in the methodology. The
final system, after all improvements had been made, meets the objectives
outlined in the introduction chapter. Moreover, learning the concepts and the
fundamental features of FPGAs step by step is the biggest achievement. The
previous chapters described those concepts in detail, along with the necessary
components and tools and the implementation of a fully functional PR system. The
run-time reconfigurable custom instruction set extension of the MIPS CPU can be
replaced dynamically in the system. The most important parts of the implemented
system are:

1. MIPS CPU.

2. Trap handler, including the ICAP primitive.

3. The exploitation of the MultiBoot feature for full and partial
reconfiguration.

6.2 Future Work

There are some improvements that can be made to the final implemented system,
and together these could be considered as the requirements analysis stage for
the next lifecycle.

– In this project the Nexys3 has been used as a platform. However, the lack of
external interfaces limited the usability of this device. Another academic board
that includes audio and video could show the input and the output of the system
and would allow a complete digital system to be built around the soft-core
processor.

– The MIPS CPU that is used as the soft-core is a very simple processor; it is
non-pipelined and uses BRAM as both program memory and data memory. This could
be improved by implementing a pipelined processor and a simple cache controller
connected to DDR memory. As a result, executing larger programs and storing
large data structures such as frame buffers would become possible.

– The system uses the MultiBoot feature and the command sequence that is sent
through the ICAP primitive to support the read-back of configuration data from
ICAP. However, there are two different ways of reading and writing the
configuration data when implementing the ICAP interface. As illustrated in
(Fritzell, 2013), “Either clock is left toggling and clock enable is used to
control throughput, or clock enable is kept high and the clock signal is
controlled to achieve wanted throughput”.

– Adding more advanced modules for communication over the COM port.

– Measuring the number of clock cycles taken by the reconfiguration by logging a
counter in the trap handler, which would reflect the number of clock cycles from
the time the counter starts until it is stopped.

– The Nexys3 board has a seven-segment display; it could be exploited for
testing.

– Different benchmarks could be used to evaluate the soft-core on the FPGA. The
most standard benchmark is Dhrystone MIPS (DMIPS), and its result could then be
compared with the results we achieved with our system.

Works Cited

Andersen, S. E., 2005. Bit Twiddling Hacks. [Online]

Available at:

http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive

[Accessed 31 August 2015].

Beckhoff, C., Koch, D. & Torresen, J., 2012. Go ahead: A partial reconfiguration

framework. Field-Programmable Custom Computing Machines (FCCM), 2012

IEEE 20th Annual International Symposium, pp. 37-44.

Bobda, C., 2007. Introduction to Reconfigurable Computing: Architectures,
Algorithms, and Applications. s.l.:Springer.

Bobda, C., 2008. Introduction to Reconfigurable Computing. Netherlands:

Springer .

Digilent, 2013. Nexys3™ Board Reference Manual. [Online]

Available at: https://www.digilentinc.com/Data/Products/NEXYS3/Nexys3_rm.pdf

Doulos.com, 2015. Simple Ram Model. [Online]

Available at:

https://www.doulos.com/knowhow/vhdl_designers_guide/models/simple_ram_model/


Elkateeb, A., 2011. A Processor Design Course Project: Creating Soft-Core MIPS

Processor Using Step-by-Step Components' Integration Approach. International

Journal of Information and Education Technology, 1(5), pp. 432-440.

Fletcher, B., 2005. FPGA Embedded Processors Revealing True System
Performance. In: Embedded Training Program Embedded Systems Conference.
[Online]
Available at:
http://www.xilinx.com/products/design_resources/proc_central/resource/ETP-367paper.pdf

Fritzell, A., 2013. A System for Fast Dynamic Partial Reconfiguration using
GoAhead Design and Implementation. Masters Thesis: University of Oslo.

Galuzzi, C. & Bertels, K., 2011. The Instruction-Set Extension Problem: A Survey.

ACM Transactions on Reconfigurable Technology and Systems. article 18, 4(2).

Gebotys, C. H., 2012. A network flow approach to memory bandwidth utilization in

embedded DSP core processors. IEEE Transactions On Very Large Scale

Integration (Vlsi) Systems, 10(4), pp. 390-398.

Hansen, S. G., Koch, D. & Torresen, J., 2011. High speed partial runtime

reconfiguration using enhanced icap hard macro. In: Parallel and Distributed

Processing Workshops and Icap hard macro. Shanghai: IEEE, pp. 174-180.

Hauck, S., 1998. Configuration prefetch for single context reconfigurable

coprocessors. In: Proceedings of the 1998 ACM/SIGDA sixth international

symposium on Field programmable gate arrays. New York: ACM, pp. 65-74.

Hauck, S. & Wilson, W. D., 1999. Run Length Compression Techniques for FPGA

Configurations. Napa Valley, IEEE.

Jo, J., 2013. 6 Basic Phases of Software Development Life Cycle (SDLC). [Online]

Available at: http://www.techknol.net/2013/04/software-development-life-cycle.html


Koch, D., 2013. Partial Reconfiguration on FPGAs: Architectures, Tools and

Applications. New York: Springer.

Koch, D., Beckhoff, C. & Torresen, J., 2010. Zero logic overhead integration of

partially reconfigurable modules. Proceedings of the 23rd symposium on

Integrated circuits and system design, pp. 103-108.

Kozyrakis, C. E. & Patterson, D. A., 2004. Scalable, vector processors for

embedded systems. Micro, IEEE, 23(6), pp. 36-45.

Kuon, I. & Rose, J., 2007. Measuring the Gap Between FPGAs and ASICs. IEEE

Transactions on Computer-Aided Design of Integrated Circuits and Systems,

26(2), pp. 203-215.

Lysaght, P. & Subrahmanyam, P. A., 2005. Guest Editors’ Introduction: Advances
in Configurable Computing. IEEE Design & Test of Computers, 22(2), pp. 85-89.

Miller, J., 2004. The Chicago guide to writing about numbers. Chicago: University

of Chicago Press.

Minev, P. B. & Kukenska, V. S., 2007. Implementation of Soft-core Processors in

FPGAs. Gabrovo, International Scientific Conference.

MIPS Technologies, 2003. MIPS32™ Architecture For Programmers Volume II:

The MIPS32™ Instruction Set. [Online]

Available at: http://www.cs.cornell.edu/courses/cs3410/2008fa/mips_vol2.pdf


OutputLogic.com, 2013. OutputLogic.com. [Online]

Available at: http://outputlogic.com/


Pittman, R. N., Lynch, N. L. & Forin, A., 2006. eMIPS, A Dynamically Extensible

Processor, Redmond: Microsoft Research.

Synopsys, 2010. SiliconBlue Selects Synopsys as FPGA Synthesis Partner for Its

iCE65 mobileFPGA Family. [Online]

Available at: http://news.synopsys.com/index.php?s=20295&item=123144

[Accessed 30 March 2015].

Tapp, S., 2010. Configuring Xilinx FPGAs with SPI Serial Flash. 1st ed. [ebook]
Xilinx Inc. [Online]

Available at:

http://www.xilinx.com/support/documentation/application_notes/xapp951.pdf

[Accessed 1 September 2015].

Wold, A., Koch, D. & Torresen, J., 2012. Design techniques for increasing

performance and resource utilization of reconfigurable soft CPUs. s.l., IEEE, pp.

50-55.

Xilinx Inc, 2011. Spartan-6 FPGA Block RAM Resources User Guide. [Online]

Available at: http://www.xilinx.com/support/documentation/user_guides/ug383.pdf


Xilinx Inc, 2015. Spartan-6 FPGA Configuration User Guide. [Online]

Available at: http://www.xilinx.com/support/documentation/user_guides/ug380.pdf


Xilinx, 2012. Partial Reconfiguration User Guide. [Online]

Available at:

http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/ug702.pdf


Xilinx, 2013. ISE Design Suite. [Online]

Available at: http://www.xilinx.com/products/design-tools/ise-design-suite.html

[Accessed 1 May 2015].

Yiannacouras, P., Steffan, J. G. & Rose, J., 2006. Application-Specific

Customization of Soft Processor Microarchitecture. Proceedings of the 2006

ACM/SIGDA 14th international symposium on Field programmable gate arrays,

pp. 201-210.


Appendix A - MIPS CPU

library ieee;

use ieee.std_logic_1164.all;

use ieee.numeric_std.all;

use std.textio.all;

entity MIPS_CPU is

port (

clk : in std_logic;

reset : in std_logic;

WaitRequest : in std_logic;

D_write_en : out std_logic;

D_read_en : out std_logic;

I_ADR : out std_logic_vector (31 downto 0);

I_DATA : in std_logic_vector (31 downto 0);

D_ADR : out std_logic_vector (31 downto 0);

D_W_DATA : out std_logic_vector (31 downto 0);

D_R_DATA : in std_logic_vector (31 downto 0);

RES_0 : in std_logic_vector(31 downto 0);

opCode : out std_logic_vector(5 downto 0);

OP_A_c : out std_logic_vector(31 downto 0);

OP_B_c : out std_logic_vector(31 downto 0);

trap_start : out std_logic;

OP_A : out std_logic_vector(31 downto 0);

OP_B : out std_logic_vector(31 downto 0));

end MIPS_CPU;

architecture a_MIPS_CPU of MIPS_CPU is

type Instruction_type_type is (Undefined,R_type, ADDI, ADDIU, SLTI,

SLTIU, ANDI, ORI, XORI, LUI, J, BNE, BEQ, load, store, JAL, BRANCHES,

BGTZ, BLEZ, I_type_special_2);

type PC_type is (Normal, branchs,Jumbs);

signal Instruction_type : Instruction_type_type;

signal PCstate : PC_type;

signal local_D_write_en : std_logic;

signal rs, rt, rd, sa : std_logic_vector(4 downto 0);

signal W_ADR : std_logic_vector(4 downto 0);

signal R_DATA_A, R_DATA_B : std_logic_vector(31 downto 0);

signal ALU_out : std_logic_vector(31 downto 0);

signal ALU_out64 : std_logic_vector(63 downto 0);

signal PC, PC4, nextPC, branchPC,jumbpc : std_logic_vector(31 downto 0);

signal RegFile_en : std_logic;

signal instr : std_logic_vector(5 downto 0);

signal funct : std_logic_vector(5 downto 0);

signal immediate, SL2immediate : std_logic_vector(31 downto 0);

signal immediateU : std_logic_vector(31 downto 0);

signal immediateJ : std_logic_vector(27 downto 0);

signal BranchTaken, branching : std_logic;

signal JumpTaken, JumpTakenJR : std_logic;

signal idata : std_logic_vector(31 downto 0);

signal HI, LO : std_logic_vector(31 downto 0);

signal WaitRequest_i : std_logic;

signal WaitRequest_comb : std_logic;

signal mul_wait : std_logic;

signal MTHI, MTLO : std_logic;

signal mul_taken : std_logic;

type memtype is array (31 downto 0) of std_logic_vector(31 downto 0);

signal RegFile : memtype := (others => (others => '0'));

begin

OP_A_c <= R_DATA_A;

OP_B_c <= R_DATA_B;

opCode <= funct;

--------------------------------------

--------REGISTER_FILE ---------------

-------------------------------------

p_write : process (clk)

begin

if clk'event and clk = '1' then

if WaitRequest_comb = '1' then

if RegFile_en = '1' and (W_ADR /= (W_ADR'range => '0')) then

if (Instruction_type = load) then

RegFile(to_integer(unsigned(W_ADR))) <= D_R_DATA;

else

RegFile(to_integer(unsigned(W_ADR))) <= ALU_out;

end if;

end if; -- RegFileEnable

end if; --Waitrequest

end if; --clk

end process;

R_DATA_A <= RegFile(to_integer(unsigned(rs)));

R_DATA_B <= RegFile(to_integer(unsigned(rt)));

OP_A <= R_DATA_A;

OP_B <= R_DATA_B;

----------------------------------------------

------------INSTRUCTION_DECODER----------------

-----------------------------------------------

--setting idata to correct signals:

idata <= I_DATA;

funct <= idata(5 downto 0);

instr <= idata(31 downto 26);

rs <= idata(25 downto 21);

rd <= idata(15 downto 11);

rt <= idata(20 downto 16);

sa <= idata(10 downto 6);

--Immediate sign extended:

immediate(31 downto 16) <= (others => idata(15));

immediate(15 downto 0) <= idata(15 downto 0);

--Immediate unsigned:

immediateU <= x"0000" & idata(15 downto 0);

--Jump offset:

immediateJ <= idata(25 downto 0) & "00";

--Immediate sign extended and leftshift 2:

SL2immediate <= immediate(29 downto 0) & "00";

--Decoding instructions:

p_INS_DECODER : process (instr)

begin

----------R_TYPE

if (std_match(instr, "000000")) then Instruction_type <=

R_type;

elsif (std_match(instr, "011100")) then Instruction_type <=

I_type_special_2; -- I-type instruction SPECIAL 2 custom instruction

---------I_TYPE

elsif (std_match(instr, "001001")) then Instruction_type <= ADDIU;

elsif (std_match(instr, "001000")) then Instruction_type <= ADDI;

elsif (std_match(instr, "001011")) then Instruction_type <= SLTIU;

elsif (std_match(instr, "001100")) then Instruction_type <= ANDI;

elsif (std_match(instr, "001101")) then Instruction_type <= ORI;

elsif (std_match(instr, "001110")) then Instruction_type <= XORI;

elsif (std_match(instr, "001111")) then Instruction_type <= load; -- LUI (handled in the load case)

elsif (std_match(instr, "001010")) then Instruction_type <= SLTI; -- slti

elsif (std_match(instr, "101011")) then Instruction_type <= store; -- store instruction

elsif (std_match(instr, "101000")) then Instruction_type <= store; -- store byte instruction

elsif (std_match(instr, "100011")) then Instruction_type <= load; -- load instruction

-- elsif (std_match(instr, "100000")) then Instruction_type <= load;

-- load byte instruction

---------BRANCHES

elsif (std_match(instr, "000100")) then Instruction_type <= BEQ;

elsif (std_match(instr, "000101")) then Instruction_type <= BNE;

elsif (std_match(instr, "000111")) then Instruction_type <= BGTZ;

elsif (std_match(instr, "000110")) then Instruction_type <= BLEZ;

elsif (std_match(instr, "000001")) then Instruction_type <=

BRANCHES; -- BLTZ,BGEZ,BGEZAL,BLTZAL

--------J_TYPE

elsif (std_match(instr, "000010")) then Instruction_type <= J; -- jump instruction

elsif (std_match(instr, "000011")) then Instruction_type <= JAL; -- jal (jump and link)

else Instruction_type <= Undefined;

report " +++ unimplemented instruction type !! ";

end if;

end process;

----------------------------------------------

WaitRequest_i <= '0' when (mul_taken = '1' and mul_wait = '0') else

'1';

WaitRequest_comb <= WaitRequest and WaitRequest_i;

---------------------- multiplication instructions take 2 cycles

InstMulreg : process(clk)

begin

if rising_edge(clk) then

if MTHI = '1' then

HI <= R_DATA_A;

elsif mul_wait = '1' then

HI <= ALU_out64(63 downto 32);

end if;

if MTLO = '1' then

LO <= R_DATA_A;

elsif mul_wait = '1' then

LO <= ALU_out64(31 downto 0);

end if;

if mul_taken = '1' and mul_wait = '0' then

mul_wait <= '1';

else

mul_wait <= '0';

end if;

end if;

end process;

-------------------------for sending the trap in case of custom instructions

process(Instruction_type, funct)

begin

if(Instruction_type = I_type_special_2)then

if (funct = "010000" or funct = "010001" or funct =

"100000" or funct = "100001") then --I:CUST

trap_start <= '1';

else

trap_start <= '0';

end if;

else

trap_start <= '0';

end if;

end process;

-------------------------------------------

------------- ALU -------------------------

--------------------------------------------

D_write_en <= local_D_write_en;

D_ADR <= ALU_out;

D_W_DATA <= R_DATA_B;

------

p_ALU: process (PC, hi, lo, WaitRequest_comb ,RES_0,

Instruction_type, funct, instr, rt, rd, rs, sa, immediate, immediateU,

SL2immediate, R_DATA_A, R_DATA_B, ALU_out, ALU_out64, W_ADR)

begin

--initialising values:

ALU_out <= (others => '0');

ALU_out64 <= (others => '0');

JumpTaken <= '0';

BranchTaken <= '0';

W_ADR <= (others => '0');

RegFile_en <= '0';

D_read_en <= '0';

local_D_write_en <= '0';

JumpTakenJR <= '0';

MTHI <= '0';

MTLO <= '0';

mul_taken <= '0';

case Instruction_type is

when R_type => RegFile_en <= '1';

W_ADR <= rd;

case funct is

when B"00_00_00" => ALU_out <=

std_logic_vector(unsigned(R_DATA_B) SLL to_integer(unsigned(sa))); --I:SLL

when B"00_00_10" => ALU_out <=

std_logic_vector(unsigned(R_DATA_B) SRL to_integer(unsigned(sa))); --I:SRL

-- variable shifts use only the low 5 bits of rs as the shift amount

when B"00_01_10" => ALU_out <=

std_logic_vector(unsigned(R_DATA_B) SRL to_integer(unsigned(R_DATA_A(4 downto 0)))); --I:SRLV

when B"00_01_00" => ALU_out <=

std_logic_vector(unsigned(R_DATA_B) SLL to_integer(unsigned(R_DATA_A(4 downto 0)))); --I:SLLV

-- shift_right on a signed operand replicates the sign bit (arithmetic shift)

when B"00_00_11" => ALU_out <=

std_logic_vector(shift_right(signed(R_DATA_B), to_integer(unsigned(sa)))); --I:SRA

when B"00_01_11" => ALU_out <=

std_logic_vector(shift_right(signed(R_DATA_B), to_integer(unsigned(R_DATA_A(4 downto 0))))); --I:SRAV

when B"10_10_10" => if signed(R_DATA_A) < signed(R_DATA_B)

then --I:SLT

ALU_out <= x"00000001"; --I:SLT

else --I:SLT

ALU_out <= (others => '0'); --I:SLT

end if; --I:SLT

when B"10_10_11" => if unsigned(R_DATA_A) <

unsigned(R_DATA_B) then --I:SLTU

ALU_out <= x"00000001"; --I:SLTU

else --I:SLTU

ALU_out <= (others => '0'); --I:SLTU

end if; --I:SLTU

when B"10_00_01" => ALU_out <=

std_logic_vector(unsigned(R_DATA_A) + unsigned(R_DATA_B)); --I:ADDU

when B"10_00_00" => ALU_out <=

std_logic_vector(signed(R_DATA_A) + signed(R_DATA_B)); --I:ADD

when B"10_00_10" => ALU_out <=

std_logic_vector(signed(R_DATA_A) - signed(R_DATA_B)); --I:SUB

when B"10_00_11" => ALU_out <=

std_logic_vector(unsigned(R_DATA_A) - unsigned(R_DATA_B)); --I:SUBU

when B"10_01_00" => ALU_out <= R_DATA_A and R_DATA_B; --I:AND

when B"10_01_01" => ALU_out <= R_DATA_A or R_DATA_B; --I:OR

when B"10_01_10" => ALU_out <= R_DATA_A xor R_DATA_B; --I:XOR

when B"10_01_11" => ALU_out <= R_DATA_A nor R_DATA_B; --I:NOR

when B"01_00_00" => ALU_out <= HI; --I:MFHI

when B"01_00_10" => ALU_out <= LO; --I:MFLO

when B"01_00_01" => MTHI <= '1'; --I:MTHI

when B"01_00_11" => MTLO <= '1'; --I:MTLO

when B"00_10_00" => JumpTakenJR <= '1'; --I:JR

RegFile_en <= '0'; --I:JR

when B"00_10_01" => ALU_out <= std_logic_vector(unsigned(PC)

+ 8); --I:JALR

JumpTakenJR <= '1'; --I:JALR

when B"00_10_11" => ALU_out <= R_DATA_A; --I:MOVN

if R_DATA_B = x"00000000" then --I:MOVN

RegFile_en <= '0'; --I:MOVN

end if; --I:MOVN

when B"00_10_10" => ALU_out <= R_DATA_A; --I:MOVZ

if R_DATA_B /= x"00000000" then --I:MOVZ

RegFile_en <= '0'; --I:MOVZ

end if; --I:MOVZ

when B"01_10_00" => mul_taken <= '1'; --I:MULT

ALU_out64 <=

std_logic_vector(signed(R_DATA_A) * signed(R_DATA_B));

when B"01_10_01" => mul_taken <= '1'; --I:MULTU

ALU_out64 <=

std_logic_vector(unsigned(R_DATA_A) * unsigned(R_DATA_B));

when others => report " +++ unimplemented instruction type !! ";

end case;

----------------------------------------------I_TYPE

when ADDIU => RegFile_en <= '1'; --I:ADDIU

W_ADR <= rt; --I:ADDIU

ALU_out <= std_logic_vector(unsigned(R_DATA_A) +

unsigned(immediate)); --I:ADDIU

when ADDI => RegFile_en <= '1'; --I:ADDI

W_ADR <= rt; --I:ADDI

ALU_out <= std_logic_vector(signed(R_DATA_A) +

signed(immediate)); --I:ADDI

when SLTIU =>RegFile_en <= '1'; --I:SLTIU

W_ADR <= rt; --I:SLTIU

-- SLTIU sign-extends the immediate, then compares as unsigned

if unsigned(R_DATA_A) < unsigned(immediate) then --I:SLTIU

ALU_out <= (0 => '1', others => '0'); --I:SLTIU

else --I:SLTIU

ALU_out <= (others => '0'); --I:SLTIU

end if; --I:SLTIU

when SLTI =>RegFile_en <= '1'; --I:SLTI

W_ADR <= rt; --I:SLTI

if signed(R_DATA_A) < signed(immediate) then --I:SLTI

ALU_out <= (0 => '1', others => '0'); --I:SLTI

else --I:SLTI

ALU_out <= (others => '0'); --I:SLTI

end if; --I:SLTI

when ANDI =>RegFile_en <= '1'; --I:ANDI

W_ADR <= rt; --I:ANDI

ALU_out <= R_DATA_A and immediateU; --I:ANDI

when ORI =>RegFile_en <= '1'; --I:ORI

W_ADR <= rt; --I:ORI

ALU_out <= R_DATA_A or immediateU; --I:ORI

when XORI =>RegFile_en <= '1'; --I:XORI

W_ADR <= rt; --I:XORI

ALU_out <= R_DATA_A xor immediateU; --I:XORI

-------------------------- load instruction

when load =>

RegFile_en <= '1';

W_ADR <= rt;

case instr is

when B"10_00_11" =>

ALU_out <= std_logic_vector(signed(immediate) + signed(R_DATA_A)); --I:LW

D_read_en <= '1'; --I:LW

when B"10_00_00" =>

ALU_out <= std_logic_vector(signed(immediate) + signed(R_DATA_A)); --I:LB

D_read_en <= '1'; --I:LB

when B"00_11_11" =>

ALU_out <= immediate(15 downto 0) & X"0000"; --I:LUI

when others => report " +++ unimplemented load instruction !! ";

end case;

------------------------------------ store instruction

when store =>

case instr is

-- ALU_out == address

-- address = base + offset; base = register rs (bits 25-21), offset = immediate (bits 15-0)

when B"10_10_11" => ALU_out <=

std_logic_vector(signed(immediate) + signed(R_DATA_A)); --I:SW

local_D_write_en <= '1'; --I:SW, assert the data-memory write enable

when B"10_10_00" => ALU_out <=

std_logic_vector(signed(immediate) + signed(R_DATA_A)); --I:SB

local_D_write_en <= '1'; --I:SB, assert the data-memory write enable

report " +++ store byte executed as store word !! ";

when others => report " +++ unimplemented store instruction !! ";

end case;

---------------------------------------------JUMP

when J =>JumpTaken <= '1'; --I:J

when JAL =>RegFile_en <= '1'; --I:JAL

W_ADR <= "11111"; --I:JAL

ALU_out <= std_logic_vector(unsigned(PC) + 8); --I:JAL

JumpTaken <= '1'; --I:JAL

------------------------------------------CUSTOMS

when I_type_special_2 =>

RegFile_en <= '1'; --I:?

W_ADR <= rd; --I:?

if (funct = "010000" or funct = "010001" or funct = "100000"

or funct = "100001") then --I:CUST

ALU_out <= RES_0; --I:CUST

else

report " +++ not custom instruction type !! ";

end if;

-------------------------------------------BRANCHES

when BNE => if R_DATA_A /= R_DATA_B then --I:BNE

BranchTaken <= '1'; --I:BNE

end if; --I:BNE

when BEQ => if R_DATA_A = R_DATA_B then --I:BEQ

BranchTaken <= '1'; --I:BEQ

end if; --I:BEQ

when BGTZ =>if signed(R_DATA_A) > x"00000000" then --I:BGTZ

BranchTaken <= '1'; --I:BGTZ

end if; --I:BGTZ

when BLEZ =>if signed(R_DATA_A) <= x"00000000" then --I:BLEZ

BranchTaken <= '1'; --I:BLEZ

end if; --I:BLEZ

when BRANCHES => if rt = "00000" then --I:BLTZ

if signed(R_DATA_A) < x"00000000" then --I:BLTZ

BranchTaken <= '1'; --I:BLTZ

end if; --I:BLTZ

elsif rt = "00001" then --I:BGEZ

if signed(R_DATA_A) >= x"00000000" then --I:BGEZ

BranchTaken <= '1'; --I:BGEZ

end if; --I:BGEZ

elsif rt = "10001" then --I:BGEZAL

RegFile_en <= '1'; --I:BGEZAL, the link register is written

W_ADR <= "11111"; --I:BGEZAL

ALU_out <= std_logic_vector(unsigned(PC) + 8); --I:BGEZAL

if signed(R_DATA_A) >= x"00000000" then --I:BGEZAL

BranchTaken <= '1'; --I:BGEZAL

end if; --I:BGEZAL

elsif rt = "10000" then --I:BLTZAL

RegFile_en <= '1'; --I:BLTZAL, the link register is written

W_ADR <= "11111"; --I:BLTZAL

ALU_out <= std_logic_vector(unsigned(PC) + 8); --I:BLTZAL

if signed(R_DATA_A) < x"00000000" then --I:BLTZAL

BranchTaken <= '1'; --I:BLTZAL

end if; --I:BLTZAL

end if;

-------------------------------------------

when Undefined =>report " +++ undefined instruction !! ";

when others =>report " +++ unimplemented instruction type !! ";

end case;

end process;

-----------------------------------------------

-------------------PROGRAM-COUNTER-------------

-----------------------------------------------

nextPC <= PC4 when PCstate = Normal else

branchPC when PCstate = branchs else

JumbPC when PCstate = Jumbs else

PC4;

------------------------------------------

I_ADR <= nextPC when WaitRequest_comb = '1' else PC;

PC4 <= std_logic_vector(unsigned(PC) + 4);

process (clk)

begin

if clk'event and clk = '1' then

if WaitRequest_comb = '1' then

if reset = '1' then

PCstate <= Normal;

PC <= X"BFC00000"; ---MIPS reset address

branchPC <= X"BFC00000"; ---MIPS reset address

JumbPC <= X"BFC00000"; ---MIPS reset address

else

PC <= nextPC;

---------------------------

case PCstate is

-- "Normal" state of PC:

when Normal =>

--If a branch is taken:

if BranchTaken = '1' then

PCstate <= branchs;

branchPC <= std_logic_vector(signed(PC4) +

signed(SL2immediate));

--If a jump is taken:

elsif JumpTaken = '1' then

PCstate <= Jumbs;

JumbPC <= PC4(31 downto 28) & immediateJ;

-- If a jump from register is taken:

elsif JumpTakenJR = '1' then

PCstate <= Jumbs;

JumbPC <= R_DATA_A;

else

PCstate <= Normal;

end if;

-- branch and jump states of PC:

when branchs =>

PCstate <= Normal;

when Jumbs=>

PCstate<=Normal;

when others =>

PCstate <= Normal;

end case;-- case PCstate

--------------------------------

end if;--reset

end if;--wait

end if;--clk

end process;

-------------------------------

end;
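As an informal cross-check of the instruction decoder in this appendix, the field slicing can be mirrored in software. The following is a minimal Python sketch and is not part of the design itself: the function name decode_fields and the dictionary layout are assumptions made for illustration, while the bit ranges are taken from the signal assignments in the listing above (instr, rs, rt, rd, sa, funct, immediate, immediateU, immediateJ and SL2immediate).

```python
def decode_fields(word):
    """Split a 32-bit instruction word into the fields used by the decoder.

    Hypothetical helper: keys are named after the VHDL signals they mirror.
    """
    word &= 0xFFFFFFFF
    imm16 = word & 0xFFFF
    # 'immediate': bit 15 is copied into bits 31..16 (sign extension)
    immediate = (imm16 | 0xFFFF0000) if (imm16 & 0x8000) else imm16
    return {
        "instr": (word >> 26) & 0x3F,          # idata(31 downto 26)
        "rs": (word >> 21) & 0x1F,             # idata(25 downto 21)
        "rt": (word >> 16) & 0x1F,             # idata(20 downto 16)
        "rd": (word >> 11) & 0x1F,             # idata(15 downto 11)
        "sa": (word >> 6) & 0x1F,              # idata(10 downto 6)
        "funct": word & 0x3F,                  # idata(5 downto 0)
        "immediate": immediate,
        "immediateU": imm16,                   # x"0000" & idata(15 downto 0)
        "immediateJ": (word & 0x03FFFFFF) << 2,            # 28-bit jump offset
        "SL2immediate": (immediate << 2) & 0xFFFFFFFF,     # immediate(29 downto 0) & "00"
    }
```

For example, decode_fields(0x70000010) gives instr = 011100 and funct = 010000, the combination for which the listing asserts trap_start.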

Appendix B - Trap handler based on MUX

library IEEE;

use IEEE.STD_LOGIC_1164.ALL;

----------------------------------------------

entity trapHandler is

Port (

clk : IN std_logic;

-- clk100 : in std_logic;

reset : IN std_logic;

address : IN std_logic_vector(31 DOWNTO 0);

opcode : in std_logic_vector (5 downto 0);

writedata : IN std_logic_vector(31 DOWNTO 0);

commandIn : in STD_LOGIC_VECTOR (31 downto 0);

readdata : OUT std_logic_vector(31 DOWNTO 0);

WaitRequest : in std_logic

);

end trapHandler;

architecture Behavioral of trapHandler is

component custom1_module is

port ( data_in : in std_logic_vector (31 downto 0);

crc_en , reset, clk : in std_logic;

crc_out : out std_logic_vector (31 downto 0));

end component;

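------------------------------------------

component custom2_module is

port ( data_in : in std_logic_vector (31 downto 0);

one_out : out std_logic_vector (31 downto 0));

end component;

------------------------------------------

component custom3_module is

port ( data_in : in std_logic_vector (31 downto 0);

parity_out : out std_logic_vector (31 downto 0));

end component;

----------------------------------------------

component custom4_module is

port ( data_in : in std_logic_vector (31 downto 0);

zero_out : out std_logic_vector (31 downto 0));

end component;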

----------------------------------------------

signal reg :std_logic_vector(31 downto 0);

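signal sel1 : std_logic;

signal sel2 : std_logic;

signal sel3 : std_logic;

signal sel4 : std_logic;

signal readdata1 : std_logic_vector(31 downto 0);

signal readdata2 : std_logic_vector(31 downto 0);

signal readdata3 : std_logic_vector(31 downto 0);

signal readdata4 : std_logic_vector(31 downto 0);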

-------------------------------------------------

begin

reg<=writedata; --op_A and writedata

inst1: custom1_module PORT MAP(

data_in => reg,

crc_en => sel1,

reset => reset,

clk => clk,

crc_out => readdata1

);

---------------------------------------------

inst2 : custom2_module PORT MAP(

data_in => reg,

one_out => readdata2

);

---------------------------------------------

inst3 : custom3_module PORT MAP(

data_in => reg,

parity_out => readdata3

);

----------------------------------------------

inst4 : custom4_module PORT MAP(

data_in => reg,

zero_out => readdata4

);

-------------------------------------

sel1 <= '1' when (opcode = "010000") else '0';




--------------------------------------

process(readdata1, readdata2, readdata3, readdata4, sel1, sel2, sel3,

sel4)

begin

if(sel1 = '1')then

readdata <= readdata1;

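elsif(sel2 = '1')then

readdata <= readdata2;

elsif(sel3 = '1')then

readdata <= readdata3;

elsif(sel4 = '1')then

readdata <= readdata4;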

else

readdata <= (others => '0');

end if;

end process;

end Behavioral;

Appendix C - Trap handler based on ICAP

----------------------------------------------------------------------------------

library IEEE;

use IEEE.STD_LOGIC_1164.ALL;

use ieee.std_logic_unsigned.all;

entity trapHandler is

Port (

opcode : in std_logic_vector (5 downto 0);

dataIn : in STD_LOGIC_VECTOR (31 downto 0);

commandIn : in STD_LOGIC_VECTOR (31 downto 0);

trap_start : in std_logic;

clk : in STD_LOGIC;

rst : in STD_LOGIC;

dataOut : out STD_LOGIC_VECTOR (31 downto 0));

end trapHandler;

architecture Behavioral of trapHandler is

component custom_module is

port ( data_in : in std_logic_vector (31 downto 0);

start , rst, clk : in std_logic;

CustomInstID : out std_logic_vector (5 downto 0);

done : out std_logic;

data_out : out std_logic_vector (31 downto 0));

end component;

component ICAP_SPARTAN6 is

port (

clk : in std_logic;

ce : in std_logic;

WRITE : in std_logic;

I : in std_logic_vector(15 downto 0);

O : out std_logic_vector(15 downto 0);

busy : out std_logic

);

end component;

signal start : std_logic;

signal custom_done : std_logic;

TYPE st IS (st0, st1, st2, st3);

SIGNAL currentState, nextState: st;

signal command_register : std_logic_vector(223 downto 0);

signal command_register_reg : std_logic_vector(223 downto 0);

signal MB_StartAddr : std_logic_vector(23 downto 0);

signal FB_StartAddr : std_logic_vector(23 downto 0);

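constant MB_StartAddr1 : std_logic_vector(23 downto 0):= X"100000";

constant MB_StartAddr2 : std_logic_vector(23 downto 0):= X"200000";

constant MB_StartAddr3 : std_logic_vector(23 downto 0):= X"300000";

constant MB_StartAddr4 : std_logic_vector(23 downto 0):= X"400000";

constant FB_StartAddr1 : std_logic_vector(23 downto 0):= X"100000";

constant FB_StartAddr2 : std_logic_vector(23 downto 0):= X"200000";

constant FB_StartAddr3 : std_logic_vector(23 downto 0):= X"300000";

constant FB_StartAddr4 : std_logic_vector(23 downto 0):= X"400000";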

signal custom_start : std_logic;

signal icap_datain : std_logic_vector(15 downto 0);

signal icap_dataout : std_logic_vector(15 downto 0);

signal icap_busy : std_logic;

signal icap_write : std_logic;

signal count : std_logic_vector(3 downto 0);

signal opcode1 : std_logic_vector(7 downto 0):= X"00";

signal opcode2 : std_logic_vector(7 downto 0):= X"00";

signal CustomInstID : std_logic_vector(5 downto 0);

begin

--instantiate the ICAP module

ICAP_inst: ICAP_SPARTAN6

port map(

clk => clk,

ce => (not rst),

WRITE => icap_write,

I => icap_datain,

O => icap_dataout,

busy => icap_busy

);

--select ICAP write or read command

icap_write <= '1' when (currentState = st1) else '0';

--send the data to the ICAP module from the command_register_reg

icap_datain <= command_register_reg(223 downto 208);

--implement a shift register to hold the command that needs to be sent to the ICAP module

process(clk,rst)

begin

if(rst = '0') then

command_register_reg <= (others => '0');

elsif(rising_edge(clk))then

if(currentState = st1)then

--shift left, 16 places

command_register_reg <= command_register_reg(207 downto 0) &

command_register_reg(223 downto 208);

else

command_register_reg <= command_register;

end if;

end if;

end process;

--command, that is to be sent to the ICAP module

command_register <= X"FFFF" & X"AA99" & X"5566" & X"3261" &

MB_StartAddr(15 downto 0) & X"3281" & opcode1 & MB_StartAddr(23 downto 16) &

X"32A1" & FB_StartAddr(15 downto 0) & X"32C1" & opcode2 &

FB_StartAddr(23 downto 16) & X"30A1" & X"000E" & X"2000";

--Master bitstream address selection on the basis of the opcode

MB_StartAddr <= MB_StartAddr1 when (opcode = "010000") else

MB_StartAddr2 when (opcode = "100001") else



(others => '0');

--Feedback bitstream address selection on the basis of the opcode

FB_StartAddr <= FB_StartAddr1 when (opcode = "010000") else

FB_StartAddr2 when (opcode = "100001") else



(others => '0');

--assign nextState to the currentState on the clock edge

process(clk,rst)

begin

if(rst = '0') then

currentState <= st0;

elsif(rising_edge(clk))then

currentState <= nextState;

end if;

end process;

--decide nextState on the basis of currentState, count, trap_start, opcode, CustomInstID and custom_done

process(currentState, count, trap_start, custom_done, opcode, CustomInstID)

begin

case (currentState) is

--st0 is the reset state; here it waits for the trap_start signal

when st0 =>

if(trap_start = '1')then

--if the currently loaded custom instruction is the same as the required one, go to st2

--else go to st1

if(opcode = CustomInstID)then

nextState <= st2;

else

nextState <= st1;

end if;

else

nextState <= ST0;

end if;

when st1 =>

--in st1, the command is sent to the ICAP module over 14 clock cycles

--here the counter is checked; if it is equal to 13, move to st2

if(count = "1101")then

nextState <= ST2;

else

nextState <= ST1;

end if;

when st2 =>

nextState <= st3;

when st3 =>

--now start the custom module, to run the custom command

if(custom_done = '1')then

nextState <= st0;

else

nextState <= st3;

end if;

when others =>

nextState <= st0;

end case;

end process;

--implement a counter, which is used while sending the command to the ICAP module

process(clk,rst)

begin

if(rst = '0') then

count <= (others => '0');

elsif(rising_edge(clk))then

-- if currentState is st1, then count

if(currentState = st1)then

count <= count + '1';

else

count <= (others => '0');

end if;

end if;

end process;

--instantiate the custom instruction module

inst1: custom_module PORT MAP(

CustomInstID => CustomInstID,

data_in => dataIn,

start => start,

rst => rst,

clk => clk,

done => custom_done,

data_out => dataOut

);

--start the custom module, when state = st3

custom_start <= '1' when (currentState = st3) else '0';

start <= custom_start when ((opcode = "010000") or (opcode =

"100001") or (opcode = "010001") or (opcode = "100000")) else '0';

end Behavioral;
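The order in which the command words reach the ICAP port can be checked with a small software model. The following Python sketch is illustrative only: the function names icap_command_words, pack_command_register and words_sent_in_st1 are hypothetical, the 16-bit word values are copied from the command_register assignment in the listing above, and no claim is made here about the meaning of the individual words. The model assumes the 224-bit register is rotated left by 16 bits on each clock cycle spent in st1, with the top 16 bits driving icap_datain, as in the shift-register process.

```python
from typing import List


def icap_command_words(mb_addr: int, fb_addr: int,
                       opcode1: int = 0x00, opcode2: int = 0x00) -> List[int]:
    """Return the 14 words of command_register, most significant word first."""
    return [
        0xFFFF, 0xAA99, 0x5566,
        0x3261, mb_addr & 0xFFFF,                                   # MB_StartAddr(15 downto 0)
        0x3281, ((opcode1 & 0xFF) << 8) | ((mb_addr >> 16) & 0xFF), # opcode1 & MB_StartAddr(23 downto 16)
        0x32A1, fb_addr & 0xFFFF,                                   # FB_StartAddr(15 downto 0)
        0x32C1, ((opcode2 & 0xFF) << 8) | ((fb_addr >> 16) & 0xFF), # opcode2 & FB_StartAddr(23 downto 16)
        0x30A1, 0x000E,
        0x2000,
    ]


def pack_command_register(words: List[int]) -> int:
    """Concatenate the words into one 224-bit integer (first word on top)."""
    reg = 0
    for w in words:
        reg = (reg << 16) | (w & 0xFFFF)
    return reg


def words_sent_in_st1(reg: int, cycles: int = 14) -> List[int]:
    """Model icap_datain over the cycles spent in st1."""
    mask = (1 << 224) - 1
    sent = []
    for _ in range(cycles):
        sent.append((reg >> 208) & 0xFFFF)         # command_register_reg(223 downto 208)
        reg = ((reg << 16) & mask) | (reg >> 208)  # rotate left by 16 bits
    return sent
```

With the start address X"100000" used for the first custom instruction in the listing, the fifth word is 0x0000 and the seventh is 0x0010, and the 14 words leave the register in the order in which they are concatenated.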
